Added checks on output vector in mxv #401

Open
GiovaGa wants to merge 6 commits into develop from 400-dense_output_vector_not_checked_mxv

Conversation

@GiovaGa
Collaborator

@GiovaGa GiovaGa commented Nov 12, 2025

Resolves #400

@anyzelman
Member

The MR fixes it for reference and reference_omp, but I thought you mentioned all other backends also run into this issue @GiovaGa ?

@anyzelman anyzelman added the bug ("Something isn't working") label Nov 24, 2025
@anyzelman anyzelman added this to the v0.8 milestone Nov 24, 2025
@GiovaGa
Collaborator Author

GiovaGa commented Nov 25, 2025

You are definitely right. I have now also fixed it for the nonblocking backend.
The other backends call the same internal function as the reference backend, so this should cover them as well.

@anyzelman anyzelman force-pushed the 400-dense_output_vector_not_checked_mxv branch from 778c24f to e4c0a49 Compare December 23, 2025 02:34
anyzelman
anyzelman previously approved these changes Dec 23, 2025
@anyzelman
Member

Running CI, and running all unit & smoke tests with LPF. This looks ready to merge if both are OK. Draft release notes:

Prior to this MR, calling mxv or vxm with the dense descriptor while the input or output vector was not dense did not return ILLEGAL on every supported backend. This MR fixes that, and also ensures that ILLEGAL is returned when one of the vector masks is non-empty (size larger than 0) and sparse. Furthermore, this MR adds a unit test to guard against regressions.

Thanks to @GiovaGa for spotting the bug and providing the fixes for the reference, reference_omp, and nonblocking backends!
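
The checks described in the draft release notes can be modelled roughly as follows. This is only a sketch under stated assumptions: RC, ToyVector, checkDenseDescriptor, and selfCheck are hypothetical stand-ins invented for illustration and are not the ALP types or the code in this MR; the split into an output mask and an input mask is likewise an assumption about what "one of the vector masks" refers to.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for illustration only; these are NOT the ALP types.
enum class RC { SUCCESS, ILLEGAL };

struct ToyVector {
	std::size_t size; // number of elements the vector can hold
	std::size_t nnz;  // number of elements currently stored
	bool isDense() const { return nnz == size; }
};

// Models the checks for a call that was made with the dense descriptor:
// every vector must be dense, and a mask only counts if its size is nonzero.
inline RC checkDenseDescriptor(
	const ToyVector &out, const ToyVector &in,
	const ToyVector &outMask, const ToyVector &inMask
) {
	if( !out.isDense() || !in.isDense() ) {
		return RC::ILLEGAL;
	}
	// a mask of size 0 is treated as "no mask given" (assumption)
	if( outMask.size > 0 && !outMask.isDense() ) {
		return RC::ILLEGAL;
	}
	if( inMask.size > 0 && !inMask.isDense() ) {
		return RC::ILLEGAL;
	}
	return RC::SUCCESS;
}

// Small usage example of the model above.
inline void selfCheck() {
	const ToyVector dense{ 4, 4 }, sparse{ 4, 1 }, noMask{ 0, 0 };
	assert( checkDenseDescriptor( dense, dense, noMask, noMask ) == RC::SUCCESS );
	assert( checkDenseDescriptor( sparse, dense, noMask, noMask ) == RC::ILLEGAL );
	assert( checkDenseDescriptor( dense, dense, sparse, noMask ) == RC::ILLEGAL );
}
```

In this model a sparse output vector alone is enough to yield ILLEGAL, which corresponds to the case reported in #400.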

Collaborator Author

@GiovaGa GiovaGa left a comment

Seems good to me.
My only observation is that it may make sense to add an assert in exec_tests to guarantee that at least one vector is indeed not dense: going quickly through the test, this is not obvious, and the function exec_tests does not specify such a precondition.
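
A minimal sketch of such a precondition check is shown below. The names VectorState, anyNotDense, and execTestsSketch are hypothetical and the signature of the real exec_tests is not shown in this thread; in the actual test the sizes and nonzero counts would presumably be queried from the ALP vectors rather than passed in as plain numbers.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Hypothetical summary of a vector: its size and its number of stored elements.
struct VectorState {
	std::size_t size;
	std::size_t nnz;
};

// Returns true if at least one of the given vectors is not dense.
inline bool anyNotDense( std::initializer_list< VectorState > vectors ) {
	for( const VectorState &v : vectors ) {
		if( v.nnz < v.size ) {
			return true;
		}
	}
	return false;
}

// Hypothetical stand-in for exec_tests: it states its precondition up front.
inline void execTestsSketch(
	const VectorState &out, const VectorState &in, const VectorState &mask
) {
	assert( anyNotDense( { out, in, mask } ) &&
		"precondition: at least one vector must not be dense" );
	// ... the calls that are expected to return ILLEGAL would follow here ...
}
```

With an assert like this, a test batch that accidentally passes only dense vectors fails immediately in debug builds instead of silently checking nothing.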

@GiovaGa
Collaborator Author

GiovaGa commented Jan 6, 2026

Right now, the test fails when running with 16 processes (and only in that case).

@GiovaGa
Collaborator Author

GiovaGa commented Jan 28, 2026

It seems that the test fails only if n is small enough. With n = 1000, the test runs correctly on my machine. This is probably a bug in the bsp1d backend.

@GiovaGa GiovaGa force-pushed the 400-dense_output_vector_not_checked_mxv branch from 49cf9db to 2dc77dd Compare January 28, 2026 10:27
@anyzelman
Member

Example of a failed run:

$ /scratch/workspace/alp-build/install/bin/grbrun -b bsp1d -np 16 tests/unit/illegal_spmv_debug_bsp1d 148
This is functional test tests/unit/illegal_spmv_debug_bsp1d
Info: grb::init (BSP1D) called using 16 user processes.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: grb::init (reference) called.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Info: process mask is all-one, we therefore assume a single user process is present on this node and thus shall use aligned mode for memory allocations that are potentially touched by multiple threads.
Error: unexpected error code in grb::setElement (BSP1D): Uninterpretable error code detected, please notify the developers.. Please submit a bug report.
Error: unexpected error code in grb::setElement (BSP1D): Uninterpretable error code detected, please notify the developers.. Please submit a bug report.
Test batch 5-8: initialisation FAILED
Test batch 5-8: initialisation FAILED
Info: grb::finalize (bsp1d) called.
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.
Launching test FAILED
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.
Info: grb::finalize (reference) called.

Larger problem sizes do not fail.

@anyzelman
Member

anyzelman commented Feb 5, 2026

The following similarly fails (P=11, n=102 -- all larger n are OK):

/scratch/workspace/alp-build/install/bin/grbrun -b bsp1d -np 11 tests/unit/illegal_spmv_debug_bsp1d 102

@anyzelman
Member

anyzelman commented Feb 5, 2026

More minimal one that fails:

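#include <iostream>

#include <graphblas.hpp>

// Minimal reproducer: setting a single element of out fails on bsp1d.
void grb_program( const size_t &n, grb::RC &rc ) {
        // out2 is never used, but the failure only occurs when it is declared
        grb::Vector< bool > out( n ), out2( 2 * n );
        if( grb::setElement( out, true, 0 ) != grb::SUCCESS ) {
                std::cout << "FAILED\n";
                rc = grb::FAILED;
        } else {
                std::cout << "OK\n";
                rc = grb::SUCCESS;
        }
        std::cout << std::endl;
}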

Failure here occurs "already" for P=7 and n=64 (this is the smallest P, n that causes failure). What is particularly interesting is that declaring out2 is needed, and that its 2x size is also "mandatory": otherwise there would be no failure. This points to a shared-buffer corruption or a more general memory-corruption problem.

Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

grb::mxv<dense> with sparse output returns grb::SUCCESS

2 participants