Hello, I am doing some SpMV-related work and exploring the use of CilkPlus. I have a question about reducers that I could not answer from the documentation. In short: is there a simple or performant way of declaring a logical set of reducers, or a reducer 'holder', such that an inner cilk_for uses its own reducer hyperobject, without the outer cilk_for having to share the same hyperobject across all of its strands?
Consider the following C99-CilkPlus loop, which computes a sparse binary matrix-vector multiplication for eight vectors simultaneously:
cilk_for (int row = 0; row < A->nrow; row++) {
    double tmp[8] = {0};
    for (int i = row_ptr[row]; i < row_ptr[row + 1]; i++) {
        int col = cols[i] << 3;
        tmp[:] += X[col:8];
    }
    int r = row << 3;
    Y[r:8] = tmp[:];
}
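For checking results, it may help to have a serial version of the same kernel in plain C99, with the array-notation statements expanded into explicit loops over the eight components. This is only a sketch: the function name `spmv8_reference` and the parameter list are made up for illustration, and the interleaved layout (component k of entry c stored at index 8*c + k) is read off the indexing in the loop above.

```c
/* Serial plain-C99 reference for the kernel above: binary CSR matrix
 * times eight interleaved vectors. spmv8_reference is a hypothetical
 * name; X[8*col + k] and Y[8*row + k] hold component k of an entry. */
void spmv8_reference(int nrow, const int *row_ptr, const int *cols,
                     const double *X, double *Y)
{
    for (int row = 0; row < nrow; row++) {
        double tmp[8] = {0};
        for (int i = row_ptr[row]; i < row_ptr[row + 1]; i++) {
            int col = cols[i] << 3;          /* 8 * column index */
            for (int k = 0; k < 8; k++)
                tmp[k] += X[col + k];        /* tmp[:] += X[col:8] */
        }
        int r = row << 3;                    /* 8 * row index */
        for (int k = 0; k < 8; k++)
            Y[r + k] = tmp[k];               /* Y[r:8] = tmp[:] */
    }
}
```

Comparing the output of any parallel variant against this function on a small matrix is a cheap way to spot rows whose sums have been mixed.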
Now consider the case where one would want to further parallelize the inner loop. Even in OpenMP 4.0 I get into trouble here, as I cannot declare an array in #pragma omp reduction(+:tmp). Similarly, I cannot use the built-in opadd reducer in CilkPlus, as double[8] is not a simple numeric datatype, so I created a custom add reducer for a double[8] vector type, vdp8:
void reduce_vecsum(void* reducer, void* left, void* right) {
    vdp8* vl = (vdp8*)left;
    vdp8* vr = (vdp8*)right;
    (*vl)[:] += (*vr)[:];
}

void identity_vecsum(void* reducer, void* v) {
    (*(vdp8*)v)[:] = 0;
}

CILK_C_DECLARE_REDUCER(vdp8) cilk_c_vecsum_reducer =
    CILK_C_INIT_REDUCER(vdp8, reduce_vecsum, identity_vecsum,
                        __cilkrts_hyperobject_noop_destroy,
                        {0,0,0,0,0,0,0,0});

vdp8* vecsum_view() {
    return (vdp8*)REDUCER_VIEW(cilk_c_vecsum_reducer);
}
The big question is, considering the determinacy guarantees of reducers, would it be correct to do the following:
CILK_C_REGISTER_REDUCER(cilk_c_vecsum_reducer); // runtime starts managing thread-local views, no need for manual tmp[]
cilk_for (int row = 0; row < A->nrow; row++) {
    vdp8 *vsum;
    vsum = vecsum_view();
    (*vsum)[:] = 0;
    cilk_for (int i = row_ptr[row]; i < row_ptr[row + 1]; i++) {
        vdp8 *vtmp;
        int col = cols[i] << 3;   // multiplying with 8
        vtmp = vecsum_view();     // grab a local view, will reduce automatically on task/strand joins
        (*vtmp)[:] += X[col:8];
    }
    int r = row << 3;
    vsum = vecsum_view();         // grab the sum
    Y[r:8] = (*vsum)[:];          // commit to the output vector
    // note: outer cilk_for will perform further reductions, although we do not need the result
}
CILK_C_UNREGISTER_REDUCER(cilk_c_vecsum_reducer);
My worry is that a steal of strands from the inner cilk_for might cause the sums of two different rows to become mingled. The secondary worry is the overhead of performing superfluous reductions of the reducer at the joins of the outer cilk_for loop. In other words, is the above correct and is there a better way of doing something similar?
Note: I am using the above as an example; in reality, parallelizing the inner loop is complete overkill. However, I am working on a blocked version that has the same nested loop structure, with the same need to reduce into the output vector.
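One fallback that avoids a shared hyperobject altogether is to give each chunk of the inner range its own private accumulator and combine the accumulators serially once the chunks are done. The sketch below shows the idea in plain serial C; it does not settle whether the nested-reducer version is correct. The names `row_sum8_chunked` and `NCHUNK` are hypothetical, and the chunk count of 4 is an arbitrary placeholder.

```c
enum { NCHUNK = 4 };   /* hypothetical chunk count, chosen arbitrarily */

/* Sum the eight-wide X blocks selected by cols[begin..end) into out[0..7]
 * using one private accumulator per chunk instead of a reducer. */
void row_sum8_chunked(const int *cols, int begin, int end,
                      const double *X, double *out)
{
    double partial[NCHUNK][8] = {{0}};
    int n = end - begin;

    /* Iteration c writes only partial[c], so the iterations are
     * independent; under the assumption that Cilk Plus is available,
     * this loop is the candidate for cilk_for. */
    for (int c = 0; c < NCHUNK; c++) {
        int lo = begin + (int)((long)n * c / NCHUNK);
        int hi = begin + (int)((long)n * (c + 1) / NCHUNK);
        for (int i = lo; i < hi; i++) {
            int col = cols[i] << 3;          /* 8 * column index */
            for (int k = 0; k < 8; k++)
                partial[c][k] += X[col + k];
        }
    }

    /* Combine the partial sums in fixed chunk order, so the summation
     * order does not depend on scheduling. */
    for (int k = 0; k < 8; k++) {
        double s = 0.0;
        for (int c = 0; c < NCHUNK; c++)
            s += partial[c][k];
        out[k] = s;
    }
}
```

Because every row calls this with its own `partial` array on the stack, partial sums from different rows cannot meet, and the outer loop has nothing left to reduce. The cost is a fixed chunk count rather than the load balancing a reducer would get from the scheduler.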