reducers under nested iterations

Hello, I am doing some SpMV-related work and exploring the use of Cilk Plus. I have a question about reducers that I could not answer from reading the documentation. In short: is there a simple or performant way to declare a logical set of reducers, or a reducer 'holder', such that an inner cilk_for uses its own reducer hyperobject, without the outer cilk_for having to share the same hyperobject across all of its strands?

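Consider the following C99 + Cilk Plus loop, which computes a sparse binary matrix-vector multiplication for eight vectors simultaneously. A is stored in CSR form through row_ptr and cols, and X and Y hold the eight vectors interleaved, with eight consecutive doubles per row: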

  cilk_for(int row = 0; row < A->nrow; row++) {
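    double tmp[8] = {0};                 // one partial sum per right-hand-side vector
    for (int i = row_ptr[row]; i < row_ptr[row + 1]; i++) {
      int col = cols[i] << 3;            // multiplying by 8: start of row cols[i] in X
      tmp[:] += X[col:8];
    }
    int r = row << 3;                    // start of this row's 8 outputs in Y
    Y[r:8] = tmp[:];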
  }
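Readers without a Cilk Plus compiler can still run the loop above, since GCC dropped Cilk Plus in version 8. Turn cilk_for into for and expand each array section into an explicit 8-iteration loop, which gives plain C99. The sketch below is a serial reference for checking a parallel version against. The csr_pattern struct and the spmv8_serial name are hypothetical. X and Y are assumed to be row-major with eight doubles per row, as the << 3 indexing implies.

```c
/* Hypothetical container for a binary (pattern-only) CSR matrix: every stored
   entry is implicitly 1, so only the row pointers and column indices exist. */
typedef struct {
    int nrow;
    const int *row_ptr; /* nrow + 1 entries */
    const int *cols;    /* row_ptr[nrow] entries */
} csr_pattern;

/* Y = A * X for eight right-hand sides at once. X and Y are assumed row-major
   with 8 doubles per row, so entry (j, v) lives at index 8*j + v. */
void spmv8_serial(const csr_pattern *A, const double *X, double *Y)
{
    for (int row = 0; row < A->nrow; row++) {        /* cilk_for in the post */
        double tmp[8] = {0};
        for (int i = A->row_ptr[row]; i < A->row_ptr[row + 1]; i++) {
            int col = A->cols[i] << 3;               /* first of the 8 entries of X row cols[i] */
            for (int v = 0; v < 8; v++)              /* tmp[:] += X[col:8] */
                tmp[v] += X[col + v];
        }
        int r = row << 3;
        for (int v = 0; v < 8; v++)                  /* Y[r:8] = tmp[:] */
            Y[r + v] = tmp[v];
    }
}
```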

Now suppose one wants to further parallelize the inner loop. Even in OpenMP 4.0 I run into trouble here, because I cannot name an array in #pragma omp reduction(+:tmp). Similarly, I cannot use the built-in opadd reducer in Cilk Plus, because double[8] is not a simple numeric type. So I created a custom add reducer for a double[8] vector:

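#include <cilk/reducer.h>

typedef double vdp8[8]; // assumed definition (not shown in the post): 8 doubles, one per vector

void reduce_vecsum(void* reducer, void* left, void* right) {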
  vdp8* vl = (vdp8*)left;
  vdp8* vr = (vdp8*)right;
  (*vl)[:] += (*vr)[:];
}
void identity_vecsum(void* reducer, void* v) {
  (*(vdp8*)v)[:] = 0;
}
CILK_C_DECLARE_REDUCER(vdp8) cilk_c_vecsum_reducer =
  CILK_C_INIT_REDUCER(vdp8,
                      reduce_vecsum,
                      identity_vecsum,
                      __cilkrts_hyperobject_noop_destroy,
                      {0,0,0,0,0,0,0,0});
vdp8* vecsum_view() {
  return (vdp8*)REDUCER_VIEW(cilk_c_vecsum_reducer);
}
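The contract these callbacks must satisfy can be checked without Cilk at all. As I understand the Cilk Plus model, identity initializes a fresh view for a strand that needs one. When strands join, reduce(left, right) folds the serially later view into the earlier one. The runtime may group those reductions differently depending on where steals occur, so the result matches the serial loop only if the operation is associative.

The sketch below ports the two callbacks to plain C, using loops instead of array notation. The _c suffix and the accumulate helper are hypothetical. It checks that two different groupings of three views agree with a single serial accumulation. Note that double addition is not strictly associative, so with non-integer data a real parallel run may differ from the serial sum in the last bits.

```c
typedef double vdp8[8]; /* assumed definition of the post's vdp8 */

/* Plain-C port of reduce_vecsum: same signature, fold right into left.
   The reducer argument is unused, as in the original. */
void reduce_vecsum_c(void *reducer, void *left, void *right)
{
    (void)reducer;
    vdp8 *vl = (vdp8 *)left;
    vdp8 *vr = (vdp8 *)right;
    for (int k = 0; k < 8; k++) (*vl)[k] += (*vr)[k];
}

/* Plain-C port of identity_vecsum: a fresh view starts at zero. */
void identity_vecsum_c(void *reducer, void *v)
{
    (void)reducer;
    for (int k = 0; k < 8; k++) (*(vdp8 *)v)[k] = 0.0;
}

/* Add the X rows selected by cols[lo..hi) into an existing view, like the
   post's inner loop body (*vtmp)[:] += X[col:8]. X is row-major, 8 per row. */
void accumulate(vdp8 *view, const int *cols, int lo, int hi, const double *X)
{
    for (int i = lo; i < hi; i++) {
        int col = cols[i] << 3;
        for (int k = 0; k < 8; k++) (*view)[k] += X[col + k];
    }
}
```

In the usage below, views a and d each cover the same three chunks of the nonzeros. They are reduced as (a + b) + c and as d + (e + f), and both must equal the serial result.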

The big question is: given the determinacy guarantees of reducers, would it be correct to do the following?

  CILK_C_REGISTER_REDUCER(cilk_c_vecsum_reducer); //runtime starts managing thread-local views, no need for manual tmp[]
  cilk_for (int row = 0; row < A->nrow; row++) {
    vdp8 *vsum;
    vsum = vecsum_view();
    (*vsum)[:] = 0;
    cilk_for (int i = row_ptr[row]; i < row_ptr[row + 1]; i++) {
      vdp8 *vtmp;
      int col = cols[i] << 3; // multiplying by 8
      vtmp = vecsum_view(); // grab a local view, will reduce automatically on task/strand joins
      (*vtmp)[:] += X[col:8];
    }
    int r = row << 3;
    vsum = vecsum_view(); // grab the sum
    Y[r:8] = (*vsum)[:];  // commit to the output vector
    // note: outer cilk_for will perform further reductions, although we do not need the result
  }
  CILK_C_UNREGISTER_REDUCER(cilk_c_vecsum_reducer);

My main worry is that a steal of strands from the inner cilk_for might cause the sums of two different rows to become mingled. A secondary worry is the overhead of the superfluous reductions performed at the joins of the outer cilk_for loop. In other words: is the above correct, and is there a better way to do something similar?

Note: I am using the above only as an example; in reality, parallelizing the inner loop is complete overkill. However, I am working on a blocked version that has the same nested loop structure, with the same need to reduce into the output vector.
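One alternative avoids a shared hyperobject altogether; it is offered here as an assumption, not as an answer from this thread. Each row gets its own stack-allocated accumulator, and the inner loop is parallelized by recursive halving, with the two halves' buffers merged after the sync. The sketch is plain serial C so it compiles anywhere. Comments mark where cilk_spawn, cilk_sync and cilk_for would go. The names row_sum_dc and spmv8_nested and the GRAIN cutoff of 64 are hypothetical.

```c
/* Sum the X rows listed in cols[lo..hi) into out[0..7]. X is row-major with
   8 doubles per row. Every call owns its buffers, so rows never share state. */
void row_sum_dc(const int *cols, int lo, int hi, const double *X, double out[8])
{
    enum { GRAIN = 64 };                       /* hypothetical serial cutoff */
    if (hi - lo <= GRAIN) {
        for (int k = 0; k < 8; k++) out[k] = 0.0;
        for (int i = lo; i < hi; i++) {
            int col = cols[i] << 3;
            for (int k = 0; k < 8; k++) out[k] += X[col + k];
        }
        return;
    }
    int mid = lo + (hi - lo) / 2;
    double left[8], right[8];
    row_sum_dc(cols, lo, mid, X, left);        /* Cilk Plus: cilk_spawn row_sum_dc(...); */
    row_sum_dc(cols, mid, hi, X, right);       /* runs in the continuation */
                                               /* Cilk Plus: cilk_sync; */
    for (int k = 0; k < 8; k++) out[k] = left[k] + right[k];
}

/* Y = A * X for 8 vectors, A given as CSR pattern (row_ptr, cols). */
void spmv8_nested(int nrow, const int *row_ptr, const int *cols,
                  const double *X, double *Y)
{
    for (int row = 0; row < nrow; row++)       /* Cilk Plus: cilk_for */
        row_sum_dc(cols, row_ptr[row], row_ptr[row + 1], X, &Y[row << 3]);
}
```

The split points depend only on lo and hi, so the floating-point grouping is the same on every run, whatever the scheduler does. It is not, however, the strict left-to-right order of the serial loop. The price is one pair of 8-double buffers per recursion level, instead of reducer views created on steals.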
