See also

To follow this section, you need to understand the CUDA programming model.

Lockstep Programming Model

Section author: René Widera, Axel Huebl

The lockstep programming model structures code that is evaluated collectively and independently by workers (physical threads). Actual processing is described by one-dimensional index domains of virtual workers which can even be changed within a kernel. Mathematically, index domains are non-injective, total functions on physical workers.

An index domain is independent of the data but can be mapped to a data domain, e.g. one-to-one or with more complex mappings.

Code implemented with the lockstep programming model is free of any dependency between the number of workers and the number of processed data elements. To simplify the implementation, each index within a domain can be seen as a virtual worker which processes one data element (like the common workflow when programming CUDA). Each physical worker \(i\) can execute \(N_i\) virtual workers (a \(1:N_i\) mapping).
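For illustration, the following hand-written sketch (plain CUDA, not pmacc code; the sizes are example values) shows how numWorkers physical workers cover a domain of domainSize virtual workers with a strided loop:

constexpr uint32_t domainSize = 256u;
constexpr uint32_t numWorkers = 64u;

// physical worker id, 0 .. numWorkers - 1
uint32_t const workerIdx = threadIdx.x;

// each physical worker executes domainSize / numWorkers = 4 virtual workers
for( uint32_t i = workerIdx; i < domainSize; i += numWorkers )
{
    // `i` is the index of one virtual worker within the domain
}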

pmacc helpers

template<uint32_t T_domainSize, uint32_t T_workerSize, uint32_t T_simdSize = 1u>
struct IdxConfig

describe a constant index domain

describe the size of the index domain and the number of workers operating on the domain

Template Parameters
  • T_domainSize: number of indices in the domain

  • T_workerSize: number of workers working on T_domainSize

  • T_simdSize: SIMD width
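For example, the index domain over all particle slots of a frame used in the patterns below is declared as follows (frameSize and numWorkers are compile-time constants of the kernel):

// index domain over all particle slots of a frame, processed by
// `numWorkers` physical workers (SIMD width defaults to 1)
using ParticleDomCfg = IdxConfig<
    frameSize,
    numWorkers
>;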

template<typename T_Type, typename T_IdxConfig>
struct CtxArray : public pmacc::memory::Array<T_Type, T_IdxConfig::numCollIter * T_IdxConfig::simdSize>, public T_IdxConfig

Statically sized array for a local variable.

The array is designed to hold context variables in lockstep programming. A context variable is just a local variable of a virtual worker. Allocating and using a context array allows virtual worker state to be propagated across subsequent lockstep rounds. A context array for a set of virtual workers is owned by their (physical) worker.

The number of elements depends on the index domain size and the number of workers to process the indices.
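A minimal sketch of this size relation (example numbers, assuming the default SIMD width of 1): a domain of 256 indices distributed over 64 workers leaves each physical worker with 256 / 64 = 4 context elements, one per lockstep round.

using ExampleCfg = IdxConfig< 256, 64 >;
// holds 4 elements per physical worker, all initialized to 0.0f
memory::CtxArray< float, ExampleCfg > energyCtx( 0.0f );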

template<typename T_IdxConfig>
struct ForEachIdx : public T_IdxConfig

execute a functor for each index

Distributes the indices evenly over all workers and executes a user-defined functor. There is no guarantee in which order the indices will be processed.

Template Parameters
  • T_IdxConfig: index domain description
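As used in the patterns below, the functor is called with two indices: linearIdx, the index of the virtual worker within the domain, and idx, the per-worker slot that can be used to address a CtxArray.

ForEachIdx< ParticleDomCfg > forEachParticle( workerIdx );
forEachParticle(
    [&]( uint32_t const linearIdx, uint32_t const idx )
    {
        // linearIdx: index within the domain, 0 .. T_domainSize - 1
        // idx:       local slot of this virtual worker, usable to index a CtxArray
    }
);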

Common Patterns

Collective Loop

  • each worker needs to pass through the loop N times

  • in this example, there are more data elements than workers to process them

// `frame` is a list which must be traversed collectively
while( frame.isValid() )
{
    uint32_t const workerIdx = threadIdx.x;
    using ParticleDomCfg = IdxConfig<
        frameSize,
        numWorkers
    >;
    ForEachIdx< ParticleDomCfg > forEachParticle( workerIdx );
    forEachParticle(
        [&]( uint32_t const linearIdx, uint32_t const idx )
        {
            // independent work
        }
    );
    // advance `frame` to the next element of the list (omitted here)
}

Non-Collective Loop

  • each virtual worker increments a private variable

uint32_t const workerIdx = threadIdx.x;
using ParticleDomCfg = IdxConfig<
    frameSize,
    numWorkers
>;
ForEachIdx< ParticleDomCfg > forEachParticle( workerIdx );
memory::CtxArray< int, ParticleDomCfg > vWorkerIdx( 0 );
forEachParticle(
    [&]( uint32_t const linearIdx, uint32_t const idx )
    {
        vWorkerIdx[ idx ] = linearIdx;
        for( int i = 0; i < 100; i++ )
            vWorkerIdx[ idx ]++;
    }
);

Create a Context Variable

  • … and initialize with the index of the virtual worker

uint32_t const workerIdx = threadIdx.x;
using ParticleDomCfg = IdxConfig<
    frameSize,
    numWorkers
>;
memory::CtxArray< int, ParticleDomCfg > vIdx(
    workerIdx,
    [&]( uint32_t const linearIdx, uint32_t const ) -> int32_t
    {
        return linearIdx;
    }
);

// is equal to

memory::CtxArray< int, ParticleDomCfg > vIdx;
ForEachIdx< ParticleDomCfg >{ workerIdx }(
    [&]( uint32_t const linearIdx, uint32_t const idx )
    {
        vIdx[ idx ] = linearIdx;
    }
);

Using a Master Worker

  • only one virtual worker (the master) out of all numWorkers manipulates a shared data structure for all others

// example: allocate shared memory (uninitialized)
PMACC_SMEM(
    finished,
    bool
);

uint32_t const workerIdx = threadIdx.x;
ForEachIdx<
    IdxConfig<
        1,
        numWorkers
    >
> onlyMaster{ workerIdx };

// manipulate shared memory
onlyMaster(
    [&](
        uint32_t const,
        uint32_t const
    )
    {
        finished = true;
    }
);

/* important: synchronize now, in case upcoming operations (with
 * other workers) access that manipulated shared memory section
 */
__syncthreads();