Lockstep Programming Model

Section author: René Widera, Axel Huebl

The lockstep programming model structures code that is evaluated collectively and independently by workers (physical threads) within an alpaka block. The actual processing is described by one-dimensional index domains, which are known at compile time and can even be changed within a kernel.

An index domain is independent of the data but can be mapped to a data domain, e.g. one to one or with more complex mappings. An index domain is processed collectively by all workers.

Code implemented with the lockstep programming model is free of any dependencies between the number of workers and the number of processed data elements. To simplify the implementation, each index within a domain can be mapped to a single data element (as in the common CUDA programming workflow). But even within this simplified picture, one real worker (i.e. physical thread) could still be assigned the workload of any number of domain indices.
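The last point can be illustrated on the host. The following minimal sketch assumes a round-robin assignment of domain indices to workers; `indicesOfWorker` is a hypothetical helper and not part of pmacc, and the real mapping is an implementation detail of the library.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of pmacc: lists the domain indices that one
// worker would process if indices were assigned round-robin (an assumption).
std::vector<uint32_t> indicesOfWorker(uint32_t domainSize, uint32_t numWorkers, uint32_t workerIdx)
{
    assert(numWorkers > 0u && workerIdx < numWorkers);
    std::vector<uint32_t> result;
    // The worker starts at its own index and strides by the number of workers.
    for(uint32_t domIdx = workerIdx; domIdx < domainSize; domIdx += numWorkers)
        result.push_back(domIdx);
    return result;
}
```

With a domain of 10 indices and 4 workers, worker 1 would handle the indices 1, 5 and 9 in this sketch. With more workers than indices, some workers receive no index at all.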

Functors passed into lockstep routines can have three different base parameter signatures. Additionally, each case can be extended by an arbitrary number of parameters to get access to context variables.

No parameter, if the work does not require the linear index within the domain: [&](){ }
An unsigned 32-bit integral parameter, if the work depends on indices within the domain range [0, domain size): [&](uint32_t const linearIdx){}
lockstep::Idx as parameter. lockstep::Idx holds the linear index within the domain and the meta information needed to access context variables: [&](pmacc::mappings::threads::lockstep::Idx const idx){}
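How a routine can accept functors with different signatures is sketched below for the first two cases. This is a host-side illustration only: `forEachIndex` and `signatureExample` are hypothetical names, the loop is sequential, and nothing here reflects how pmacc implements the dispatch.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical host-side stand-in for a lockstep foreach; not pmacc code.
template<typename T_Functor>
void forEachIndex(uint32_t domainSize, T_Functor&& functor)
{
    for(uint32_t linearIdx = 0u; linearIdx < domainSize; ++linearIdx)
    {
        if constexpr(std::is_invocable_v<T_Functor&, uint32_t>)
            functor(linearIdx); // the functor asked for the linear index
        else
            functor(); // the functor does not need the index
    }
}

// Both functor forms are invoked once per index of the domain.
inline void signatureExample()
{
    uint32_t calls = 0u;
    uint32_t indexSum = 0u;
    forEachIndex(4u, [&]() { ++calls; });
    forEachIndex(4u, [&](uint32_t const linearIdx) { indexSum += linearIdx; });
    assert(calls == 4u);
    assert(indexSum == 6u); // 0 + 1 + 2 + 3
}
```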

Context variables, i.e. arrays distributed over the workers, can be passed as additional arguments to the lockstep foreach. The data corresponding to each index element of the domain is then passed as an additional argument to the lambda function.

Naming conventions used for methods and members:

*DomSize is the index domain size as scalar value, typically an integral type
*DomSizeND is the N-dimensional index domain size, typically of the type pmacc::math::Vector<> or pmacc::math::CT::Vector
*DomIdx is the index domain element as scalar value, typically an integral type
*DomIdxND is the N-dimensional index domain element, typically of the type pmacc::math::Vector<>
*Size is the size of data as scalar value
*SizeND is the N-dimensional data size, typically of the type pmacc::math::Vector<>

pmacc helpers

template<uint32_t T_domainSize, uint32_t T_numWorkers, uint32_t T_simdSize> struct Config

Describes a constant index domain.

Describes the size of the index domain and the number of workers that operate on a lockstep domain.

Template Parameters:

T_domainSize – number of indices in the domain
T_numWorkers – number of workers working on T_domainSize
T_simdSize – SIMD width

struct Idx: Hold current index within a lockstep domain.

template<typename T_Acc, typename T_BlockCfg> class Worker

Entity of a worker.

Context object used for lockstep programming. This object provides access to the alpaka accelerator and to the indices used for the lockstep programming model.

Template Parameters:

T_numSuggestedWorkers – Suggested number of lockstep workers. Do not assume that the suggested number of workers is used within the kernel. The number of workers actually used can be queried with numWorkers() or via the member variable numWorkers.

template<typename T_Type, typename T_Config> struct Variable : protected pmacc::memory::Array<T_Type, T_Config::maxIndicesPerWorker>, public T_Config 

Variable used by virtual worker.

This object is designed to hold context variables in lockstep programming. A context variable is just a local variable of a virtual worker. Allocating and using a context variable allows the state of virtual workers to be propagated over subsequent locksteps. A context variable for a set of virtual workers is owned by their (physical) worker.

Data stored in a context variable should only be used with a lockstep programming construct, e.g. lockstep::ForEach<>.
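The ownership idea can be pictured with a small host-side model. The sketch below is not pmacc's lockstep::Variable: `ToyVariable` and `sumAfterTwoLocksteps` are hypothetical names, each worker is assumed to own ceil(domainSize / numWorkers) slots, and the round-robin mapping from domain index to worker is an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a context variable; not pmacc's lockstep::Variable.
// Precondition: workers > 0.
struct ToyVariable
{
    uint32_t domainSize;
    uint32_t numWorkers;
    uint32_t slotsPerWorker; // ceil(domainSize / numWorkers)
    std::vector<int32_t> storage;

    ToyVariable(uint32_t domSize, uint32_t workers)
        : domainSize(domSize)
        , numWorkers(workers)
        , slotsPerWorker((domSize + workers - 1u) / workers)
        , storage(static_cast<std::size_t>(slotsPerWorker) * workers, 0)
    {
    }

    // Map a domain index to a slot of the worker that processes it
    // (round-robin assignment is an assumption of this sketch).
    int32_t& operator[](uint32_t domIdx)
    {
        assert(domIdx < domainSize);
        uint32_t const worker = domIdx % numWorkers;
        uint32_t const slot = domIdx / numWorkers;
        return storage[static_cast<std::size_t>(worker) * slotsPerWorker + slot];
    }
};

// Two subsequent locksteps: the first stores a value per index,
// the second reads the stored state back.
inline int32_t sumAfterTwoLocksteps(uint32_t domainSize, uint32_t numWorkers)
{
    ToyVariable var(domainSize, numWorkers);
    for(uint32_t idx = 0u; idx < domainSize; ++idx)
        var[idx] = static_cast<int32_t>(idx); // lockstep 1: initialize
    int32_t sum = 0;
    for(uint32_t idx = 0u; idx < domainSize; ++idx)
        sum += var[idx]; // lockstep 2: the state is still available
    return sum;
}
```

For a domain of 10 indices and 4 workers this model reserves 3 slots per worker, so 12 slots in total, two of which stay unused.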

template<typename T_Worker, typename T_Config> class ForEach

Execute a functor for the given index domain.

Algorithm to execute a subsequent lockstep for each index of the configured domain.

Attention: There is no implicit synchronization between workers before or after ForEach is executed.

Template Parameters: T_Config – Configuration for the domain and execution strategy. T_Config must provide: domainSize, numCollIter, numWorkers, and simdSize at compile time.

Common Patterns

Create a Context Variable

A context variable is used to transfer information from one lockstep to a subsequent one. You can use a context variable lockstep::Variable similar to a temporary local variable in a function. A context variable must be defined outside of ForEach and should be accessed only within the functor passed to ForEach.

… and initialize with the index of the domain element

// variable 'worker' is provided by pmacc if the kernel launch macro `PMACC_LOCKSTEP_KERNEL()` is used.
constexpr uint32_t frameSize = 256;
auto forEachParticleSlotInFrame = lockstep::makeForEach<frameSize>(worker);
auto elemIdx = forEachParticleSlotInFrame(
    [](lockstep::Idx const idx) -> int32_t
    {
        return idx;
    }
);

// is equal to

// assume one dimensional indexing of threads within a block
constexpr uint32_t frameSize = 256;
auto forEachParticleSlotInFrame = lockstep::makeForEach<frameSize>(worker);
// variable will be uninitialized
auto elemIdx = lockstep::makeVar<int32_t>(forEachParticleSlotInFrame);
forEachParticleSlotInFrame(
    [&](uint32_t const idx, auto& vIndex)
    {
        vIndex = idx;
    },
    elemIdx
);
// is equal to
forEachParticleSlotInFrame(
    [&](lockstep::Idx const idx)
    {
        elemIdx[idx] = idx;
    }
);

To initialize a context variable with a default value, pass the constructor arguments directly at creation.

// variable 'worker' is provided by pmacc if the kernel launch macro `PMACC_LOCKSTEP_KERNEL()` is used.
constexpr uint32_t frameSize = 256;
auto forEachParticleSlotInFrame = lockstep::makeForEach<frameSize>(worker);
auto var = lockstep::makeVar<int32_t>(forEachParticleSlotInFrame, 23);

Data from a context variable can be accessed within independent locksteps. Only the data elements that correspond to the element index of the domain can be accessed.

// variable 'worker' is provided by pmacc if the kernel launch macro `PMACC_LOCKSTEP_KERNEL()` is used.
constexpr uint32_t frameSize = 256;
auto forEachParticleSlotInFrame = lockstep::makeForEach<frameSize>(worker);
auto elemIdx = forEachParticleSlotInFrame(
    [](uint32_t const idx) -> int32_t
    {
        return idx;
    }
);

// store old linear index into oldElemIdx
auto oldElemIdx = forEachParticleSlotInFrame(
    [&](lockstep::Idx const idx) -> int32_t
    {
        int32_t old = elemIdx[idx];
        printf("domain element idx: %u == %u\n", elemIdx[idx], idx);
        elemIdx[idx] += 256;
        return old;
    }
);

// To avoid confusion between read-only and read-write input variables we suggest using
// const for read-only variables.
forEachParticleSlotInFrame(
    [&](lockstep::Idx const idx, int32_t const oldIndex, int32_t const vIndex)
    {
        printf("nothing changed: %u == %u - 256 == %u\n", oldIndex, vIndex, idx);
    },
    oldElemIdx,
    elemIdx
);

Collective Loop over particles

Each worker needs to pass through the loop N times.
In this example, there are more data elements than workers to process them.

// variable 'worker' is provided by pmacc if the kernel launch macro `PMACC_LOCKSTEP_KERNEL()` is used.
// `frame` is a list which must be traversed collectively
while( frame.isValid() )
{
    // assume one dimensional indexing of threads within a block
    constexpr uint32_t frameSize = 256;
    auto forEachParticleSlotInFrame = lockstep::makeForEach<frameSize>(worker);
    forEachParticleSlotInFrame(
        [&](lockstep::Idx const idx)
        {
            // independent work, idx can be used to access a context variable
        }
    );
    forEachParticleSlotInFrame(
        [&](uint32_t const linearIdx)
        {
            // independent work based on the linear index only, e.g. shared memory access
        }
    );
}

Non-Collective Loop over particles

Each element index of the domain increments a private variable.

// variable 'worker' is provided by pmacc if the kernel launch macro `PMACC_LOCKSTEP_KERNEL()` is used.
constexpr uint32_t frameSize = 256;
auto forEachParticleSlotInFrame = lockstep::makeForEach<frameSize>(worker);
auto vWorkerIdx = lockstep::makeVar<int32_t>(forEachParticleSlotInFrame, 0);
forEachParticleSlotInFrame(
    [&](auto const idx, int32_t& vWorker)
    {
        // assign the linear element index to context variable
        vWorker = idx;
        for(int i = 0; i < 100; i++)
            vWorker++;
    },
    vWorkerIdx
);

Using a Master Worker

Only a single element index of the domain (called master) manipulates a shared data structure for all others.

// example: allocate shared memory (uninitialized)
PMACC_SMEM(
    finished,
    bool
);

// variable 'worker' is provided by pmacc if the kernel launch macro `PMACC_LOCKSTEP_KERNEL()` is used.
auto onlyMaster = lockstep::makeMaster(worker);

// manipulate shared memory
onlyMaster(
    [&]( )
    {
        finished = true;
    }
);

/* important: synchronize now, in case upcoming operations (with
 * other workers) access that manipulated shared memory section
 */
worker.sync();

Practical Examples

If possible, kernels should be written without assuming any particular lockstep domain size or number of alpaka blocks selected at kernel start. This ensures that the kernel results are always correct, even if the user does not choose the right parameters for the kernel execution.

struct IotaGenericKernel
{
    template<typename T_Worker, typename T_DataBox>
    HDINLINE void operator()(T_Worker const& worker, T_DataBox data, uint32_t size) const
    {
        constexpr uint32_t blockDomSize = T_Worker::blockDomSize();
        auto numDataBlocks = (size + blockDomSize - 1u) / blockDomSize;

        // grid-strided loop over the chunked data
        for(int dataBlock = worker.blockDomIdx(); dataBlock < numDataBlocks; dataBlock += worker.gridDomSize())
        {
            auto dataBlockOffset = dataBlock * blockDomSize;
            auto forEach = pmacc::lockstep::makeForEach(worker);
            forEach(
                [&](uint32_t const inBlockIdx)
                {
                    auto idx = dataBlockOffset + inBlockIdx;
                    if(idx < size)
                    {
                        // ensure that each block is not overwriting data from other blocks
                        PMACC_DEVICE_VERIFY_MSG(data[idx] == 0u, "%s\n", "Result buffer not valid initialized!");
                        data[idx] = idx;
                    }
                });
        }
    }
};

template<uint32_t T_chunkSize, typename T_DeviceBuffer>
inline void iotaGerneric(T_DeviceBuffer& devBuffer)
{
    auto bufferSize = devBuffer.size();
    // use only half of the blocks needed to process the full data
    uint32_t const numBlocks = bufferSize / T_chunkSize / 2u;
    PMACC_LOCKSTEP_KERNEL(IotaGenericKernel{}).config<T_chunkSize>(numBlocks)(devBuffer.getDataBox(), bufferSize);
}
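That the grid-strided loop covers every element exactly once, even when fewer blocks are launched than there are data chunks, can be checked with a sequential host-side emulation. The sketch below only mirrors the index arithmetic of the kernel above; `emulateGridStridedIota` and its parameters are hypothetical names and no pmacc or alpaka calls are involved.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Host-side emulation of the grid-strided loop; all names are hypothetical.
// Returns how often each data element would be written.
std::vector<uint32_t> emulateGridStridedIota(uint32_t size, uint32_t blockDomSize, uint32_t gridDomSize)
{
    assert(blockDomSize > 0u && gridDomSize > 0u);
    std::vector<uint32_t> writes(size, 0u);
    uint32_t const numDataBlocks = (size + blockDomSize - 1u) / blockDomSize;

    // Outer loop: emulate every alpaka block of the grid, one after another.
    for(uint32_t blockDomIdx = 0u; blockDomIdx < gridDomSize; ++blockDomIdx)
    {
        // Grid-strided loop over the chunked data, as in the kernel.
        for(uint32_t dataBlock = blockDomIdx; dataBlock < numDataBlocks; dataBlock += gridDomSize)
        {
            uint32_t const dataBlockOffset = dataBlock * blockDomSize;
            for(uint32_t inBlockIdx = 0u; inBlockIdx < blockDomSize; ++inBlockIdx)
            {
                uint32_t const idx = dataBlockOffset + inBlockIdx;
                if(idx < size) // the last chunk may be only partially filled
                    ++writes[idx];
            }
        }
    }
    return writes;
}
```

For 1000 elements and a chunk size of 64, the launcher formula above yields 1000 / 64 / 2 = 7 blocks for 16 chunks, and each element is still written exactly once. Note that the same integer division yields zero blocks when the buffer holds fewer than two chunks, a case this sketch rejects with its assertion.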

The block domain size can also be derived from an instance of any object if the trait pmacc::lockstep::traits::MakeBlockCfg is defined for its type.

namespace pmacc::lockstep::traits
{
    //! Specialization to create a lockstep block configuration out of a device buffer.
    template<>
    struct MakeBlockCfg<pmacc::DeviceBuffer<uint32_t, DIM1>> : std::true_type
    {
        using type = BlockCfg<math::CT::UInt32<53>>;
    };
} // namespace pmacc::lockstep::traits

template<typename T_DeviceBuffer>
inline void iotaGernericBufferDerivedChunksize(T_DeviceBuffer& devBuffer)
{
    auto bufferSize = devBuffer.size();
    constexpr uint32_t numBlocks = 9;
    PMACC_LOCKSTEP_KERNEL(IotaGenericKernel{}).config(numBlocks, devBuffer)(devBuffer.getDataBox(), bufferSize);
}

Sometimes it is not possible to write a generic kernel, and a hard-coded block domain size is required to fulfill a stencil condition or other requirements. In this case pmacc::lockstep::makeForEach<hardCodedBlockDomSize>(worker) can be used on the device. The drawback is that the user must know this hard-coded requirement when calling the kernel; otherwise the kernel may run slowly, because too many worker threads idle when the block domain selected at the kernel call is larger than the block domain required within the kernel. If the kernel defines the member variable blockDomSize and no block domain size is provided during the kernel configuration, the kernel is automatically executed with the block domain size specified by the kernel. Overriding the block domain size at the kernel call then triggers a static assertion at compile time.

struct IotaFixedChunkSizeKernel
{
    static constexpr uint32_t blockDomSize = 42;

    template<typename T_Worker, typename T_DataBox>
    HDINLINE void operator()(T_Worker const& worker, T_DataBox data, uint32_t size) const
    {
        static_assert(blockDomSize == T_Worker::blockDomSize());

        auto numDataBlocks = (size + blockDomSize - 1u) / blockDomSize;

        // grid-strided loop over the chunked data
        for(int dataBlock = worker.blockDomIdx(); dataBlock < numDataBlocks; dataBlock += worker.gridDomSize())
        {
            auto dataBlockOffset = dataBlock * blockDomSize;
            auto forEach = pmacc::lockstep::makeForEach(worker);
            forEach(
                [&](uint32_t const inBlockIdx)
                {
                    auto idx = dataBlockOffset + inBlockIdx;
                    if(idx < size)
                    {
                        // ensure that each block is not overwriting data from other blocks
                        PMACC_DEVICE_VERIFY_MSG(data[idx] == 0u, "%s\n", "Result buffer not valid initialized!");
                        data[idx] = idx;
                    }
                });
        }
    }
};

template<typename T_DeviceBuffer>
inline void iotaFixedChunkSize(T_DeviceBuffer& devBuffer)
{
    auto bufferSize = devBuffer.size();
    constexpr uint32_t numBlocks = 10;
    PMACC_LOCKSTEP_KERNEL(IotaFixedChunkSizeKernel{}).config(numBlocks)(devBuffer.getDataBox(), bufferSize);
}
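How a launcher could pick up a kernel-defined block domain size can be sketched with the standard member detection idiom. This is an assumption about the general technique, not a description of pmacc internals: `HasBlockDomSize`, `selectBlockDomSize`, `GenericKernel` and `FixedKernel` are hypothetical names, and the static assertion on conflicting sizes mentioned above is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Detection idiom: does T_Kernel define a static member `blockDomSize`?
template<typename T_Kernel, typename = void>
struct HasBlockDomSize : std::false_type
{
};

template<typename T_Kernel>
struct HasBlockDomSize<T_Kernel, std::void_t<decltype(T_Kernel::blockDomSize)>> : std::true_type
{
};

// Hypothetical selection rule: a size defined by the kernel wins over the fallback.
template<typename T_Kernel, uint32_t T_fallback>
constexpr uint32_t selectBlockDomSize()
{
    if constexpr(HasBlockDomSize<T_Kernel>::value)
        return T_Kernel::blockDomSize;
    else
        return T_fallback;
}

struct GenericKernel
{
};

struct FixedKernel
{
    static constexpr uint32_t blockDomSize = 42;
};

// Both cases are resolved at compile time.
static_assert(selectBlockDomSize<GenericKernel, 256u>() == 256u);
static_assert(selectBlockDomSize<FixedKernel, 256u>() == 42u);
```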

Analogous to the scalar block domain size blockDomSize, a member type BlockDomSizeND of the pmacc type pmacc::math::CT::UInt32<> can be defined to express an N-dimensional block domain. blockDomSize and BlockDomSizeND are mutually exclusive and cannot be defined at the same time for a kernel.

struct IotaFixedChunkSizeKernelND
{
    using BlockDomSizeND = pmacc::math::CT::UInt32<42>;

    template<typename T_Worker, typename T_DataBox>
    HDINLINE void operator()(T_Worker const& worker, T_DataBox data, uint32_t size) const
    {
        static constexpr uint32_t blockDomSize = BlockDomSizeND::x::value;

        static_assert(blockDomSize == T_Worker::blockDomSize());

        // grid-strided loop over the chunked data
        auto numDataBlocks = (size + blockDomSize - 1u) / blockDomSize;

        for(int dataBlock = worker.blockDomIdx(); dataBlock < numDataBlocks; dataBlock += worker.gridDomSize())
        {
            auto dataBlockOffset = dataBlock * blockDomSize;
            auto forEach = pmacc::lockstep::makeForEach(worker);
            forEach(
                [&](uint32_t const inBlockIdx)
                {
                    auto idx = dataBlockOffset + inBlockIdx;
                    if(idx < size)
                    {
                        // ensure that each block is not overwriting data from other blocks
                        PMACC_DEVICE_VERIFY_MSG(data[idx] == 0u, "%s\n", "Result buffer not valid initialized!");
                        data[idx] = idx;
                    }
                });
        }
    }
};

template<typename T_DeviceBuffer>
inline void iotaFixedChunkSizeND(T_DeviceBuffer& devBuffer)
{
    auto bufferSize = devBuffer.size();
    constexpr uint32_t numBlocks = 11;
    PMACC_LOCKSTEP_KERNEL(IotaFixedChunkSizeKernelND{}).config(numBlocks)(devBuffer.getDataBox(), bufferSize);
}

To use dynamic shared memory within a lockstep kernel, the kernel must be configured with configSMem instead of config:

struct IotaGenericKernelWithDynSharedMem
{
    template<typename T_Worker, typename T_DataBox>
    HDINLINE void operator()(T_Worker const& worker, T_DataBox data, uint32_t size) const
    {
        constexpr uint32_t blockDomSize = T_Worker::blockDomSize();
        auto numDataBlocks = (size + blockDomSize - 1u) / blockDomSize;

        uint32_t* s_mem = ::alpaka::getDynSharedMem<uint32_t>(worker.getAcc());

        // grid-strided loop over the chunked data
        for(int dataBlock = worker.blockDomIdx(); dataBlock < numDataBlocks; dataBlock += worker.gridDomSize())
        {
            auto dataBlockOffset = dataBlock * blockDomSize;
            auto forEach = pmacc::lockstep::makeForEach(worker);
            forEach(
                [&](uint32_t const inBlockIdx)
                {
                    auto idx = dataBlockOffset + inBlockIdx;
                    s_mem[inBlockIdx] = idx;
                    if(idx < size)
                    {
                        // ensure that each block is not overwriting data from other blocks
                        PMACC_DEVICE_VERIFY_MSG(data[idx] == 0u, "%s\n", "Result buffer not valid initialized!");
                        data[idx] = s_mem[inBlockIdx];
                    }
                });
        }
    }
};

template<uint32_t T_chunkSize, typename T_DeviceBuffer>
inline void iotaGernericWithDynSharedMem(T_DeviceBuffer& devBuffer)
{
    auto bufferSize = devBuffer.size();
    // use only half of the blocks needed to process the full data
    uint32_t const numBlocks = bufferSize / T_chunkSize / 2u;
    constexpr size_t requiredSharedMemBytes = T_chunkSize * sizeof(uint32_t);
    PMACC_LOCKSTEP_KERNEL(IotaGenericKernelWithDynSharedMem{})
        .configSMem<T_chunkSize>(numBlocks, requiredSharedMemBytes)(devBuffer.getDataBox(), bufferSize);
}