Device Oversubscription

Module author: René Widera

By default, PIConGPU assigns one MPI rank to a single compute device, e.g. one GPU. In some situations it can be beneficial to use multiple MPI ranks per device, e.g. to achieve better load balancing or to better overlap communication with computation.

Usage

Command line parameters are passed to PIConGPU as described in the usage documentation. PIConGPU provides the command line parameter --numRanksPerDevice (short -r) to allow sharing a compute device between multiple MPI ranks. If you change the default value 1 to 2, PIConGPU supports two MPI processes per device.
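A minimal launch sketch, assuming a single GPU shared by two MPI ranks; the binary path, domain decomposition (-d), grid size (-g), and step count (-s) are placeholders for your own setup:

```shell
# Two MPI ranks share one GPU; the -d decomposition must match the
# total number of ranks (here 1 x 1 x 2 = 2).
mpiexec -n 2 ./bin/picongpu -d 1 1 2 -g 128 128 256 -s 1000 \
    --numRanksPerDevice 2
```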

Note

Using device oversubscription also limits the maximal device memory footprint per PIConGPU MPI rank to <total available memory on device>/<number of ranks per device>.
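The per-rank limit above can be illustrated with a small calculation; the 16 GiB device size is just an example value:

```shell
# Illustrative per-rank memory limit for an oversubscribed device.
total_mib=16384          # total device memory, e.g. a 16 GiB GPU
ranks_per_device=2       # value passed via --numRanksPerDevice
per_rank_mib=$(( total_mib / ranks_per_device ))
echo "${per_rank_mib} MiB available per PIConGPU rank"
```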

NVIDIA

Compute Mode

On NVIDIA GPUs there are several settings which can influence the oversubscription of a device/GPU. The NVIDIA Compute Mode must be set to Default to allow multiple processes to use a single GPU. If you use device oversubscription with NVIDIA GPUs, kernels executed from different processes will be serialized by the driver; this typically degrades the performance of PIConGPU because the device is underutilized.
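You can check and, with administrative rights, change the compute mode via nvidia-smi; a sketch, assuming GPU index 0:

```shell
# Query the current compute mode; it should report "Default"
# if multiple processes are to share the GPU.
nvidia-smi --query-gpu=compute_mode --format=csv,noheader

# Switch GPU 0 back to Default mode (requires root privileges):
# nvidia-smi -i 0 -c DEFAULT
```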

Multi-Process Service (MPS)

If you use NVIDIA MPS and split one device into 4 virtual GPUs, you still need to pass --numRanksPerDevice 4 to PIConGPU, even though MPS is providing you with 4 virtual GPUs. MPS can be used to work around the kernel serialization that occurs when multiple processes share a GPU.
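A sketch of starting MPS before a four-rank run; the pipe/log directories are placeholders, and on managed clusters MPS is often started for you, so consult your system documentation:

```shell
# Start the MPS control daemon (once per node, before the run).
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log
nvidia-cuda-mps-control -d

# Launch PIConGPU as usual; all 4 ranks reach the same physical
# GPU through MPS (add your usual simulation options).
mpiexec -n 4 ./bin/picongpu --numRanksPerDevice 4

# Shut the daemon down after the run:
echo quit | nvidia-cuda-mps-control
```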

CPU

If you compiled PIConGPU with a CPU accelerator, e.g. omp2b, serial, tbb, or threads, device oversubscription will have no effect. For CPU accelerators PIConGPU does not use a pre-allocated device memory heap, therefore you can freely choose the number of MPI ranks per CPU.