Automated Parameter Scans using Snakemake
Snakemake is a Python-based workflow engine that can be used to automate compiling, running, and post-processing of PIConGPU simulations, or any other workflow that can be represented as a directed acyclic graph (DAG).
Each workflow consists of a Snakefile
in which the workflow is defined using rules. Each rule represents a certain task. Dependencies between rules are defined by input and output files. Each rule can consist of a shell command, Python code, or an external Python script (Rust, R, Julia, and Jupyter notebooks are also supported).
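As an illustration of the rule concept (not part of the PIConGPU workflow; the file names and the command are placeholders), a minimal rule might look like this:

# Minimal sketch of a Snakefile rule; names and command are placeholders.
rule example_rule:
    input:
        "data/input.txt"        # this rule runs only if another rule (or the target) needs its output
    output:
        "results/output.txt"    # other rules can declare this file as their input
    shell:
        "cp {input} {output}"   # could also be a run: block with Python code or a script: directive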
How to use
In the PIConGPU source code, under picongpu/share/picongpu/examples/LaserWakefield/lib/python/snakemake/, you can find:
Snakefile
config.yaml
requirements.txt
params.csv
With these files, a parameter scan with the LaserWakefield example of PIConGPU can be performed on hemera. To do so:

- Copy the Snakefile and config.yaml.
- Set up an environment using the requirements.txt. Make sure you have snakemake and the snakemake-executor-plugin-slurm installed and activated.
- Adjust the profile config.yaml (a sketch of a possible profile is shown after these steps):

  - Define your input parameters in a csv file. For the LaserWakefield example, this can look like this:

    LASERA0,PULSEDURATION
    4.0,1.5e-14
    3.0,2.5e-14

    Warning

    Snakemake will automatically perform a parameter-dependent compile using CMAKE flags if and only if the parameter names in the header of the csv file match those in the .param file of the PIConGPU project.

  - Specify the path of your PIConGPU project, i.e. the directory where pic-create will be executed.
  - Specify the path to your PIConGPU profile and the name of your cfg file.
  - Optional: Adjust resources and other workflow parameters (see Fine-Tuning).

- Start the workflow in the directory where the Snakefile and config.yaml are located via

  snakemake --profile .
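A rough sketch of what such a profile config.yaml might contain is shown below; executor and jobs are standard Snakemake profile options, while the PIConGPU-specific keys are placeholders, so use the names from the shipped config.yaml:

# Rough sketch of a profile config.yaml, assuming the slurm executor plugin.
executor: slurm        # submit rules as Slurm jobs via snakemake-executor-plugin-slurm
jobs: 10               # maximum number of jobs running in parallel

# PIConGPU-specific entries (placeholder names; see the shipped config.yaml):
# picongpu_project: /path/to/my/picongpu/project   # directory where pic-create is executed
# picongpu_profile: /path/to/picongpu.profile
# cfg_file: 1.cfg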
Note
You may want to start your snakemake workflow in a screen-session.
Fine-Tuning
There are several command line options you can use to customise the behaviour of your workflow. An overview can be found in the Snakemake documentation or by using snakemake --help. Here are some recommendations:
--jobs N, -j N
Use at most N jobs in parallel. Set to unlimited to allow any number of jobs.
--groups
By default, each rule/task is run in its own (cluster) job. To run multiple tasks in one job, define groups in the Snakefile or config.yaml; this only works if the grouped tasks are connected in the DAG. In this example, the compile rule is placed in the "compile" group, so it is possible to run multiple compile processes in a single Slurm job.
--group-components
Indicates how many tasks of a group will be executed in one cluster job. In this example, group-components: "compile=2" defines that 2 compile processes will be run in one Slurm job. This is particularly useful for smaller rules such as Python post-processing, where it would be easy to end up with hundreds of small, fast cluster jobs if no grouping took place.
--dry-run, -n
Does not execute anything. Useful for checking that the workflow is set up correctly and that only the desired rules are executed.
This is important to ensure that data that has already been written is not erased, because Snakemake will re-run jobs if code or input has changed and will erase the output of the rule before doing so. (In short, if you change a path or some code in the Snakefile, you might re-run expensive simulations.) To prevent simulations from being repeated for the wrong reasons, use:
--rerun-triggers {code,input,mtime,params,software-env}
Define what triggers the rerunning of a job. By default, all triggers are used, which guarantees that results are consistent with the workflow code and configuration.
--retries N
Retries a failed rule N times. Can be defined for each rule individually. Also useful if a cluster has a limited walltime and the PIConGPU flag --try.restart is to be used. Since Snakemake resubmits the submit.start, the simulation will start from the last available checkpoint when this flag is used.
--latency-wait SECONDS
Wait the given SECONDS if an output file of a job is not present after the job has finished. This helps if your filesystem suffers from latency (default: 5).
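As an arbitrary illustration, several of these options can be combined in a single invocation (the values are examples only):

snakemake --profile . --jobs 10 --group-components compile=2 --retries 2 --rerun-triggers mtime --latency-wait 60 --dry-run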
Resulting file structure
The output produced by the workflow is stored in three directories next to the Snakefile.
- “simulations”
Contains simulation directories.
The name of the simulation directory is sim_{paramspace.wildcard_pattern}, where paramspace.wildcard_pattern becomes, for example, LASERA0-4.0_PULSEDURATION-1.5e-14.
- “simulated”
Contains txt files indicating whether a simulation has already run, as well as the job ID of the corresponding simulation on the cluster.
- “projects”
Contains the input directories of the simulations.
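For the example parameter set above, the layout next to the Snakefile might then look roughly like this (names follow the patterns described in this section):

.
├── Snakefile
├── config.yaml
├── projects/
├── simulations/
│   └── sim_LASERA0-4.0_PULSEDURATION-1.5e-14/
└── simulated/
    └── finished_LASERA0-4.0_PULSEDURATION-1.5e-14.txt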
If you want to change the file structure, you need to change it in the Snakefile.
Be aware that paths defined in your Snakefile are always relative to the location of the Snakefile.
What it does
The workflow takes input parameters, performs a parameter-dependent compile, and submits the simulation to the cluster. These steps are defined as so-called rules in the Snakefile. The order in which the rules are executed is determined by the input and output of the rules. This means that a rule is only executed if its output is needed as input by another rule.
Details of the individual rules:
- rule all:
Is the so-called target rule. By default, Snakemake will only execute the very first rule specified in the Snakefile. Therefore, this pseudo-rule should contain all the anticipated output as its input. Snakemake will then try to generate this input.
- rule build_command:
Is a helper rule that generates a string that is later used by the pic-build command and contains the information about the CMAKE flags.
- rule compile:
Clones the PIConGPU project (defined in the config.yaml) using pic-create. Since Snakemake relies on files to check dependencies between tasks, and a simulation has no predefined unique output file, the tpl file is modified such that it creates a unique output file, called finished_{params.name}.txt, when the simulation is finished. Compiles for each parameter set and then creates a simulation directory.
- rule simulate:
To use the tbg interface, the rule simulate is a local rule. The output file ("simulated/finished_{paramspace.wildcard_pattern}.txt") is created after the simulation, but the shell script would be done immediately after submitting the simulation. If the task is done and the output file has not been created, an error occurs and the workflow fails. In order to make Snakemake wait until the simulation is finished, the status of the Slurm job is checked every two minutes (an illustrative sketch of such a wait loop is shown after the warning below).
This control loop is set up in such a way that, even if the Snakemake session is aborted or fails, it will catch up with simulations that are already running when Snakemake is restarted.
Warning
The simulate rule looks for 100 % = in stdout. If the number of time steps and the percentage of output do not match, such an output will never be created (e.g. 1024 time steps and output every 5 % will not generate a 100 % = output).
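As an illustration only (the actual implementation lives in the Snakefile, and JOB_ID and STDOUT_FILE are placeholders), the wait loop might be sketched in shell like this:

# Illustrative sketch, not the exact code from the Snakefile.
# JOB_ID and STDOUT_FILE are placeholders for values known to the rule.
while squeue -j "${JOB_ID}" | grep -q "${JOB_ID}"; do
    sleep 120                       # check the Slurm job status every two minutes
done
# After the job has left the queue, verify that PIConGPU actually reached 100 %:
grep -q "100 % =" "${STDOUT_FILE}" || exit 1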
Using the example Snakefile and params.csv, the resulting DAG looks like this.
Python post-processing
The script directive
You can automatically post-process your results by adding new rules to the Snakefile. Here is an example of what this might look like for a Python script called post_processing.py:
rule post_processing:
    input:
        rules.simulate.output
    output:
        f"results/post_processing_{paramspace.wildcard_pattern}.png"
    params:
        sim_dir=f"simulations/sim_{paramspace.wildcard_pattern}/simOutput/openPMD",  # simulation directory
        sim_params=paramspace.instance,  # dictionary of parameters used to generate this simulation
        generic_parameter=1000
    script:
        "post_processing.py"
The given script will be run by Snakemake in a special way that puts a snakemake object into the global namespace (of this script).
This object contains useful context information for the running script.
For example, the parameter set of the rule in the Snakefile is stored in the params member and can be accessed in your Python script via a list- or dictionary-like interface. So, accessing the sim_dir parameter could be done via snakemake.params[0], snakemake.params['sim_dir'] or snakemake.params.sim_dir. One can use snakemake.input or snakemake.output accordingly.
More details can be found in Snakemake’s documentation.
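For instance, inside post_processing.py the values defined in the rule above could be read like this:

# Inside post_processing.py, executed via the script: directive,
# the `snakemake` object is available without any import:
sim_dir = snakemake.params[0]           # positional, list-like access
sim_dir = snakemake.params["sim_dir"]   # dictionary-like access
sim_dir = snakemake.params.sim_dir      # attribute access
input_files = snakemake.input           # inputs of the rule (here: rules.simulate.output)
output_file = snakemake.output[0]       # first output file of the rule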
To run your new rule, you can either specify the desired output explicitly via the command line, e.g., snakemake "results/post_processing_<…>.png" (a concrete example command is shown below), or alter the default rule:
rule all:
    input: expand("results/post_processing_{params}.png", params=paramspace.instance_patterns)
Of course you can have as many rules as you want after the simulation; just make sure that Snakemake can build a rule graph by going from the output of one rule to the input of the next rule, ending at the input of the target rule all.
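For example, with the parameter set from above, a single post-processing result can be requested explicitly (the file name follows the wildcard pattern shown earlier):

snakemake --profile . "results/post_processing_LASERA0-4.0_PULSEDURATION-1.5e-14.png"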
Note
Note the expand() function in the all rule. This can be used to declare that all instances of the parameter space are meant. Further information can be found here.
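With the example params.csv, the expand() call in the all rule evaluates to roughly this list of expected outputs:

# Approximate result of the expand() call for the example params.csv:
["results/post_processing_LASERA0-4.0_PULSEDURATION-1.5e-14.png",
 "results/post_processing_LASERA0-3.0_PULSEDURATION-2.5e-14.png"]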
Recommendations on how to structure scripts for Snakemake
For effective use with Snakemake, your scripts should parametrise aspects of the execution that Snakemake is supposed to organise. Most importantly, these are the input and output filenames but could also be other parameters as seen above. This can be facilitated by putting all your functional code into a def main(input_filename, output_filename, **further_parameters) function. The only “free” code in your script should handle the parameter extraction from the snakemake object and call main(…) with the pertinent values.
import sys


def main(input_filename, output_filename, **further_parameters):
    # Put your post-processing here.
    # Take the data from the input_filename(s).
    # Save the results to output_filename(s).
    # Feel free to define further functions and use them in here.
    pass


if __name__ == "__main__":
    # Handle parameter extraction.
    try:
        # If we're running from within Snakemake,
        # there is a `snakemake` object in the global namespace
        # that we can get our parameters from.
        input_filename = snakemake.input[0]
        output_filename = snakemake.output[0]
        further_parameters = {}  # e.g., fill from snakemake.params as needed
    except NameError:
        # If we got this error,
        # likely there was no `snakemake` object in the namespace.
        # We need to do something else to get our parameters:
        input_filename = sys.argv[1]  # use commandline arguments
        output_filename = sys.argv[2]
        further_parameters = {}
        # ... or something more elaborate like argparse, etc.

    # Start the post-processing independent of how we extracted the parameters.
    main(input_filename, output_filename, **further_parameters)
The above code snippet defines a main() function where you can put your post-processing code. The free code of the script is guarded by an if __name__ == "__main__" clause (see here for an explanation). It consists of two parts: extracting the parameters and calling the main(...) function.
The snippet uses a try: … except: … clause to guard against the case where we are not actually running from within Snakemake. The suggested alternative takes arguments from the commandline but other things like raising an Exception or using defaults would work. Having this fallback mechanism comes in handy for debugging and manual testing because we don’t need to fire up Snakemake whenever we want to test something.
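With this fallback in place, the script can also be run by hand, for example like this (paths are placeholders following the directory layout above):

python post_processing.py simulations/sim_LASERA0-4.0_PULSEDURATION-1.5e-14/simOutput/openPMD results/manual_test.png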
Cluster execution
To perform this evaluation on the cluster, add the required resources to the config.yaml. For example, like this:
set-resources:
    post_processing:  # resources for post-processing
        slurm_partition: "defq"
        runtime: 20
        nodes: 1
        ntasks: 1
        mem_mb: 5000
Running on a generic cluster
If you want to run on a cluster other than hemera that doesn't use the Slurm scheduler, check the Snakemake plugin catalog to see whether there is an executor plugin for your batch system. If there is no executor plugin for your batch system, you can use the generic cluster execution.
Warning
In any case, the Snakefile must be adapted to the specific cluster.
The "Snakefile_LSF" is an example for running on an LSF cluster (e.g. Summit) using the generic cluster executor.
- To use it:
Install the snakemake-executor-plugin-cluster-generic plugin. Adapt the executor and add the submit command in the config.yaml:
executor: cluster-generic
cluster-generic-submit-cmd: "'bsub -P {resources.proj} -nnodes {resources.nodes} -W {resources.walltime}'"
set-resources:
    compile:  # define resources for the picongpu compile
        proj: "csc999"  # change to your project!
        walltime: 120
        nodes: 1
Start the workflow with
snakemake --profile .
Note
Recently, an LSF executor plugin has been developed, which has not yet been tested with the PIConGPU workflow. If you have access to an LSF cluster, give it a try.
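If you do try it, the switch would presumably only require installing the plugin and changing the executor in the profile; this is an untested sketch, and the exact plugin and executor names should be checked in the plugin catalog:

# Untested sketch, assuming the LSF plugin registers itself as the "lsf" executor:
#   pip install snakemake-executor-plugin-lsf
executor: lsf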