Automated Parameter Scans using Snakemake

Snakemake is a Python-based workflow engine that can be used to automate compiling, running, and post-processing of PIConGPU simulations, or any other workflow that can be represented as a directed acyclic graph (DAG).

Each workflow consists of a Snakefile in which the workflow is defined using rules. Each rule represents a certain task. Dependencies between rules are defined by input and output files. A rule can execute a shell command, inline Python code, or an external Python script (Rust, R, Julia, and Jupyter notebooks are also supported).
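
As a minimal sketch (file names and the command are placeholders, unrelated to PIConGPU), a rule ties an input file to an output file and a command that produces it:

rule count_lines:
    input:
        "data/run_001.txt"
    output:
        "results/run_001_lines.txt"
    shell:
        "wc -l {input} > {output}"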

How to use

In picongpu/share/picongpu/examples/LaserWakefield/lib/python/snakemake/ in the PIConGPU source code you can find:

  • Snakefile

  • config.yaml

  • requirements.txt

  • params.csv

With these files, a parameter scan with the LaserWakefield example of PIConGPU can be performed on hemera. To do so:

  1. Copy the Snakefile and config.yaml.

  2. Set up an environment using the requirements.txt. Make sure you have snakemake and the snakemake-executor-plugin-slurm installed and activated.

  3. Adjust the profile config.yaml:

    • Define your input parameters in a csv file. For the LaserWakefield example, this can look like this:

    LASERA0,PULSEDURATION
    4.0,1.5e-14
    3.0,2.5e-14
    

    Warning

    Snakemake will automatically perform a parameter-dependent compile using CMake flags if and only if the parameter names in the header of the csv file match those in the .param file of the PIConGPU project.

    • Specify the path of your PIConGPU project, i.e. the directory where pic-create will be executed.

    • Specify the path to your PIConGPU profile and the name of your cfg file.

    • Optional: Adjust resources and other workflow parameters (see Fine-Tuning).

  4. Start the workflow in the directory where the Snakefile and config.yaml are located via

snakemake --profile .

Note

You may want to start your Snakemake workflow in a screen session.
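
For example (the session name is arbitrary):

screen -S paramscan
snakemake --profile .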

Fine-Tuning

There are several command line options you can use to customise the behaviour of your workflow. An overview can be found in the Snakemake documentation or by using snakemake --help. Here are some recommendations; a combined profile sketch follows the list:

  • --jobs N, -j N
    • Use a maximum of N jobs in parallel. Set to unlimited to allow any number of jobs.

  • --groups:
    • By default, each rule/task is run in a single (cluster) job. To run multiple tasks in one job, define groups in the Snakefile or config.yaml, which only works if the grouped tasks are connected in the DAG.

    • In this example, the compile rule is placed in the “compile” group, so it is possible to run multiple compile processes in a single Slurm job.

  • --group-components
    • Indicates how many tasks in a group will be executed in a cluster job.

    • In this example, group-components: "compile=2" defines that two compile processes are run in one Slurm job.

    • This is particularly useful for smaller rules such as python post-processing, where it would be easy to have hundreds of small fast cluster jobs if no grouping took place.

  • --dry-run, -n
    • Does not execute anything.

    • Useful for checking that the workflow is set up correctly and that only the desired rules are executed.

    • This is important to ensure that data that has already been written is not erased, because snakemake will re-run jobs if code or input has changed, and will erase the output of the rule before doing so. (In short, if you decide to change a path or some code in the Snakefile, you might re-run expensive simulations).

    • To prevent simulations from being repeated for the wrong reasons, use:

  • --rerun-triggers {code,input,mtime,params,software-env}
    • Define what triggers the rerunning of a job. By default, all triggers are used, which guarantees that results are consistent with the workflow code and configuration.

  • --retries N
    • Retries a failed rule N times.

    • Can be defined for each rule individually.

    • Also useful if a cluster has a limited walltime and the PIConGPU flag --try.restart is used: since Snakemake resubmits the submit.start script, the simulation will restart from the last available checkpoint.

  • --latency-wait SECONDS
    • Wait the given number of SECONDS if an output file of a job is not present after the job has finished. This helps if your filesystem suffers from latency (default: 5).
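
Most of these options can also be set persistently in the profile config.yaml instead of on the command line. A possible sketch combining the options above; all values are examples, and the executor line assumes the Slurm executor plugin from the setup section:

executor: slurm
jobs: 20
group-components: "compile=2"
retries: 2
latency-wait: 60
rerun-triggers: mtime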

Resulting file structure

The output produced by the workflow is stored in three directories next to the Snakefile.

  • “simulations”
    • Contains simulation directories.

    • The name of the simulation directory is sim_{paramspace.wildcard_pattern}, where paramspace.wildcard_pattern becomes, for example, LASERA0-4.0_PULSEDURATION-1.5e-14.

  • “simulated”
    • Contains txt files indicating whether a simulation has already run, together with the job ID of the simulation on the cluster.

  • “projects”
    • Contains the input directories of the simulations.

If you want to change the file structure, you need to change that in the Snakefile. Be aware that paths defined in your Snakefile are always relative to the location of the Snakefile.
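
For the first parameter set of the example params.csv, the layout next to the Snakefile might look roughly like the sketch below; the project directory name and the selection of files shown are illustrative, while the simulation directory and finished file names follow the patterns described above:

.
├── Snakefile
├── config.yaml
├── projects/
│   └── project_LASERA0-4.0_PULSEDURATION-1.5e-14/   # input directory created by pic-create
├── simulations/
│   └── sim_LASERA0-4.0_PULSEDURATION-1.5e-14/        # tbg simulation directory
│       └── simOutput/
└── simulated/
    └── finished_LASERA0-4.0_PULSEDURATION-1.5e-14.txt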

What it does

The workflow takes input parameters, performs a parameter-dependent compile, and submits the simulation to the cluster. These steps are defined as so-called rules in the Snakefile. The order in which the rules are executed is determined by the inputs and outputs of the rules. This means that a rule is only executed if its output is needed as input by another rule.

Details of the individual rules:

  • rule all:
    • Is the so-called target rule. By default, Snakemake will only execute the very first rule specified in the Snakefile. Therefore this pseudo-rule should contain all the anticipated output as its input. Snakemake will then try to generate this input.

  • rule build_command:
    • Is a helper rule that generates a string that is later used by the pic-build command and contains the information about the CMake flags (a rough shell sketch of the resulting build step follows this list).

  • rule compile:
    • Clones the PIConGPU project defined in the config.yaml using pic-create.

    • Since Snakemake relies on files to check dependencies between tasks, and a simulation has no predefined unique output file, the tpl file is modified such that it creates a unique output file called finished_{params.name}.txt when the simulation is finished.

    • Compiles for each parameter set and then creates a simulation directory.

  • rule simulate:
    • To use the tbg interface, the rule simulate is a local rule.

    • The output file ("simulated/finished_{paramspace.wildcard_pattern}.txt") is only created once the simulation has finished, whereas the submitting shell script would be done immediately after submitting the simulation. If the task is done and the output file has not been created, an error occurs and the workflow fails. To make Snakemake wait until the simulation is finished, the status of the Slurm job is checked every two minutes.

    • This control loop is set up in such a way that, even if the Snakemake session is aborted or fails, it will catch up with simulations that are already running when Snakemake is restarted.

    Warning

    The simulate rule looks for "100 % =" in stdout. If the number of time steps and the output percentage do not match, such a line will never be written (e.g. 1024 time steps with output every 5 % will not produce a "100 % =" line).
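
As a rough illustration of the parameter-dependent compile performed by build_command and compile together, the shell part boils down to something like the sketch below. The source path, the project directory name, and in particular the exact CMake define names and flag syntax are assumptions here; the authoritative version is the compile rule in the example Snakefile.

# create the input directory for one parameter set (paths are illustrative)
pic-create $PICSRC/share/picongpu/examples/LaserWakefield projects/project_LASERA0-4.0_PULSEDURATION-1.5e-14
cd projects/project_LASERA0-4.0_PULSEDURATION-1.5e-14

# compile, passing the values from params.csv as CMake overwrites
# (assumed flag syntax, generated by rule build_command)
pic-build -c "-DPARAM_OVERWRITES:LIST='-DLASERA0=4.0;-DPULSEDURATION=1.5e-14'"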


Using the example Snakefile and params.csv, the resulting DAG looks like this.

[Figure: dag.png — directed acyclic graph of the example workflow]

Python post-processing

The script directive

You can automatically post-process your results by adding new rules to the Snakefile. Here is an example of what this might look like for a Python script called post_processing.py:

rule post_processing:
    input:
        rules.simulate.output
    output:
        f"results/post_processing_{paramspace.wildcard_pattern}.png"
    params:
        sim_dir=f"simulations/sim_{paramspace.wildcard_pattern}/simOutput/openPMD", # simulation directory
        sim_params=paramspace.instance, # dictionary of parameters to generate this simulation
        generic_parameter = 1000
    script:
        "post_processing.py"

The given script will be run by Snakemake in a special way that puts a snakemake object into the script's global namespace. This object contains useful context information for the running script. For example, the parameter set of the rule in the Snakefile is stored in the params member and can be accessed in your Python script via a list- or dictionary-like interface. So, accessing the sim_dir parameter could be done via snakemake.params[0], snakemake.params['sim_dir'], or snakemake.params.sim_dir. One can use snakemake.input or snakemake.output accordingly. More details can be found in Snakemake's documentation.

To run your new rule, you can either specify the desired output explicitly on the command line, e.g. snakemake "results/post_processing_<…>.png" …, or alter the default rule:

rule all:
    input: expand("results/post_processing_{params}.png", params=paramspace.instance_patterns)

Of course you can have as many rules as you want after the simulation, just make sure that Snakemake can build a rule graph by going from the output of one rule to the input of the next rule, ending at the input of the target rule all.

Note

Note the expand() function in the all rule. It can be used to declare that all instances of the parameter space are meant. Further information can be found in the Snakemake documentation.
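
As a quick illustration, using the two parameter sets from the example params.csv, expand() builds one path per parameter instance:

# the two instances from the example params.csv
expand("results/post_processing_{params}.png",
       params=["LASERA0-4.0_PULSEDURATION-1.5e-14", "LASERA0-3.0_PULSEDURATION-2.5e-14"])
# -> ["results/post_processing_LASERA0-4.0_PULSEDURATION-1.5e-14.png",
#     "results/post_processing_LASERA0-3.0_PULSEDURATION-2.5e-14.png"]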

Recommendations on how to structure scripts for Snakemake

For effective use with Snakemake, your scripts should parametrise aspects of the execution that Snakemake is supposed to organise. Most importantly, these are the input and output filenames but could also be other parameters as seen above. This can be facilitated by putting all your functional code into a def main(input_filename, output_filename, **further_parameters) function. The only “free” code in your script should handle the parameter extraction from the snakemake object and call main(…) with the pertinent values.

import sys

def main(input_filename, output_filename, **further_parameters):
    # Put your post-processing here.
    # Take the data from the input_filename(s).
    # Save the results to output_filename(s).
    # Feel free to define further functions and use them in here.
    pass

if __name__ == "__main__":

    # Handle parameter extraction.
    try:
        # If we're running from within Snakemake,
        # there is a `snakemake` object in the global namespace
        # that we can get our parameters from.

        input_filename = snakemake.input[0]
        output_filename = snakemake.output[0]
        further_parameters = {}  # e.g. fill from snakemake.params
        # ...

    except NameError:
        # If we got this error,
        # likely there was no `snakemake` object in the namespace.
        # We need to do something else to get our parameters:

        input_filename = sys.argv[1]  # use commandline arguments
        output_filename = sys.argv[2]
        further_parameters = {}
        # ...

        # or something more elaborate like argparse, etc.

    # Start the post-processing independent of how we extracted the parameters.
    main(input_filename, output_filename, **further_parameters)

The above code snippet defines a main() function where you can put your post-processing code. The free code of the script is guarded by an if __name__ == "__main__" clause (see the Python documentation for an explanation). It consists of two parts: extracting the parameters and calling the main(…) function.

The snippet uses a try: … except: … clause to guard against the case where we are not actually running from within Snakemake. The suggested alternative takes arguments from the commandline but other things like raising an Exception or using defaults would work. Having this fallback mechanism comes in handy for debugging and manual testing because we don’t need to fire up Snakemake whenever we want to test something.
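
With this fallback in place, the script can be tested by hand without starting Snakemake, for instance (paths are illustrative):

python post_processing.py simulations/sim_LASERA0-4.0_PULSEDURATION-1.5e-14/simOutput/openPMD results/test.png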

Cluster execution

To perform this evaluation on the cluster, add the required resource to the “config.yaml”. For example, like this:

set-resources:
  post_processing: # resources for post processing
    slurm_partition: "defq"
    runtime: 20
    nodes: 1
    ntasks: 1
    mem_mb: 5000

Running on a generic cluster

If you want to run on a cluster other than hemera that does not use the Slurm scheduler, check the Snakemake plugin catalog to see whether there is an executor plugin for your batch system. If there is none, you can use the generic cluster execution.

Warning

In any case, the Snakefile must be adapted to the specific cluster.

The "Snakefile_LSF" is an example for running on an LSF cluster (e.g. Summit) using the generic cluster executor.

To use it:
  • Install the snakemake-executor-plugin-cluster-generic plugin.

  • Adapt the executor and add the submit command in the config.yaml:

executor: cluster-generic
cluster-generic-submit-cmd: "'bsub -P {resources.proj} -nnodes {resources.nodes} -W {resources.walltime}'"
set-resources:
  compile: # define resources for picongpu compile
    proj: "csc999" # change to your project!
    walltime: 120
    nodes: 1
  • Start workflow with

snakemake --profile .

Note

Recently, an LSF executor plugin has been developed, which has not yet been tested with the PIConGPU workflow. If you have access to an LSF cluster, give it a try.