Snakemake profile – 3: Cluster submission – Defining parameters

[Image: Male profile by Lucian Freud]

The power of snakemake lies in parallelization: on a cluster, jobs can be processed in parallel automatically, but you need to define how snakemake will handle the job submission process. If you already followed the first two posts (1, 2), you can skip the first section.

Preparation of files

For more details about the steps described in this section, see the previous posts. Run the following script to create the folder structure:

#!/usr/bin/bash
# Create the folder containing the files needed for this tutorial
mkdir snakemake-profile-demo
# Enter the created folder
cd snakemake-profile-demo
# Create an empty file that will contain the snakemake code
touch snakeFile
# Create toy input files
mkdir inputs
echo "toto" > inputs/hello.txt
echo "totoBis" > inputs/helloBis.txt
# Create the folder that will contain the configuration file; it can be named differently
mkdir profile
# Create a config.yaml that will contain all the configuration parameters
touch profile/config.yaml
# Create an empty folder to create a conda environment
# This is done to make sure that you use the same snakemake version as I do
mkdir envs
touch envs/environment.yaml
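
Before moving on, you can optionally verify the layout (this assumes you are still inside snakemake-profile-demo):

#!/usr/bin/bash
# Optional check: list the folder structure created above, two levels deep
find . -maxdepth 2 | sort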

Copy the following content to snakeFile:

rule all:
  input:
    expand("results/{sampleName}.txt", sampleName=["hello", "helloBis"])
rule printContent:
  input:
    "inputs/{sampleName}.txt"
  output:
    "results/{sampleName}.txt"
  shell:
    """
    cat {input} > {output}
    """

Copy the following content to environment.yaml:

channels:
  - bioconda
dependencies:
  - snakemake-minimal=6.15.1

Copy the following content to profile/config.yaml:

---
snakefile: snakeFile
cores: 1
latency-wait: 60
reason: True
show-failed-logs: True
keep-going: True
printshellcmds: True
rerun-incomplete: True
restart-times: 3

Create and activate the conda environment:

#!/usr/bin/bash
conda env create -p envs/smake --file envs/environment.yaml
conda activate envs/smake
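
To make sure the environment is active and that you run the pinned version, you can optionally check:

#!/usr/bin/bash
# Should print 6.15.1 if the environment created above is active
snakemake --version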

Defining parameters

Add the cluster submission section at the bottom of profile/config.yaml:

---
snakefile: snakeFile
cores: 1
latency-wait: 60
reason: True
show-failed-logs: True
keep-going: True
printshellcmds: True
rerun-incomplete: True
restart-times: 3
# Cluster submission
jobname: "{rule}.{jobid}"              # Provide a custom name for the jobscript that is submitted to the cluster.
max-jobs-per-second: 1                 # Maximal number of cluster/drmaa jobs per second; default is 10, fractions allowed.
max-status-checks-per-second: 10       # Maximal number of job status checks per second; default is 10.
jobs: 400                              # Use at most N CPU cluster/cloud jobs in parallel.

jobname has the default value “snakejob.{name}.{jobid}.sh”; I made it shorter in the code above. One last thing remains: defining how jobs are submitted to the cluster. This is system-specific, and the choice of options is partly a matter of preference.


In this section, I will show how to define the options on a slurm system; please adapt the code to your own scheduler. For a complete list of options, check sbatch --help. A minimal setup would consist of:

cluster: "sbatch --output=\"jobs/{rule}/slurm_%x_%j.out\" --error=\"jobs/{rule}/slurm_%x_%j.log\""

This instruction tells slurm to write the job's console output to a file such as “jobs/printContent/slurm_printContent.1_355014.out” and the potential errors to “jobs/printContent/slurm_printContent.1_355014.log”. The {rule} wildcard has been replaced by printContent; %x is a slurm variable corresponding to the job name (which we defined as “{rule}.{jobid}”); and %j is a slurm variable corresponding to the job number attributed by the cluster. Add this line to profile/config.yaml:

---
snakefile: snakeFile
cores: 1
latency-wait: 60
reason: True
show-failed-logs: True
keep-going: True
printshellcmds: True
rerun-incomplete: True
restart-times: 3
# Cluster submission
jobname: "{rule}.{jobid}"              # Provide a custom name for the jobscript that is submitted to the cluster.
max-jobs-per-second: 1                 # Maximal number of cluster/drmaa jobs per second; default is 10, fractions allowed.
max-status-checks-per-second: 10       # Maximal number of job status checks per second; default is 10.
jobs: 400                              # Use at most N CPU cluster/cloud jobs in parallel.
cluster: "sbatch --output=\"jobs/{rule}/slurm_%x_%j.out\" --error=\"jobs/{rule}/slurm_%x_%j.log\""

Slurm will not create missing output folders, so we need to create the jobs/{rule} folders when snakemake runs. We can use an onstart section in snakeFile that will trigger instructions when the pipeline is loaded:

onstart:
    print("##### Creating profile pipeline #####\n") 
    print("\t Creating jobs output subfolders...\n")
    shell("mkdir -p jobs/printContent")
rule all:
  input:
    expand("results/{sampleName}.txt", sampleName=["hello", "helloBis"])
rule printContent:
  input:
    "inputs/{sampleName}.txt"
  output:
    "results/{sampleName}.txt"
  shell:
    """
    cat {input} > {output}
    """

First, perform a dry run to verify that everything works, and then run the pipeline itself:

#!/usr/bin/bash
# If you did not already, activate the environment
conda activate envs/smake
# Perform dry run
snakemake --profile profile/ -n
# Run the pipeline
snakemake --profile profile/

Verify that the two printContent jobs are indeed running on your cluster; on slurm, try squeue -i10 --user myusername. You will also notice messages in your console such as Submitted job 1 with external jobid 'Submitted batch job 35248057'.
Now verify that the files were created in the jobs folder:

#!/usr/bin/bash
ls jobs/printContent/*
more jobs/printContent/*

As the command of the rule printContent writes its result directly to the output file and prints nothing to standard output, you should get empty slurm_printContent.[1-2]_[0-9]+.out files. The .log files capture standard error, which is where snakemake writes its log, so they should contain what was printed on your console during the run:

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Select jobs to execute...
[Fri Mar  4 14:37:34 2022]
rule printContent:
    input: inputs/hello.txt
    output: results/hello.txt
    jobid: 0
    wildcards: sampleName=hello
    resources: mem_mb=1000, disk_mb=1000, tmpdir=/scratch/jobs/35248057
    cat inputs/hello.txt > results/hello.txt
[Fri Mar  4 14:37:35 2022]
Finished job 0.
1 of 1 steps (100%) done

Above you can see that two new pieces of information appear under resources: mem_mb and disk_mb. These specify the amount of RAM and disk space (in megabytes) allotted to the job; the value of 1000 is snakemake's default.
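
As a quick, hypothetical illustration (defining these values properly through the profile is the topic of the next post), the defaults can also be overridden on the command line with snakemake's --default-resources option:

#!/usr/bin/bash
# Hypothetical example: raise the default RAM and disk values for all jobs,
# combined with a dry run so nothing is actually submitted
snakemake --profile profile/ --default-resources mem_mb=2000 disk_mb=2000 -n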

Next week, we will see how to define these resources in profile/config.yaml. Stay tuned!
