A profile is a folder that contains all the configuration parameters needed to successfully run your pipeline. Of note, if you have used a cluster.json file before, be aware that it has been deprecated in favour of profiles.
Run the following script to create the folder structure:
#!/usr/bin/bash
# Create the folder containing the files needed for this tutorial
mkdir snakemake-profile-demo
# Enter the created folder
cd snakemake-profile-demo
# Create an empty file containing the snakemake code
touch snakeFile
# Create toy input files
mkdir inputs
echo "toto" > inputs/hello.txt
echo "totoBis" > inputs/helloBis.txt
# Create an empty folder to create a conda environment
# This is done to make sure that you use the same snakemake version as I do
mkdir envs
touch envs/environment.yaml
Copy the following content to snakeFile:
rule all:
    input:
        expand("results/{sampleName}.txt", sampleName=["hello", "helloBis"])

rule printContent:
    input:
        "inputs/{sampleName}.txt"
    output:
        "results/{sampleName}.txt"
    shell:
        """
        cat {input} > {output}
        """
Copy the following content to environment.yaml:
channels:
- bioconda
dependencies:
- snakemake-minimal=6.15.1
Create and activate the conda environment:
#!/usr/bin/bash
conda env create -p envs/smake --file envs/environment.yaml
conda activate envs/smake
Test the pipeline:
#!/usr/bin/bash
snakemake --snakefile snakeFile --cores=1
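To make explicit what these two rules do, here is a plain-Python sketch of the same logic run in a temporary directory (snakemake itself is not involved; run_toy_pipeline is a hypothetical helper named for this illustration):

```python
import tempfile
from pathlib import Path

def run_toy_pipeline(samples):
    """Mimic the toy pipeline in plain Python: write each input file,
    then copy inputs/<sample>.txt verbatim to results/<sample>.txt,
    which is exactly what `cat {input} > {output}` does."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        inputs, results = root / "inputs", root / "results"
        inputs.mkdir()
        results.mkdir()
        for sample, text in samples.items():
            (inputs / f"{sample}.txt").write_text(text + "\n")
        produced = {}
        for sample in samples:  # one job per sampleName wildcard value
            content = (inputs / f"{sample}.txt").read_text()
            (results / f"{sample}.txt").write_text(content)
            produced[sample] = content
        return produced

print(run_toy_pipeline({"hello": "toto", "helloBis": "totoBis"}))
```

Of course, the whole point of snakemake is that it figures out this job graph, parallelism, and re-runs for you instead of you scripting them by hand.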
In this section, I am going to detail the process of profile creation. The setup will grow progressively in complexity, and we will need to add rules to the snakeFile along the way. First, create a config.yaml in a profile folder:
#!/usr/bin/bash
# Create the folder containing the configuration file, it can be named differently
mkdir profile
# Create a config.yaml that will contain all the configuration parameters
touch profile/config.yaml
The first thing we are going to do is define some general snakemake parameters. To get a complete list of them, try snakemake --help. The choice of parameters is subjective and depends on what you want to achieve; however, I find the ones below pretty useful on a daily basis. Let’s start with the parameters that we have already used. Add the following content to profile/config.yaml:
---
snakefile: snakeFile
cores: 1
The --- at the beginning of the file indicates the start of a YAML document. It is not mandatory in our case; using it is just a convention. Now run snakemake after deleting the results folder:
#!/usr/bin/bash
# Delete the results/ folder if present
rm -r results/
# Run snakemake with a dry run mode (option -n)
snakemake --profile profile/ -n
A dry run means that the snakemake pipeline will be evaluated but that no files will be produced. You should obtain:
Building DAG of jobs...
Job stats:
job count min threads max threads
------------ ------- ------------- -------------
all 1 1 1
printContent 2 1 1
total 3 1 1
[Fri Mar 4 08:44:12 2022]
rule printContent:
input: inputs/helloBis.txt
output: results/helloBis.txt
jobid: 2
wildcards: sampleName=helloBis
resources: tmpdir=/tmp
[Fri Mar 4 08:44:12 2022]
rule printContent:
input: inputs/hello.txt
output: results/hello.txt
jobid: 1
wildcards: sampleName=hello
resources: tmpdir=/tmp
[Fri Mar 4 08:44:12 2022]
localrule all:
input: results/hello.txt, results/helloBis.txt
jobid: 0
resources: tmpdir=/tmp
Job stats:
job count min threads max threads
------------ ------- ------------- -------------
all 1 1 1
printContent 2 1 1
total 3 1 1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
As you can see, we reduced the snakemake call from snakemake --snakefile snakeFile --cores=1 to snakemake --profile profile/. The profile therefore enables the definition of all the snakemake options (and more).
Let’s now add more options to profile/config.yaml:
---
snakefile: snakeFile
cores: 1
latency-wait: 60
reason: True
show-failed-logs: True
keep-going: True
printshellcmds: True
rerun-incomplete: True
restart-times: 3
latency-wait is useful because your system can sometimes be “slower” than snakemake: even if an output file has been created, snakemake might not see it yet. The default value is 5 seconds; I usually set it to 60.
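Conceptually, latency-wait amounts to polling for the expected output until it appears or a timeout elapses. The sketch below illustrates that idea (wait_for_output is a hypothetical helper made up for this post, not snakemake internals):

```python
import os
import time

def wait_for_output(path, latency_wait=60, poll=0.5):
    """Keep checking for `path` until it appears or `latency_wait`
    seconds have elapsed -- conceptually what --latency-wait does
    for a job's declared output files."""
    deadline = time.monotonic() + latency_wait
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            return False  # snakemake would mark the job as failed here
        time.sleep(poll)
    return True
```

On a networked filesystem, the file may exist on the compute node long before it becomes visible where snakemake runs, which is why a generous timeout helps.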
If a job fails, for whatever reason, it is possible to ask snakemake to retry it by setting restart-times to the number of attempts you allow. A job is triggered either because the file it produces does not exist yet or because a file it depends on (i.e. an input file) was updated more recently than its output. Indeed, the point of using snakemake is to write pipelines: you design a series of jobs that depend on one another.
You can see the reason why a job is triggered by setting reason to True. show-failed-logs will display the logs of failed jobs. keep-going tells snakemake to continue with independent jobs if one fails; in other words, snakemake will run as many rules as it can before terminating the pipeline. printshellcmds will print the code that you wrote in the shell section of your rules. Finally, with experience, you will notice that even if you define the resources needed for each job well (covered in the next post), the process can be prone to hiccups. By setting rerun-incomplete and restart-times, you minimize the chances of your pipeline failing even if it is well coded.
Now replace the content of profile/config.yaml with the code above and perform a dry run:
#!/usr/bin/bash
# Run snakemake with a dry run mode (option -n)
snakemake --profile profile/ -n
You can see below that the cat instruction now appears in your terminal, with the sampleName wildcard replaced by the actual values:
Building DAG of jobs...
Job stats:
job count min threads max threads
------------ ------- ------------- -------------
all 1 1 1
printContent 2 1 1
total 3 1 1
[Fri Mar 4 09:34:13 2022]
rule printContent:
input: inputs/helloBis.txt
output: results/helloBis.txt
jobid: 2
reason: Missing output files: results/helloBis.txt
wildcards: sampleName=helloBis
resources: tmpdir=/tmp
cat inputs/helloBis.txt > results/helloBis.txt
[Fri Mar 4 09:34:13 2022]
rule printContent:
input: inputs/hello.txt
output: results/hello.txt
jobid: 1
reason: Missing output files: results/hello.txt
wildcards: sampleName=hello
resources: tmpdir=/tmp
cat inputs/hello.txt > results/hello.txt
[Fri Mar 4 09:34:13 2022]
localrule all:
input: results/hello.txt, results/helloBis.txt
jobid: 0
reason: Input files updated by another job: results/helloBis.txt, results/hello.txt
resources: tmpdir=/tmp
Job stats:
job count min threads max threads
------------ ------- ------------- -------------
all 1 1 1
printContent 2 1 1
total 3 1 1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
Overall, we reduced the snakemake command from snakemake --snakefile snakeFile --cores 1 --latency-wait 60 --restart-times 3 --rerun-incomplete --reason --show-failed-logs --keep-going --printshellcmds to the much shorter snakemake --profile profile/.
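The mapping between the profile and the long command line is mechanical: each key: value pair in config.yaml corresponds to a --key value option, and a boolean True becomes a bare flag. The sketch below illustrates that correspondence (profile_to_cli is an illustrative helper, not snakemake's actual option-parsing code):

```python
def profile_to_cli(config):
    """Illustrate how profile keys map back to CLI options:
    'latency-wait: 60' corresponds to '--latency-wait 60', and a
    True boolean becomes a bare flag such as '--keep-going'."""
    args = ["snakemake"]
    for key, value in config.items():
        if value is True:
            args.append(f"--{key}")
        else:
            args.extend([f"--{key}", str(value)])
    return " ".join(args)

cfg = {"snakefile": "snakeFile", "cores": 1, "latency-wait": 60, "keep-going": True}
print(profile_to_cli(cfg))
# → snakemake --snakefile snakeFile --cores 1 --latency-wait 60 --keep-going
```

This is why the profile can stand in for "all the snakemake options (and more)": it is simply a declarative home for the same flags.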
Next week, we will see how to submit your jobs to a cluster. Stay tuned! (Next post)