Snakemake¶
A workflow manager like Snakemake is the next step after simple bash scripts towards fully scalable and reproducible analyses. Another popular workflow manager for genomics is Nextflow, and Wratten et al. 2021 reviews the current state of the field. You can see how many workflow managers are out there in general with this list (over 300 at the time of this writing).
In BSPC, we prefer Snakemake over other workflow managers because:
- Its simple combination of bash and Python makes it straightforward to convert individual tools’ command-line calls into working Snakemake code
- Any valid Python is valid Snakemake, and we write a lot of Python, so we can leverage that knowledge in our Snakemake files
- It affords the most flexibility and customization of all the workflow managers we’ve tried. This is important to us as a bioinformatics core working on a wide variety of projects that often need customization
- Snakemake integrates very well with conda and high-performance compute clusters, which we also use a lot
Snakemake is a superset of Python that manages workflows using “rules”. Each rule declares its input and output files and the commands that convert inputs to outputs, and Snakemake infers the dependencies between rules. It can be run locally or on a cluster with no code changes. In the context of genomics workflows, if a new sample is added, Snakemake will figure out just the rules that need to be run on that single sample, leaving everything else untouched. It also trivially parallelizes the tasks it runs, dramatically speeding up workflow runs.
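For instance, here is a minimal sketch of a single rule (the filenames are hypothetical):

rule gzip:
    input:
        "data.txt"
    output:
        "data.txt.gz"
    shell:
        "gzip -c {input} > {output}"

When data.txt.gz is requested and is missing (or older than data.txt), Snakemake runs the shell command to create it.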
Learning Snakemake¶
The following resources are ordered from most basic to more advanced.
- Introductory snakemake slides give a very brief overview of what Snakemake is about.
- Live demo video given by the author.
- The official short tutorial gives a quick overview.
- The official Snakemake tutorial will give you a more in-depth understanding.
- Additional tutorial from Titus Brown on Snakemake
Supporting information, tips and tricks¶
The rest of this page offers some suggestions, clarifications, and tips. The goal is to add some additional information on topics that we typically see new users struggling with.
Thinking like Snakemake¶
When teaching Snakemake, it’s often useful to “think like Snakemake” to understand how all the moving parts fit together. For this exercise, let’s create an example Snakefile.
samples = ["A", "B", "C"]
rule all:
input:
"report.txt"
rule sort:
input:
"{sample}.txt"
output:
"{sample}_sorted.txt"
shell:
"sort {input} > {output}"
rule report:
input:
expand("{sample}_sorted.txt", sample=samples),
output:
"report.txt"
shell:
"cat /dev/null > {output}; "
"for i in {input}; do head -n1 {input} >> {output}"
This file sorts each text file (the sort rule: 1 input -> 1 output) and then adds the first line of each sorted file to the report (the report rule: a many-inputs -> 1-output, or “aggregation”, rule).
1. Run the first rule encountered. This is the all rule.
2. Check the input. Here, it’s report.txt. Does that file exist? If yes, then we’re done. If no, then we need to figure out how to make it. For the purposes of this exercise, assume it doesn’t exist yet.
3. Check the rest of the rules. Are there any rules whose output files match the pattern “report.txt”?
4. Yes, the report rule’s output matches. It’s actually an exact match, and there are no wildcards to worry about.
5. OK, so we at least have to run the report rule. Let’s check its input.
6. The expand() function just returns a list. So the input for the report rule is ["A_sorted.txt", "B_sorted.txt", "C_sorted.txt"].
7. Let’s take those one at a time. The first input is A_sorted.txt. Does the file already exist? Let’s assume no.
8. Check the rest of the rules. Are there any rules whose output files match the pattern “A_sorted.txt”?
9. Yes, the sort rule’s output matches if we set the {sample} wildcard to be A. (Snakemake does this wildcard matching internally; sometimes it gets things wrong, so see the “wildcard constraints” section below for some hints if that happens.)
10. OK, so we have to run the sort rule with the {sample} wildcard set to A. So let’s check its input. What’s the input of the sort rule, if we fill in A as the value of the {sample} wildcard?
11. Assuming {sample} is A, we need A.txt. Does the file already exist? Here we’ll assume that yes, it does exist, because there are no other rules. Otherwise, the Snakefile could not run; it would complain about a missing input file A.txt because it can’t find any rules that could make it, and we’d either have to make a rule to create it, or make sure our starting input data exists.
12. That took care of the first input for the report rule. On to the next one!
13. Does B_sorted.txt exist? Let’s assume that it doesn’t exist, just like A_sorted.txt. We’ll go through the same process we did for A_sorted.txt. That is: check if any rule’s output matches the pattern of this file; if so, create value(s) for the wildcard(s) (here, sample="B"); fill in the input with the wildcard value; check to see if that input exists; if yes, there’s nothing else to do with this “chain” of rules to run; if no, then we need to figure out what rule will make it. Again, here we’re assuming that the starting file B.txt does exist.
14. OK, back to the report rule, and that last input file, C_sorted.txt. Let’s now imagine it does exist (unlike A_sorted.txt and B_sorted.txt).
15. If the file exists, then check to see if it is older than its input. If it’s older, then we’ll need to remake it. If it’s newer, it’s up to date and there’s nothing more we need to do with C_sorted.txt.
This is the general process: check whether a file exists; if not, find a rule whose output pattern matches it, fill in the wildcards, and check that rule’s input in turn. Keeping this in mind can help when debugging.
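A convenient way to watch this reasoning without actually running anything is a dry run, which prints the jobs Snakemake has decided it needs to execute:

snakemake --dry-run report.txt

(--dry-run can be shortened to -n; adding -p also prints the shell commands that would be run.)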
Be careful using for-loops in rules¶
When initially learning Snakemake, you may be inclined to use expand() to get a list of files, and then iterate over them in a for-loop within the rule. For example, in a shell: block it might look something like:
samples = ["A", "B", "C"]
# Don't do this.
rule example1:
input:
expand("{sample}.txt", sample=samples)
output:
expand("{sample}.txt.sorted", sample=samples)
shell:
"for i in {input}; do "
"sort $i > $i.sorted; "
"done"
or for a run:
block it might look something like:
samples = ["A", "B", "C"]
# Don't do this either.
rule example2:
input:
expand("{sample}.txt", sample=samples)
output:
expand("{sample}.txt.sorted", sample=samples)
run:
for fin, fout in zip(input, output):
shell("sort {fin} > {fout}")
The reasoning for writing a rule like this is typically something like, “I want to do this thing many times, and in R/Python/Bash I use a for-loop to do something many times, so let’s figure out how to make a for-loop work in Snakemake”.
However, in general, you don’t want to use a for-loop in a rule if the number of inputs equals the number of outputs. Let’s explain why.
Each of these rules, example1 and example2, will only run once.
Because of the for-loop, the sorting of sample B won’t run until sample A has finished sorting. That is, rather than running in parallel, this runs in serial. So even if we had 3 CPUs, one to sort each of A, B, and C, the way this is written everything runs on a single CPU, taking 3x the time it would if running in parallel.
Furthermore, if we add a new sample D, it will force A, B, and C to run again, even if they were up to date, wasting time and resources.
Here’s what it should look like:
# Do it this way instead.
samples = ["A", "B", "C"]

rule all:
    input: expand("{sample}.txt.sorted", sample=samples)

rule example3:
    input:
        "{sample}.txt"
    output:
        "{sample}.txt.sorted"
    shell:
        "sort {input} > {output}"
This last rule will run once for each sample, in parallel. Snakemake will kick off one job for sample A, another for sample B, and a third for sample C. If multiple cores have been provided on the command line with --cores/-j, then these three jobs will run in parallel.
In general, a rule will run in parallel as many times as there are unique combinations of output files.
If we look at rules example1 and example2 above, how many unique combinations of output are there? The answer is one, because the expand() fills in the {sample} wildcard, giving a single list of the three filenames.
In contrast, for the example3 rule there are 3 unique sets of output. The rule will run one time where the single output is A.txt.sorted (where sample=A), another time where the single output file is B.txt.sorted (sample=B), and a third time for C.txt.sorted (sample=C). Snakemake can run each of these three jobs in parallel, one per CPU, as long as there are enough CPUs to do so (most modern machines have multiple cores).
How about this example: how many times will the fastqc rule run?
samples = ["A", "B", 'C"]
read_numbers = [1, 2]
rule all:
input:
expand("{sample}_R{N}_fastqc.html", sample=samples, N=read_numbers)
rule fastqc:
input:
"{sample}_R{n}.fastqc.gz"
output:
"{sample}_R{n}_fastqc.html"
wrapper:
"v1.25.0/bio/fastqc"
There are two different wildcards in the output, {sample} and {n}. There are 3 values for sample (A, B, C) and 2 values for n (1, 2), so it will run 6 times in parallel.
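To see where the six jobs come from, remember that expand() just builds a list, so you can evaluate it by hand (for example in a Python session, after from snakemake.io import expand):

expand("{sample}_R{N}_fastqc.html", sample=["A", "B", "C"], N=[1, 2])
# returns (element order may vary by Snakemake version):
# ['A_R1_fastqc.html', 'A_R2_fastqc.html',
#  'B_R1_fastqc.html', 'B_R2_fastqc.html',
#  'C_R1_fastqc.html', 'C_R2_fastqc.html']

Each of those six filenames matches the fastqc rule’s output pattern with a different combination of wildcard values.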
This is much more subtle than it looks, though: we have read_numbers, N, and n. What does Snakemake actually use? The next section goes into more detail on this.
Wildcards are NOT global variables¶
Consider the following example:
samples = ["A", "B", 'C"]
read_numbers = [1, 2]
rule all:
input:
expand("{sample}_R{N}_fastqc.html", sample=samples, N=read_numbers),
expand("{sample}_R{N}_fastqc.count", sample=samples, N=read_numbers),
rule fastqc:
input:
"{sample}_R{n}.fastqc.gz"
output:
"{sample}_R{n}_fastqc.html"
wrapper:
"v1.25.0/bio/fastqc"
rule count:
input:
"{X}_R{Y}.fastq.gz"
output:
"{X}_R{Y}.fastq.count"
shell:
'LINES=$(wc -l {input} | cut -f1 -d " "); '
'echo $(($LINES / 4)) > {output}'
At first, this looks pretty confusing:
- We have a list, read_numbers.
- We have a wildcard placeholder N in the expand() call, and we provide the argument N=read_numbers.
- In the fastqc rule, we have an {n} wildcard.
- In the count rule, we have completely different wildcards altogether ({X} and {Y}).
There is no variable n or X or Y or sample. So how does Snakemake know what to use for {n} in the output pattern of the fastqc rule, or {X} and {Y} in the count rule?
Note
This section assumes you understand the section above on “Thinking like Snakemake”, so make sure to go back and read that if you need to.
Let’s think like Snakemake. Rule all has an expand(), which results in a list of filenames. So when Snakemake starts looking at the rules to figure out what rules it needs to run, the input of the “all” rule is just a list of filenames. That is, there is no {N} wildcard anywhere.
Let’s now take the first filename resulting from that expand() call, A_R1_fastqc.html. Snakemake looks for a rule whose output matches that pattern, and identifies the fastqc rule. Snakemake internally figures out that the output pattern of the fastqc rule will match if sample="A" and n=1, so it sets those wildcard values for the input as well. Any wildcard used in a rule’s input must also appear in that rule’s output: we can only use {n} in the input because it also exists in the output.
But other rules’ wildcards are independent.
For example, when Snakemake looks at the input to the all rule and finds A_R1.fastq.count, it finds that the count rule’s output pattern will match. In that case, Snakemake figures out that it needs to set X="A" and Y=1 for the duration of this job, and fills in {X} and {Y} accordingly for the input.
This was a contrived example; in practice, we generally keep the wildcards consistent across rules for clarity.
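For example, the count rule above behaves identically if we reuse the fastqc rule’s wildcard names:

rule count:
    input:
        "{sample}_R{n}.fastq.gz"
    output:
        "{sample}_R{n}.fastq.count"
    shell:
        'LINES=$(wc -l < {input}); '
        'echo $(($LINES / 4)) > {output}'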
Directed Acyclic Graph (DAG)¶
This is a quick way to visualize the workflow and its associated wildcards. Using it at various points during Snakefile development can help you understand where splitting/aggregation steps occur and how your rules depend on one another.
Note
If you don’t already have it in your conda environment, on NIH’s Biowulf
you can get graphviz with module load graphviz
on an interactive node.
- a DAG will include the wildcards
- a rulegraph will not include the wildcards
snakemake --dag | dot -Tpng > dag.png
snakemake --rulegraph | dot -Tpng > rulegraph.png
Expand¶
The expand()
function is not anything magical or special – it is simply
a convenient way to generate a regular Python list.
states=["Colorado", "Texas", Maryland"]
rule all:
input:
expand("{state}.txt", states_list=state)
is exactly equivalent to:
rule all:
    input:
        ["Colorado.txt", "Texas.txt", "Maryland.txt"]
Sometimes you want to “protect” a wildcard so it can be filled in by a later rule. In earlier versions of Snakemake, you can “escape” the { by using {{. The expand() function will fill in the other wildcards, and the {{ }} will become { } in the resulting list. For example:
states=["Colorado", "Texas", Maryland"]
expand("{state}_{{n}}.txt", state=states)
# returns:
["Colorado_{n}.txt", "Texas_{n}.txt", Maryland_{n}.txt"]
In later versions of Snakemake, you can use the allow_missing=True argument when calling expand().
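For example:

states = ["Colorado", "Texas", "Maryland"]
expand("{state}_{n}.txt", state=states, allow_missing=True)
# returns:
["Colorado_{n}.txt", "Texas_{n}.txt", "Maryland_{n}.txt"]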
See also
- Snakemake docs FAQ on not expanding every wildcard
- Snakemake docs section on expand()
- Source code for expand() in a recent version (look in the snakemake/io.py file in any version)
Wildcard constraints¶
Another way to help control wildcard matching is to use wildcard_constraints. These can be added at the top of your Snakefile (applying globally) or within individual rules (applying only to that rule). A wildcard constraint restricts the values a wildcard is allowed to take, which cuts down the possible combinations that can be filled in. For example, in {state}_{number}.txt you may want to specify that {state} may only be filled in with “Colorado”, “Texas”, or “Maryland”, and {number} only with “1”, “2”, or “3”. Why? Without constraints, either wildcard could match either kind of value, giving filenames like “1_Colorado.txt” or “2_3.txt” (18 possible unique filename combinations); with constraints, you only get “Colorado_1.txt”, “Colorado_2.txt”, “Colorado_3.txt”, “Texas_1.txt”, “Texas_2.txt”, etc. (9 possible unique filename combinations). Constraints are specified using regular expressions, so it may look like this:
wildcard_constraints:
    state="Colorado|Texas|Maryland",
    number="1|2|3"
See also
- wildcards section of the Snakemake docs has some more info
- Pythex, a great website for interactively building/debugging regular expressions
Conda envs and run: blocks¶
You may find it useful to use conda: directives in rules so that the rule runs in its own environment. Keep in mind though that you can’t use this mechanism if you have a run: block. A run block doesn’t work with a separate conda environment because the Python code is sort of interleaved within the Snakefile, and there’s no way to activate a conda environment from within a running Python instance and somehow swap over all running code from memory into the new environment. shell, wrapper, and script are all fine to use with the conda: directive, so the solution is typically to refactor the code into a script or wrapper.
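For example, instead of a run: block, a rule might end up looking like this sketch (the environment file and script paths here are hypothetical):

rule postprocess:
    input:
        "{sample}_sorted.txt"
    output:
        "{sample}_summary.txt"
    conda:
        "envs/postprocess.yaml"   # hypothetical environment definition
    script:
        "scripts/postprocess.py"  # hypothetical script; it can access snakemake.input and snakemake.output

When run with --use-conda, Snakemake builds and activates the environment before executing the script.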
See also
This is also discussed in this Snakemake FAQ.