.. _snakemake:

Snakemake
=========

A workflow manager like Snakemake is the next step after simple bash scripts
towards fully scalable and reproducible analyses. Another popular workflow
manager for genomics is `Nextflow `_, and `Wratten et al 2021 `_ is a review
of the current state of the field. You can see how many workflow managers
exist in general with `this list `_ (over 300 at the time of this writing).

In BSPC, we prefer Snakemake over other workflow managers because:

- Its simple combination of bash and Python makes it straightforward to
  convert individual tools' command-line calls into working Snakemake code.
- Any valid Python is valid Snakemake, and we write a lot of Python, so we
  can leverage that knowledge in our Snakemake files.
- It affords the most flexibility and customization out of all the workflow
  managers we've tried. This is important to us as a bioinformatics core
  working on a wide variety of projects that often need customization.
- Snakemake integrates very well with conda and high-performance compute
  clusters, which we also use a lot.

Snakemake is a superset of Python that allows for workflow management using
"rules" that dictate the input/output files for each rule, the commands to
convert inputs to outputs, and the dependencies between rules. It can be run
locally or on a cluster with no code changes. In the context of genomics
workflows, if a new sample is added, Snakemake will figure out just the rules
that need to be run on that single sample, leaving everything else untouched.
It also trivially parallelizes the tasks that are run, dramatically speeding
up workflow runs.

Learning Snakemake
------------------

The following resources are ordered from most basic to more advanced.

- Introductory `snakemake slides `_ give a very brief overview of what
  Snakemake is about.
- Live `demo video `_ given by the author.
- The `official Snakemake tutorial `_ will give you a more in-depth
  understanding.
- Additional `tutorial from Titus Brown `_ on Snakemake
- `Snakemake tutorial specific to NIH's Biowulf `_

Supporting information, tips and tricks
---------------------------------------

The rest of this page offers some suggestions, clarifications, and tips. The
goal is to add additional information on topics that we typically see new
users struggling with.

Thinking like Snakemake
~~~~~~~~~~~~~~~~~~~~~~~

When teaching Snakemake, it's often useful to "think like Snakemake" to
understand how all the moving parts fit together. For this exercise, let's
create an example Snakefile.

.. code-block:: python

    samples = ["A", "B", "C"]

    rule all:
        input:
            "report.txt"

    rule sort:
        input:
            "{sample}.txt"
        output:
            "{sample}_sorted.txt"
        shell:
            "sort {input} > {output}"

    rule report:
        input:
            expand("{sample}_sorted.txt", sample=samples),
        output:
            "report.txt"
        shell:
            "cat /dev/null > {output}; "
            "for i in {input}; do head -n1 $i >> {output}; done"

This file sorts each text file (the ``sort`` rule, 1 input --> 1 output) and
then adds the first line of each of those sorted files to the report (the
``report`` rule, a many inputs --> 1 output rule, or "aggregation" rule).
Here is how Snakemake works through this file:

- Run the first rule encountered. This is the ``all`` rule.
- Check the input. Here, it's ``report.txt``. Does that file exist? If yes,
  then we're done. If no, then we need to figure out how to make it. For the
  purposes of this exercise, assume it doesn't exist yet.
- Check the rest of the rules. Are there any rules whose output files match
  the pattern ``report.txt``?
- Yes, the ``report`` rule's output matches. It's actually an exact match,
  and there are no wildcards to worry about.
- OK, so we at least have to run the ``report`` rule. Let's check its input.
- The ``expand()`` function just returns a list. So the input for the
  ``report`` rule is ``["A_sorted.txt", "B_sorted.txt", "C_sorted.txt"]``.
- Let's take those one at a time. The first input is ``A_sorted.txt``. Does
  the file already exist? Let's assume no.
- Check the rest of the rules. Are there any rules whose output files match
  the pattern ``A_sorted.txt``?
- Yes, the ``sort`` rule's output matches if we set the ``{sample}`` wildcard
  to ``A``. (Snakemake does this wildcard matching internally; sometimes it
  gets things wrong. See the "Wildcard constraints" section below for some
  hints if that happens.)
- OK, so we have to run the ``sort`` rule with the ``{sample}`` wildcard set
  to ``A``. So let's check its input. What's the input of the ``sort`` rule,
  if we fill in ``A`` as the value of the ``{sample}`` wildcard?
- Assuming ``{sample}`` is ``A``, we need ``A.txt``. Does the file already
  exist? Here we'll assume that yes, it does exist, because there are no
  other rules. Otherwise, the Snakefile could not run; it would complain
  about a missing input file ``A.txt`` because it can't find any rules that
  could make it, and we'd either have to make a rule to create it, or make
  sure our starting input data exists.
- That took care of the first input for the ``report`` rule. On to the next
  one!
- Does ``B_sorted.txt`` exist? Let's assume that it doesn't exist, just like
  ``A_sorted.txt``. We'll go through the same process we did for
  ``A_sorted.txt``. That is, check if any rule's output matches the pattern
  of this file; if so, then create value(s) for the wildcard(s) (here,
  ``sample="B"``); fill in the input with the wildcard value; check to see if
  that input exists; if yes, there's nothing else to do with this "chain" of
  rules to run; if no, then we need to figure out what rule will make the
  output. Again, here we're assuming that the starting file ``B.txt`` does
  exist.
- OK, back to the ``report`` rule, and that last input file,
  ``C_sorted.txt``. Let's now imagine it *does* exist (unlike
  ``A_sorted.txt`` and ``B_sorted.txt``).
- If the file exists, then check to see if it is older than its input. If
  it's older, then we'll need to remake it. If it's newer, it's up-to-date
  and there's nothing more we need to do with ``C_sorted.txt``.
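The wildcard matching described above can be approximated in plain Python:
each ``{wildcard}`` in an output pattern behaves like a named capture group
in a regular expression. The following is only a rough sketch of the idea,
not Snakemake's actual implementation (``match_output`` is a hypothetical
helper name):

```python
import re

def match_output(pattern, filename):
    # Hypothetical helper approximating Snakemake's internal wildcard
    # matching: each {name} in an output pattern becomes a named regex
    # group. Returns the wildcard values on a match, otherwise None.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)      # literal text between wildcards
        else:
            regex += f"(?P<{part}>.+)"    # wildcard -> named capture group
    m = re.fullmatch(regex, filename)
    return m.groupdict() if m else None

print(match_output("{sample}_sorted.txt", "A_sorted.txt"))  # {'sample': 'A'}
print(match_output("{sample}_sorted.txt", "report.txt"))    # None
```

When the pattern has no wildcards at all (like ``report.txt``), this reduces
to an exact filename comparison, which is why the ``report`` rule matched
without any wildcard bookkeeping.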
This process of checking input to see if it exists, finding rules whose
output patterns match, filling in wildcards into the input, and seeing if
*that* exists, is the general process. Keeping this in mind can help when
debugging.

Be careful using for-loops in rules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When initially learning Snakemake, you may be inclined to use ``expand()`` to
get a list of files, and then iterate over them in a for-loop within the
rule. For example, in a ``shell:`` block it might look something like:

.. code-block:: python

    samples = ["A", "B", "C"]

    # Don't do this.
    rule example1:
        input:
            expand("{sample}.txt", sample=samples)
        output:
            expand("{sample}.txt.sorted", sample=samples)
        shell:
            "for i in {input}; do "
            "sort $i > $i.sorted; "
            "done"

or for a ``run:`` block it might look something like:

.. code-block:: python

    samples = ["A", "B", "C"]

    # Don't do this either.
    rule example2:
        input:
            expand("{sample}.txt", sample=samples)
        output:
            expand("{sample}.txt.sorted", sample=samples)
        run:
            for fin, fout in zip(input, output):
                shell("sort {fin} > {fout}")

The reasoning for writing a rule like this is typically something like, "I
want to do this thing many times, and in R/Python/Bash I use a for-loop to do
something many times, so let's figure out how to make a for-loop work in
Snakemake". **However, in general, you don't want to use a for-loop in a rule
if the number of inputs equals the number of outputs.** Let's explain why.

Each of these rules, ``example1`` and ``example2``, will only run once.
Because of the for-loop, the sorting of sample ``B`` won't start until sample
``A`` finishes sorting. That is, rather than running in parallel, this runs
in serial. So even if we had 3 CPUs available to sort A, B, and C
simultaneously, the way it is written they will all run on a single CPU, so
it will take 3x the time it would take if running in parallel.
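The reason ``example1`` and ``example2`` run only once is that ``expand()``
has already flattened the wildcard into one concrete list, so there is
exactly one combination of output files. A minimal sketch of what
``expand()`` effectively does (a simplified reimplementation for
illustration, not Snakemake's actual code):

```python
from itertools import product

def expand(pattern, **kwargs):
    # Simplified stand-in for Snakemake's expand(): fill the pattern with
    # every combination of the provided values, returning a plain list.
    keys = list(kwargs)
    return [
        pattern.format(**dict(zip(keys, combo)))
        for combo in product(*(kwargs[k] for k in keys))
    ]

print(expand("{sample}.txt.sorted", sample=["A", "B", "C"]))
# ['A.txt.sorted', 'B.txt.sorted', 'C.txt.sorted']
```

Because the rule's ``output:`` is this entire list, Snakemake sees a single
job with three outputs rather than three jobs with one output each.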
Furthermore, if we add a new sample ``D``, it will force ``A``, ``B``, and
``C`` to run again, even if they were up to date -- wasting time and
resources.

Here's what it *should* look like:

.. code-block:: python

    # Do it this way instead.
    samples = ["A", "B", "C"]

    rule all:
        input:
            expand("{sample}.txt.sorted", sample=samples)

    rule example3:
        input:
            "{sample}.txt"
        output:
            "{sample}.txt.sorted"
        shell:
            "sort {input} > {output}"

This last rule will **run once for each sample, in parallel**. Snakemake will
kick off one job for sample ``A``, another for sample ``B``, and a third for
sample ``C``. If multiple cores have been provided on the command line with
``--cores/-j``, then these three jobs will run in parallel.

**In general, a rule will run in parallel as many times as there are unique
combinations of output files.** If we look at rules ``example1`` and
``example2`` above, how many unique combinations of output are there? The
answer is *one*, because the ``expand()`` fills in the ``{sample}`` wildcard,
giving a single list of the three filenames. In contrast, for the
``example3`` rule there are 3 unique sets of output. The rule will run once
where the single output is ``A.txt.sorted`` (``sample=A``), again where the
single output file is ``B.txt.sorted`` (``sample=B``), and a third time for
``C.txt.sorted`` (``sample=C``). Snakemake can run each of these three jobs
in parallel, one per CPU, as long as there are enough CPUs to do so (most
modern machines have multiple cores).

How about this example -- how many times will the ``fastqc`` rule run?

.. code-block:: python

    samples = ["A", "B", "C"]
    read_numbers = [1, 2]

    rule all:
        input:
            expand("{sample}_R{N}_fastqc.html", sample=samples, N=read_numbers)

    rule fastqc:
        input:
            "{sample}_R{n}.fastq.gz"
        output:
            "{sample}_R{n}_fastqc.html"
        wrapper:
            "v1.25.0/bio/fastqc"

There are two different wildcards in the output, ``{sample}`` and ``{n}``.
There are 3 values for ``sample`` (A, B, C) and 2 values for ``n`` (1, 2), so
the rule will run 6 times, in parallel. This is much more subtle than it
looks, though -- we have ``read_numbers``, ``N``, and ``n``. What does
Snakemake actually use? The next section goes into more detail on this.

Wildcards are NOT global variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider the following example:

.. code-block:: python

    samples = ["A", "B", "C"]
    read_numbers = [1, 2]

    rule all:
        input:
            expand("{sample}_R{N}_fastqc.html", sample=samples, N=read_numbers),
            expand("{sample}_R{N}.fastq.count", sample=samples, N=read_numbers),

    rule fastqc:
        input:
            "{sample}_R{n}.fastq.gz"
        output:
            "{sample}_R{n}_fastqc.html"
        wrapper:
            "v1.25.0/bio/fastqc"

    rule count:
        input:
            "{X}_R{Y}.fastq.gz"
        output:
            "{X}_R{Y}.fastq.count"
        shell:
            "LINES=$(zcat {input} | wc -l); "
            "echo $(($LINES / 4)) > {output}"

At first, this looks pretty confusing:

- We have a list, ``read_numbers``.
- We have a wildcard placeholder ``N`` in the ``expand()`` call, and we
  provide the argument ``N=read_numbers``.
- In the ``fastqc`` rule, we have an ``{n}`` wildcard.
- In the ``count`` rule, we have completely different wildcards altogether
  (``X`` and ``Y``).

**There is no variable** ``n`` or ``X`` or ``Y`` or ``sample``. So how does
Snakemake know what to use for ``{n}`` in the output pattern of the
``fastqc`` rule, or ``X`` and ``Y`` in the ``count`` rule?

.. note::

    This section assumes you understand the section above on "Thinking like
    Snakemake", so make sure to go back and read that if you need to.

Let's think like Snakemake. Rule ``all`` has an ``expand()``, which results
in a list of filenames. So when Snakemake starts looking at the rules to
figure out what it needs to run, *the input of the "all" rule is just a list
of filenames*. That is, there is no ``{N}`` wildcard anywhere. Let's now take
the first filename resulting from that ``expand()`` call,
``A_R1_fastqc.html``.
Snakemake looks for a rule whose output matches that pattern, and identifies
the ``fastqc`` rule. Snakemake internally figures out that the output pattern
of the ``fastqc`` rule will match if ``sample="A"`` and ``n=1``. So it sets
the values of those wildcards for the input.

**The input wildcards must be found in the output wildcards of the same
rule.** So the fact that we have ``{n}`` in the output means that we must use
``{n}`` in the input. And we can only use ``{n}`` in the input because it
also exists in the output.

But *other* rules' wildcards are *independent*. For example, when Snakemake
looks at the input to the ``all`` rule and finds ``A_R1.fastq.count``, it
finds that the ``count`` rule's output pattern will match. In that case,
Snakemake figures out that it needs to set ``X="A"`` and ``Y=1`` for the
duration of this rule, and fills in ``X`` and ``Y`` accordingly for the
input.

This was a contrived example; in practice, we generally keep the wildcards
consistent across rules for clarity.

Directed Acyclic Graph (DAG)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a quick way to visualize the workflow and associated wildcards. Using
this tool at various points in your Snakefile development process can help
you understand where splitting/aggregation steps occur and what the rule
dependencies of your data are.

.. note::

    If you don't already have it in your conda environment, on NIH's Biowulf
    you can get graphviz with ``module load graphviz`` on an interactive
    node.

- a DAG will include the wildcards
- a rulegraph will **not** include the wildcards

.. code-block:: bash

    snakemake --dag | dot -Tpng > dag.png
    snakemake --rulegraph | dot -Tpng > rulegraph.png

Expand
~~~~~~

The ``expand()`` function is not anything magical or special -- it is simply
a convenient way to generate a regular Python list.

.. code-block:: python

    states = ["Colorado", "Texas", "Maryland"]

    rule all:
        input:
            expand("{state}.txt", state=states)

is exactly equivalent to:

.. code-block:: python

    rule all:
        input:
            ["Colorado.txt", "Texas.txt", "Maryland.txt"]

Sometimes, you want to "protect" a wildcard so it can be filled in by a later
rule. In earlier versions of Snakemake, you can "escape" the ``{`` by using
``{{``. The ``expand()`` function will fill in the other wildcards, and the
``{{ }}`` will become ``{ }`` in the resulting list. For example:

.. code-block:: python

    states = ["Colorado", "Texas", "Maryland"]
    expand("{state}_{{n}}.txt", state=states)
    # returns: ["Colorado_{n}.txt", "Texas_{n}.txt", "Maryland_{n}.txt"]

In later versions of Snakemake, you can instead use the
``allow_missing=True`` argument when calling ``expand()``.

.. seealso::

    - Snakemake docs `FAQ on not expanding every wildcard `_
    - Snakemake docs `section on expand() `_
    - `Source code for expand() `_ in a recent version (look in the
      ``snakemake/io.py`` file in any version)

Wildcard constraints
~~~~~~~~~~~~~~~~~~~~

Another way to help control the wildcards is to use a
``wildcard_constraints`` directive. This can be added at the top of your
Snakefile (acting globally) or within individual rules (acting locally). A
wildcard constraint helps restrict the possible values that can be filled in.

For example, in ``{state}_{number}.txt`` you may want to specify that
``{state}`` may only be filled in with one of the names "Colorado", "Texas",
or "Maryland", and ``{number}`` only with "1", "2", or "3". Why? Because
without constraints the wildcards could match unintended filenames like
"1_Colorado.txt" or "2_3.txt"; with constraints, only the 9 intended
combinations ("Colorado_1.txt", "Colorado_2.txt", "Colorado_3.txt",
"Texas_1.txt", etc.) can match. Constraints are specified using regular
expressions, so it may look like this::

    wildcard_constraints:
        state="Colorado|Texas|Maryland",
        number="1|2|3"

.. seealso::

    - `wildcard constraint section of the tutorial `_
    - the `wildcards section `_ of the Snakemake docs has some more info
    - `Python regular expression docs `_
    - `Pythex `_, a great website for interactively building/debugging
      regular expressions

Conda envs and ``run:`` blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You may find it useful to use ``conda:`` directives in rules so that each
rule runs in its own environment. Keep in mind, though, that you can't use
this mechanism if the rule has a ``run:`` block. A ``run:`` block doesn't
work with a separate conda environment because its Python code is interleaved
within the Snakefile itself, and there is no way to activate a conda
environment from within a running Python instance and somehow swap all the
running code from memory into the new environment. ``shell:``, ``wrapper:``,
and ``script:`` all work with the ``conda:`` directive, so the solution is
typically to refactor the code into a script or wrapper.

.. seealso::

    This is also discussed in `this Snakemake FAQ `_.
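As a sketch of that refactor, a rule that would otherwise need a ``run:``
block can point at an external script via ``script:``, which does work with
``conda:``. The environment file and script path below are hypothetical
examples:

```python
rule count_reads:
    input:
        "{sample}.fastq.gz"
    output:
        "{sample}.count"
    conda:
        "envs/counting.yaml"       # hypothetical environment file
    script:
        "scripts/count_reads.py"   # inside the script, this rule's input and
                                   # output are available via the
                                   # ``snakemake`` object (e.g.
                                   # snakemake.input[0])
```

Because the work now happens in a separate script, Snakemake can run it
inside the rule's own conda environment rather than in the Snakefile's Python
process.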