Bioinformatics Pipelines with Snakemake

Over the last four weeks you have quality filtered and trimmed reads, aligned them to a reference sequence, called variants, and filtered those variants for particular analytic goals. To do so, you have created directories, moved files, renamed files, run commands from bioinformatics tools in .sbatch scripts, and manipulated their output. My hope is that you have already learned a lot and built the competence and mental framework for deploying similar programs in the future. But I suspect that any new-found confidence you have developed has come with a profound sense of the complexity and hazards of this “dry lab” work, particularly its dependence on correct file paths, arguments, and software versions. Today’s lab, the final in our bioinformatics sequence, attempts to address some of these anxieties and tie together the skills you have built with a workflow manager called Snakemake.

Tuesday’s reading should have helped solidify your understanding of what workflow managers are and what they do. It should also have provided some specific background on Snakemake, the tool we will learn to deploy below. Note that this is only one of several workflow managers you might choose for your own research, with Nextflow in particular boasting similarly wide use. But because Snakemake is based on Python, a programming language likely to be more familiar to you (or at least more readable) than “Groovy” (seriously), it will be our choice.

Snakemake has extensive documentation. I do not begin to presume you will read much of it, but as usual, it is worth getting oriented to it over the course of 5-10 minutes. You may then create an environment configuration file (snakemake.yml) to install the program via Mamba:

name: snakemake
channels: 
  - conda-forge 
  - bioconda 
  - nodefaults 
dependencies:
  - snakemake >=9
  - snakemake-executor-plugin-slurm  # Slurm executor used by the profile below; Snakemake 8+ ships executors as separate plugins

Because Snakemake is larger and has more dependencies than most, if not all, of the tools we have installed this semester, creating its environment on the log-in node can be extremely slow. Instead, I recommend creating an .sbatch file that loads the Mamba module, makes it callable, and then contains your usual mamba env create command, e.g.:

#!/bin/bash

#SBATCH --account=priority-bioe-591-genomics
#SBATCH --job-name=snake_install                             
#SBATCH --partition=priority              
#SBATCH --nodes=1                       
#SBATCH --ntasks-per-node=1             
#SBATCH --cpus-per-task=1              
#SBATCH --time=0-00:30:00                 
#SBATCH --output=snake_install-%j.out
#SBATCH --error=snake_install-%j.err


module load Mamba/23.11.0-0;
eval "$(conda shell.bash hook)"
mamba env create -f snakemake.yml

Submit this script and track its progress with sacct. When it has completed, load the environment (conda activate snakemake) and run snakemake --help to confirm the installation was successful.
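
In other words, the sequence will look something like the sketch below (the script name here is just an example; use whatever you called yours):

sbatch snake_install.sbatch                                    # submit the installation job
sacct                                                          # track its progress
module load Mamba/23.11.0-0; eval "$(conda shell.bash hook)"   # make conda callable on the login node
conda activate snakemake                                       # once the job has completed
snakemake --help                                               # confirm the installation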

Assuming that is the case, create a new subdirectory within your personal course materials folder, e.g. mkdir workflow. Within workflow (or whatever you choose to name it), add a log subdirectory (mkdir log), a data subdirectory (mkdir data), and copy in paired-end reads from a single sample (e.g., cp ~/bioe-591-genomics/course-materials/data/raw_reads/species/sample1_R*.fastq.gz ~/bioe-591-genomics/students/you/workflow/data/). Next, create a new blank file named Snakefile (touch Snakefile) and a directory for .sbatch scripts (mkdir run). You will also need to create a new subdirectory for Mamba environment configuration files (mkdir envs), and copy in your old .yaml configuration file for fastp. If you do not have that handy, you may simply create a new file named envs/fastp.yaml and paste in the text below:

name: fastp
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastp
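
For reference, here is the same setup gathered into one sketch of shell commands, assuming you use the directory and file names above (adjust the sample filenames and paths to match your own data):

cd ~/bioe-591-genomics/students/you        # your personal course-materials folder
mkdir workflow && cd workflow              # root of the new pipeline
mkdir log data run envs                    # logs, raw reads, sbatch scripts, environment files
cp ~/bioe-591-genomics/course-materials/data/raw_reads/species/sample1_R*.fastq.gz data/
touch Snakefile                            # to be filled in below
nano envs/fastp.yaml                       # paste in the fastp environment above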

Lastly, we will create a configuration file that permits you to use Snakemake with Slurm. Put this in the following set of nested subdirectories:

mkdir profiles
mkdir profiles/slurm
touch profiles/slurm/config.yaml

The file itself should read as follows:

executor: slurm
jobs: 20
latency-wait: 60
default-resources:
  - runtime=30
  - mem_mb=4000
  - slurm_partition=priority
  - slurm_account=priority-bioe-591-genomics
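
These defaults apply to every job Snakemake submits. If a particular rule later needs more time or memory, you can override them within that rule with a resources: block, as in the purely illustrative (and hypothetically named) rule below:

rule heavy_step:
    resources:
        runtime=120,   # minutes
        mem_mb=8000
    shell:
        "echo 'a memory-hungry command would go here'"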

When this is done, you should have a directory structure that looks like this:

workflow/
├── data/
│   ├── sample1_R1.fastq.gz
│   └── sample1_R2.fastq.gz
├── envs/
│   └── fastp.yaml
├── log/
├── run/
├── profiles/
│   └── slurm/
│       └── config.yaml
└── Snakefile

Let’s now establish some clear objectives. To demonstrate one of the simplest possible use cases of Snakemake, we will create a rule that runs fastp on the paired-end reads you copied in the previous step. Add the following text to your Snakefile, taking care to maintain the exact spacing below (Python is famously sensitive to indentation):

rule fastp:
    conda:
        "envs/fastp.yaml"
    input:
        r1="data/sample1_R1.fastq.gz",
        r2="data/sample1_R2.fastq.gz"
    output:
        r1="results/clean/sample1_R1.clean.fastq.gz",
        r2="results/clean/sample1_R2.clean.fastq.gz"
    shell:
        """
        fastp \
        -i {input.r1} -I {input.r2} \
        -o {output.r1} -O {output.r2}
        """

Edit sample names as necessary. Note that the shell: block of the rule contains (within triple quotations) the same basic fastp command you will remember from a few classes previously, albeit with placeholders in braces ({input.r1}) in the place of absolute or relative paths; Snakemake fills these in from the rule’s input: and output: directives. The same brace syntax is used for wildcards, which are conceptually similar to the * operator you have been using in bash, and which we will dive into in greater detail below.
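
To make the substitution concrete: when Snakemake executes this rule, the braces are filled in with the paths listed under input: and output:, so the command it actually runs is equivalent to:

fastp \
-i data/sample1_R1.fastq.gz -I data/sample1_R2.fastq.gz \
-o results/clean/sample1_R1.clean.fastq.gz -O results/clean/sample1_R2.clean.fastq.gz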

In the run/ subdirectory of workflow, create a new sbatch script with the typical header, edited for the particulars of your run, and add the command snakemake --profile profiles/slurm --use-conda, e.g.:

#!/bin/bash

#SBATCH --account=priority-bioe-591-genomics
#SBATCH --job-name=snakemake                             
#SBATCH --partition=priority              
#SBATCH --nodes=1                     
#SBATCH --ntasks-per-node=1             
#SBATCH --cpus-per-task=1              
#SBATCH --time=0-00:30:00                 
#SBATCH --output=log/snakemake-%j.out
#SBATCH --error=log/snakemake-%j.err

module load Mamba/23.11.0-0;
eval "$(conda shell.bash hook)"
snakemake --profile profiles/slurm --use-conda

In theory, you should be ready to run the workflow. Before we do so, however, it is useful to perform a so-called dry run by invoking the same command with the addition of the argument -n. Doing so will test whether your paths and Snakemake syntax are correct and preview the steps it will run, which is particularly helpful when they are not what you expect! Conveniently, this can be done on the login node (though recall you will need to make Mamba and the conda command accessible via module load Mamba/23.11.0-0; eval "$(conda shell.bash hook)"):

snakemake --profile profiles/slurm --use-conda -n

If successful, you should see output that begins like this, summarizing the number of jobs on tap:

Using profile profiles/slurm for setting default command line arguments.
host: tempest-login
Building DAG of jobs...
Conda environment envs/fastp.yaml will be created.
Job stats:
job      count
-----  -------
fastp        1
total        1

The latter half indicates that rule fastp will indeed be run, as it is missing output files:

rule fastp:
    input: data/sample1_R1.fastq.gz, data/sample1_R2.fastq.gz
    output: results/clean/sample1_R1.clean.fastq.gz, results/clean/sample1_R2.clean.fastq.gz
    jobid: 0
    reason: Missing output files: results/clean/sample1_R1.clean.fastq.gz, results/clean/sample1_R2.clean.fastq.gz
    resources: mem_mb=4000, mem_mib=3815, disk_mb=<TBD>, tmpdir=<TBD>, runtime=30, slurm_partition=<TBD>, slurm_account=<TBD>
Job stats:
job      count
-----  -------
fastp        1
total        1

Reasons:
    (check individual jobs above for details)
    output files have to be generated:
        fastp
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Once your dry run command produces the above text, you can execute the real script, being careful to do so from the root of the workflow/ directory to ensure relative paths are not broken (e.g., sbatch run/snakemake.sbatch). The job should run relatively quickly, producing the fastp.html and fastp.json files you inspected earlier. It will also output trimmed reads in the results/clean/ subdirectory, with the names given by the output: block of rule fastp in your Snakefile.
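
A quick sanity check once the job finishes might look like this (adjust the sample name to your own):

ls results/clean/                                     # trimmed reads from fastp
zcat results/clean/sample1_R1.clean.fastq.gz | head   # peek at the first few records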

Though it is always satisfying to have a complicated piece of software run successfully, such a simple example does not clearly demonstrate one of the primary benefits of Snakemake: the ability to apply commands robustly to multiple samples simultaneously. You’ve come up with various solutions to this so far, ranging from copy-paste to asterisks in filenames, but you may have also been burned by the difficulty of doing so safely and efficiently. We will now introduce Snakemake’s wildcard syntax, re-running fastp on multiple files simultaneously. To do so, you will want to copy at least one other sample’s paired-end reads into workflow/data.

Next, edit your Snakefile using nano. In the simplified example below, the wildcard {sample} will match any character string in a filename that ends in _R1.fastq.gz. (For most of you, _R1_001.fastq.gz and _R2_001.fastq.gz will be better choices.) We specify our samples with a Python list named SAMPLES, which comes into play directly in the biggest addition to our Snakefile: the new rule all, which defines the end point of the workflow. Here, Snakemake simply looks for “input” files that indicate the entire pipeline is complete. In our case, these are trimmed fastq.gz files, which are listed using the software’s expand() helper function.

SAMPLES = ["sample1","sample2"]

rule all:
    input:
        expand("results/clean/{sample}_R1.clean.fastq.gz", sample=SAMPLES),
        expand("results/clean/{sample}_R2.clean.fastq.gz", sample=SAMPLES)

rule fastp:
    conda:
        "envs/fastp.yaml"
    input:
        r1="data/{sample}_R1.fastq.gz",
        r2="data/{sample}_R2.fastq.gz"
    output:
        r1="results/clean/{sample}_R1.clean.fastq.gz",
        r2="results/clean/{sample}_R2.clean.fastq.gz"
    shell:
        """
        fastp \
        -i {input.r1} -I {input.r2} \
        -o {output.r1} -O {output.r2}
        """

Once your paths and filenames have been changed as needed, attempt another dry run (snakemake --profile profiles/slurm --use-conda -n). If it is successful, you will note that the job stats section of the output now lists one fastp job for each sample you copied into your data/ subdirectory (plus rule all itself). This indicates you are ready to run the .sbatch script for real.

To conclude our Snakemake tutorial, we will expand our Snakefile to include a second rule, index, which will run samtools faidx to generate a FASTA index (.fai) file. First, you’ll need to create a genome/ subdirectory in workflow/ and copy hemoglobin_references.fasta into it. You’ll also want to copy your old align.yaml environment file into envs/. For convenience, I’ll reproduce that here:

name: align
channels:
  - conda-forge
  - bioconda
dependencies:
  - bwa=0.7.17
  - samtools=1.19

Then, add the following rule to the end of your Snakefile (i.e., one blank line below the end of rule fastp):

rule index:
    input:
        "genome/hemoglobin_references.fasta"
    output:
        "genome/hemoglobin_references.fasta.fai"
    conda:
        "envs/align.yaml"
    shell:
        """
        samtools faidx {input}
        """

If your indentation is O.K., you should be able to give the workflow a dry run immediately. Likely as not, you’ll see the following output:

Building DAG of jobs...
Nothing to be done (all requested files are present and up to date).

But there’s a new rule! I can hear you gasp. How can this be?! The answer is that rule all has a target (trimmed .fastq.gz files) that does not directly or indirectly require the output of rule index. To continue with your workflow, you’ll need to update rule all to reflect the new endpoint, i.e.:

rule all:
    input:
        "genome/hemoglobin_references.fasta.fai"

Once you’ve done so, you should be able to perform both dry and real runs, generating the desired .fai file. (Note that rule all will need to be updated again as you expand your pipeline to include downstream steps, e.g. variant calling and filtering.)

Now that you have two rules, it is worth introducing Snakemake’s directed acyclic graph (DAG) generator. Run the following command from the root of workflow/ (note that dot is provided by Graphviz; if the command is not found, you can add graphviz to your snakemake environment):

snakemake --dag | dot -Tpng > dag.png

You can view this by transferring the file to your laptop via Globus, or via the Tempest Web portal. Though unimpressive now, the output will be increasingly helpful as your Snakefile grows in complexity. Similarly, you can generate a report that includes a DAG, versioning, and runtime info:

snakemake --report report.html

Alas, if you take the time to view either of the previous files, you will notice that rule fastp is now ignored. This is again because the target files specified by rule all do not require it. We can solve this problem by specifying multiple targets as follows:

SAMPLES = ["sample1", "sample2"]

rule all:
    input:
        "genome/hemoglobin_references.fasta.fai",
        expand("results/clean/{sample}_R1.clean.fastq.gz", sample=SAMPLES),
        expand("results/clean/{sample}_R2.clean.fastq.gz", sample=SAMPLES)

(However, this will not be necessary if a subsequent rule takes both as input—which, of course, is exactly what is needed for bwa mem.)
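
For a sense of what that might look like, here is a rough sketch of an alignment rule that takes both trimmed files as input. Treat it as an illustration only: the flags, output naming, and resources are up to you, and bwa mem also requires a bwa index of the reference, which you would build with a separate rule (samtools faidx does not produce it).

rule align:
    conda:
        "envs/align.yaml"
    input:
        r1="results/clean/{sample}_R1.clean.fastq.gz",
        r2="results/clean/{sample}_R2.clean.fastq.gz",
        ref="genome/hemoglobin_references.fasta"
    output:
        "results/aligned/{sample}.sorted.bam"
    shell:
        """
        # assumes the reference has already been indexed with bwa index
        bwa mem {input.ref} {input.r1} {input.r2} | samtools sort -o {output}
        """

As always, rule all (or a downstream rule) must request the resulting .bam files for Snakemake to schedule this step.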

Thus concludes our unit on pipelines. Though our introduction to Snakemake has been necessarily cursory, I hope it has demonstrated the tool’s potential power and set you off on a lifetime of productive bioinformatics. Good luck, be safe, and have fun.

Homework 8

The goal of this week’s homework (due the Thursday following Spring Break) is to translate the past four weeks of lab activities into a single Snakemake pipeline that can be successfully run on at least two of your samples. The tutorial above should get you a good way towards that goal. At minimum, however, you’ll need to add rules that align reads, call variants, and filter variants.

  1. Update your envs/ subdirectory with the necessary .yaml config files for Mamba environments.
  2. Update your Snakefile with rules covering all remaining steps in your pipeline.
  3. Confirm that the pipeline works with a dry run, troubleshooting as needed. Once any issues are resolved, execute the dry run command again, saving the output to a text file using the > operator.
  4. Run the pipeline on your actual data. (As mentioned above, you may consider using only two samples to speed up processing time.) Create an .html Snakemake report file.
  5. Add, commit, and push your Snakefile, dry run output file, and .html report to GitHub. Organize directories and revise your README.md as per usual.