Configuration details

General configuration

The majority of the work in setting up a new project is in the configuration – which samples to run, where the data files are located, which references are needed, etc.

The entry point for configuration is the workflow/config directory, shown below:

$ tree workflow/config
workflow/config
├── atac-config
│   ├── aggregates.tsv
│   ├── assays.tsv
│   ├── config.yaml
│   └── samples.tsv
├── multiome-config
│   ├── aggregates.tsv
│   ├── assays.tsv
│   ├── config.yaml
│   └── samples.tsv
├── README.rst
└── rna-config
    ├── aggregates.tsv
    ├── assays.tsv
    ├── config.yaml
    └── samples.tsv

3 directories, 13 files

config.yaml

This is a snakemake configuration file.

See Config YAML for pipeline-specific configuration.
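
For orientation, a minimal config.yaml might look like the sketch below. The key names here are illustrative placeholders only; the authoritative schema is documented in Config YAML.

# Illustrative sketch only -- key names are hypothetical placeholders.
samples: config/multiome-config/samples.tsv        # path to the sample table
aggregates: config/multiome-config/aggregates.tsv  # optional: aggregates table
assays: config/multiome-config/assays.tsv          # optional: assay table
reference: /path/to/reference                      # genome reference to use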

samples.tsv

This is a default sample table containing input file paths and metadata.

See Samples Table for more.
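
As a rough illustration, the sample table is a tab-separated file with one row per sample; the column names below are hypothetical, and the required columns are listed in Samples Table.

# Hypothetical columns, for illustration only (tab-separated).
sample	path	condition
sampleA	/data/runs/sampleA	control
sampleB	/data/runs/sampleB	treated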

aggregates.tsv (Optional)

This table maps library barcode labels to library IDs in aggregated input files. An aggregated input contains multiple biological replicates and can be generated with cellranger-arc aggr (multiome), cellranger-atac aggr (scATAC-seq), or cellranger aggr (scRNA-seq).

See Aggregates Table for more.
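
For context, cellranger aggr and its ATAC/ARC counterparts suffix each cell barcode with the 1-based position of its library in the aggregation CSV (-1, -2, ...). A table along the lines of the sketch below maps those suffixes back to library IDs; the column names are hypothetical, and the real schema is given in Aggregates Table.

# Hypothetical columns, for illustration only (tab-separated).
barcode_suffix	library_id
1	sampleA
2	sampleB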

assays.tsv (Optional)

The assay table is only needed if you have computed custom feature-by-barcode count matrices. For example, given a genome annotation file containing gene coordinates and enhancer coordinates, you can count all ATAC reads mapping to a gene and its associated enhancers:

Gene Activity Score = reads mapped within gene + reads mapped to enhancers

See Assays Table for more.
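
As a sketch, each row could name a custom assay and point to its feature-by-barcode count matrix; the column names and paths below are hypothetical, and the actual schema is described in Assays Table.

# Hypothetical columns and paths, for illustration only (tab-separated).
assay	matrix
GeneActivity	results/counts/gene_activity.h5
Peaks	results/counts/peaks.h5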

Running on a cluster

The example commands in Getting Started describe running Snakemake locally. For larger data sets, you’ll want to run the workflows on an HPC cluster. Snakemake supports arbitrary cluster commands, making it easy to run these workflows in many different cluster environments.
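
As a generic illustration (not the invocation used by these workflows), Snakemake 7.x can hand each job to an arbitrary submission command, for example on a SLURM cluster:

# Generic example only; these workflows use the profile and wrapper described below.
snakemake --jobs 100 --cluster "sbatch --cpus-per-task={threads}"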

Snakemake and these workflows are designed to decouple the code from the configuration. If you are running the workflows on NIH’s Biowulf cluster, you don’t need to change anything.

If you are running on a different cluster, you should inspect the following files:

  • WRAPPER_SLURM

  • Profile config files from the NIH HPC

The default configuration we provide is specific to the NIH Biowulf cluster. To run a workflow on Biowulf, set up your Snakemake profile as described in Cluster configuration below. Then, from the top-level project directory, run the following command:

sbatch WRAPPER_SLURM

The WRAPPER_SLURM script submits the main Snakemake process on a separate node to avoid any restrictions from running on the head node. That main Snakemake process then submits each rule separately to the cluster scheduler.

TMPDIR handling

The top of each snakefile sets up a shell prefix that exports the TMPDIR variable. The reason is that NIH Biowulf nodes provide temporary local storage in a directory named after the SLURM job ID. That ID is not known ahead of time, but is stored in the SLURM_JOBID environment variable.

Since each rule executed on a cluster node calls the snakefile (see the job scripts created by Snakemake for details), we can look up the job ID and set the temp directory accordingly. Once $TMPDIR is set, Python’s tempfile module will store temporary files there, and wrappers can also reference $TMPDIR in shell commands to use the same directory.

Note that the default behavior – if the SLURM_JOBID environment variable is not set – is to fall back to the default temp directory as documented in Python’s tempfile module. However, if you use these workflows on a different cluster, you may need to provide a different function to return the job-specific temp directory.
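
The exported prefix amounts to shell logic along these lines (an illustrative sketch; the /lscratch path is Biowulf’s local-scratch convention, and the actual code lives at the top of each snakefile):

# Sketch only: use the job's local scratch under SLURM, otherwise fall back
# to a system default.
if [ -n "${SLURM_JOBID:-}" ]; then
    export TMPDIR="/lscratch/${SLURM_JOBID}"
else
    export TMPDIR="${TMPDIR:-/tmp}"
fi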

Cluster configuration

For Snakemake versions after 7.29, we use the profile established by the NIH HPC. If you’re a first-time user, you can set up your profile on NIH’s Biowulf as demonstrated below:

# Clone the profile repo
git clone https://github.com/NIH-HPC/snakemake_profile.git path/to/snakemake_profile

Once the repository is cloned, add export MULTIOMEWF_SNAKEMAKE_PROFILE="path/to/snakemake_profile" to your bash configuration file (~/.bashrc) as shown below:

# NOTE: This is for first-time users. You do not need to repeat this step if you
#       have set this up before.

# Update your bash config setting to include MULTIOMEWF_SNAKEMAKE_PROFILE
echo 'export MULTIOMEWF_SNAKEMAKE_PROFILE="path/to/snakemake_profile"' >> ~/.bashrc

# Update your bash environment in use
source ~/.bashrc

After completing this initial HPC setup, edit the WRAPPER_SLURM file so that the --configfile parameter points to the config.yaml you’re using.

# In WRAPPER_SLURM:
(
    time snakemake \
        <snakemake_parameters> \
        --configfile config/multiome-config/config.yaml    # IMPORTANT!
) > "Snakefile.log" 2>&1