Overview of workflows
Overview
The main goal of multiome-wf is to transform and combine raw data into usable results for downstream analyses. For scRNA-seq, this means differentially expressed genes (along with comprehensive QC and analysis). For scATAC-seq, it means called peaks or differentially accessible chromatin regions.
multiome-wf is a framework that supports a number of analysis variants based on how it is configured. Each class of analysis variants, such as scRNA-seq or scATAC-seq, is a “core” workflow, since a different directed acyclic graph (DAG) is constructed for each analysis variant. Using a single framework promotes flexibility in single-cell analyses: data derived from many different experimental strategies can effectively be combined, and if the configuration files are set up properly, multiome-wf will decide the best workflow to use.
Core pipeline structure
multiome-wf is a parallelized pipeline built using Snakemake. The pipeline consists of
rules that implement modular workflows, as defined in the Snakefile within the
workflow directory. Each rule runs analysis scripts written in R or Bash. R code
is executed from individual Rmd files in the same directory; rendering these files
also produces analysis reports in HTML format in the workflow/results folder upon
completion of the analysis.
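As a rough illustration of how the Rmd-based rules produce reports (a sketch only; the exact command each rule uses may differ), an individual report could be rendered with rmarkdown like so:

$ Rscript -e "rmarkdown::render('workflow/qc.Rmd', output_dir = 'workflow/results')"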
multiome-wf is configured by plain text YAML
and TSV format files located in the
config directory (see Configuration details for more information).
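For example, the scRNA-seq core workflow is configured by the files in workflow/config/rna-config/ (see the directory listing further below):

$ ls workflow/config/rna-config/
aggregates.tsv  assays.tsv  config.yaml  samples.tsv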
Optionally, the pipeline can run on a High Performance Computing (HPC) cluster. The
WRAPPER_SLURM file in the workflow directory is tailored for running the pipeline
on NIH’s Biowulf cluster. Refer to
Running on a cluster for more details.
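As a rough sketch (assuming WRAPPER_SLURM is submitted directly with sbatch and that the default resources are acceptable; check the script itself and Running on a cluster before using it), a submission on Biowulf might look like:

$ cd workflow/
$ sbatch WRAPPER_SLURM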
$ tree workflow/
workflow/
├── add_macs_peaks.Rmd
├── annotation_ensdb.Rmd
├── annotation_gtf.Rmd
├── chooser
│   ├── env.yaml
│   ├── R
│   │   └── pipeline.R
│   └── requirements.txt
├── chooser_aggr.Rmd
├── chooser_paral.Rmd
├── cluster.Rmd
├── combine.Rmd
├── common.R
├── config
│   ├── atac-config
│   │   ├── aggregates.tsv
│   │   ├── assays.tsv
│   │   ├── config.yaml
│   │   └── samples.tsv
│   ├── multiome-config
│   │   ├── aggregates.tsv
│   │   ├── assays.tsv
│   │   ├── config.yaml
│   │   └── samples.tsv
│   ├── README.rst
│   └── rna-config
│       ├── aggregates.tsv
│       ├── assays.tsv
│       ├── config.yaml
│       └── samples.tsv
├── create_seurat.Rmd
├── diff_analysis.Rmd
├── integrate.Rmd
├── merge_macs_prep.Rmd
├── merge_zinba.Rmd
├── normalize_reduce_dims.Rmd
├── qc.Rmd
├── README.rst
├── Snakefile
├── weighted_nn.Rmd
└── WRAPPER_SLURM

6 directories, 35 files
The core workflows are scRNA-seq, scATAC-seq, and multiome (combined scRNA-seq and scATAC-seq), corresponding to the rna-config, atac-config, and multiome-config configuration directories shown above.
A number of additional analyses can be added to core workflows, such as quantifying CRISPR sgRNA barcodes or surface protein-associated oligos, by updating the relevant config files.
In all cases, search for the string NOTE: in the Snakefile to read notes on
how to configure each rule, and make adjustments as necessary.
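For example, to list all of these notes along with their line numbers:

$ grep -n "NOTE:" workflow/Snakefile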
Note
If you have two different scATAC-seq experiments from different species, they
have to be run separately. However, if downstream analyses will use them both,
you may want to keep them in the same project. In this case, you can copy
the workflow directory into two other directories:
$ rsync -rvt workflow/ workflow-genome1-atac/
$ rsync -rvt workflow/ workflow-genome2-atac/
Now, downstream analyses can link to and utilize results from these individual folders, while the whole project remains self-contained.
Features common to workflows
In this section, we will take a higher-level look at the features common to all workflows.
- The config file is hard-coded to use one of the following: workflow/config/multiome-config/config.yaml, workflow/config/atac-config/config.yaml, or workflow/config/rna-config/config.yaml. This allows the config file to be in the config dir with other config files without having to be specified on the command line, while also affording the user flexibility. For instance, a custom config can be specified at the command line using snakemake --configfile <path to other config file> (see the example after this list).
- The config file is loaded using common.load_config. This function resolves various paths (especially the references config section) and checks to see if the config is well-formatted.
- Various files can be used to specify cluster-specific parameters if the workflows are being run in a high-performance cluster environment. For more details, carefully read the section Running on a cluster.
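For instance, assuming you have prepared an alternative config file at config/custom-config.yaml (a hypothetical path; substitute your own), the pipeline can be run from the workflow directory with:

$ snakemake -n --configfile config/custom-config.yaml    # dry run to preview the jobs
$ snakemake -j 8 --configfile config/custom-config.yaml

Adjust -j to the number of cores available on your machine.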