Config YAML

This page details the various configuration options and describes how to configure a new workflow. Refer to the Configuration details section for general information about configuring multiome-wf.

While it is possible to use Snakemake mechanisms such as --config to override a particular config value and --configfile to update the config with a different file, it is easiest to edit the existing config.yaml in place. This has the additional benefit of reproducibility because all of the config information is stored in one place.

The config file uses YAML format, which can be conceptualized as a set of nested key:value pairs. When running the workflow, the YAML document is parsed into a Python dictionary.
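As a rough illustration of the nested key:value structure, the sketch below writes a small fragment of the config as the Python dictionary it would become once parsed. The fragment is copied from the normalize example on this page; the helper get_nested is hypothetical and only shows how a nested value is reached.

```python
# A fragment of config.yaml, written as the dictionary it parses into.
config = {
    "normalize": {
        "split_by": "meta_geno",
        "groups": {
            "unintegrated_0": {
                "assay_name": "Gene.Expression",
                "norm_method": "sct",
            },
        },
    },
}


def get_nested(mapping, *keys):
    """Walk a nested dictionary one key at a time (hypothetical helper)."""
    value = mapping
    for key in keys:
        value = value[key]
    return value


norm_method = get_nested(config, "normalize", "groups", "unintegrated_0", "norm_method")
print(norm_method)  # sct
```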

By specifying values in various sections of the config.yaml, the workflow automatically decides to run analysis variants suitable for scRNA-Seq, scATAC-Seq, or multi-modal experiments. There are 2 important points to keep in mind when creating a config.yaml.

1. Activating Rules

The following rules are optional:

chromsizes/merge_macs_prep/macs2/add_macs_peaks/bigwig_signal/bigwig_noise for MACS2 peak calling
integrate for dataset integration using Seurat
chooser_paral/chooser_aggr for computing optimal resolution in clustering using chooseR
diff_analysis for marker gene computation
weighted_nn for Weighted Nearest Neighbor Analysis using Seurat (cross-modality integration)

These rules have discrete sections in the config.yaml where users configure the execution of each rule. Use the following settings to activate each rule; set the opposite value to deactivate it:

# Activate MACS2 peak calling
macs2:
  run: "Y"

# Activate dataset integration
integrate:
  activate: true

# Activate chooseR
cluster:
  resolution: null

# Activate marker gene computation
diff_analysis:
  activate: true

# Activate Weighted Nearest Neighbor
weighted_nn:
  activate: true

Note that the chooser_paral and chooser_aggr rules only run when no pre-defined resolution is provided by the user in the cluster section.
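The activation keys above can be read as a small decision table. The sketch below is a hypothetical helper (active_optional_rules is not part of multiome-wf) that applies the same conditions to a parsed config dictionary; it mirrors the keys described here, not the workflow's actual Snakefile logic.

```python
def active_optional_rules(config):
    """Return the optional rule sets that the config switches on.

    Hypothetical helper: the conditions mirror the activation keys
    described above, not the workflow's actual Snakefile logic.
    """
    active = []
    if config.get("macs2", {}).get("run") == "Y":
        active.append("macs2")
    if config.get("integrate", {}).get("activate"):
        active.append("integrate")
    # chooseR runs only when no resolution is pre-defined.
    if config.get("cluster", {}).get("resolution") is None:
        active.append("chooser")
    if config.get("diff_analysis", {}).get("activate"):
        active.append("diff_analysis")
    if config.get("weighted_nn", {}).get("activate"):
        active.append("weighted_nn")
    return active


config = {
    "macs2": {"run": "Y"},
    "integrate": {"activate": False},
    "cluster": {"resolution": 0.8},
    "diff_analysis": {"activate": True},
    "weighted_nn": {"activate": True},
}
print(active_optional_rules(config))
# ['macs2', 'diff_analysis', 'weighted_nn']
```

Here chooser is absent from the result because a resolution of 0.8 is pre-defined in the cluster section.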

2. Analysis Groups

Because of the myriad variants for single cell analysis and preprocessing, it is not possible to hard-code all the configuration options in the config.yaml file. Instead, we include analysis “group names” in many sections. These sections have a field named groups. Each group name maps to a nested dictionary describing one analysis variant.

To configure these sections, the user must specify the top-level dictionary key value. All other keys are hard-coded as options.

Using the normalize section as an example, the groups field below defines four analysis groups. The first group name, unintegrated_0, is itself a dictionary key for one analysis variant (the RNA modality in Multiome). Its dictionary contains additional fields that together define the group’s analysis options: assay_name and norm_method.

groups:
  unintegrated_0:
    assay_name: Gene.Expression
    norm_method: sct
  unintegrated_1:
    assay_name: Peaks
    norm_method: lsi
  unintegrated_2:
    assay_name: MACS
    norm_method: lsi
  unintegrated_3:
    assay_name: Gene.Activity
    norm_method: log

Note

It is possible to specify more analysis groups than the number of assays in your data. Do not specify analysis groups unless your experiment setup supports the condition.

For example, in the example config.yaml file, the differential analysis section, diff_analysis, contains group key names such as unintegrated_0 and integrated_0. If you are not performing Seurat integration (that is, the activate key is set to false in the integrate section), delete the integrated_* groups in the rest of the sections. If there are superfluous groups in the config.yaml, Snakemake will add extra, unwanted rules/jobs when building the DAG.
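A check along these lines can catch leftover groups before Snakemake builds the DAG. The sketch below is a hypothetical helper, not part of the workflow; it assumes the integrated_ prefix convention from the example config and only inspects the groups field of each section.

```python
def superfluous_integrated_groups(config):
    """List integrated_* groups left in the config while integration is off.

    Hypothetical check; it assumes the group prefix convention
    (unintegrated_*, integrated_*) used in the example config.
    """
    if config.get("integrate", {}).get("activate"):
        return []
    leftovers = []
    for section, body in config.items():
        # The integrate section itself is inactive, so skip it.
        if section == "integrate" or not isinstance(body, dict):
            continue
        for group in body.get("groups") or {}:
            if group.startswith("integrated_"):
                leftovers.append((section, group))
    return leftovers


config = {
    "integrate": {"activate": False, "groups": {"integrated_0": {}}},
    "diff_analysis": {
        "activate": True,
        "groups": {"unintegrated_0": {}, "integrated_0": {}},
    },
}
print(superfluous_integrated_groups(config))
# [('diff_analysis', 'integrated_0')]
```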

Field descriptions

Config Tables

`samples` field

string, default samples.tsv. Defines path to sampletable. See Samples Table for more.

Example:
samples: "config/multiome-config/samples.tsv"

# OR
# samples: "config/atac-config/samples.tsv" for scATAC-seq
# samples: "config/rna-config/samples.tsv" for scRNA-seq

`aggregates` field

string, default aggregates.tsv. Defines path to aggregates table. If you are using aggregated input of multiple samples created using cellranger-arc aggr (Multiome), cellranger-atac aggr (scATAC-seq), or cellranger aggr (scRNA-seq), specify the path to aggregates.tsv. Otherwise, set to an empty string (""). See Aggregates Table for more.

Example:
aggregates: "config/multiome-config/aggregates.tsv"

# OR
# aggregates: "config/atac-config/aggregates.tsv" for scATAC-seq
# aggregates: "config/rna-config/aggregates.tsv" for scRNA-seq

`assays` field

string, default assays.tsv. Defines path to assays table. If you are using custom counts matrices, specify path to assays.tsv. Otherwise, set to an empty string (""). See Assays Table for more.

Example:
assays: "config/multiome-config/assays.tsv"

# OR
# assays: "config/atac-config/assays.tsv" for scATAC-seq
# assays: "config/rna-config/assays.tsv" for scRNA-seq

Annotation

`ANNOTATION` field

string of "EnsDb" or "GTF", default "EnsDb". Defines the method to build an annotation object (GenomicRanges) for scATAC-seq and Multiome analyses.

"EnsDb" uses the EnsDb.Mmusculus.v79 (mouse mm10) or EnsDb.Hsapiens.v86 (human hg38) package in R

"GTF" uses a user-provided annotation file

`ANNO_FILE` field

string, default "path/to/genes.gtf.gz". If "GTF" is specified in the ANNOTATION field, provide the path to your annotation file (e.g. genes.gtf.gz). This field is disregarded if the ANNOTATION field is set to "EnsDb".

Quality Control (`qc` section)

`remove_outliers` field

boolean, default true. Specify whether or not to run qc rule.

`rm_outliers_method` field

string of "sd" or "iqr", default "sd". Detect outliers using either standard deviation ("sd"), or Tukey’s interquartile range ("iqr"). If set to "sd", the thresholds are determined based on +/- 3 standard deviations.
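To make the two options concrete, the sketch below computes lower and upper thresholds both ways using only the Python standard library. The function name outlier_bounds is hypothetical, and the 1.5 multiplier for the "iqr" fences is the conventional Tukey value, assumed here rather than taken from the workflow.

```python
import statistics


def outlier_bounds(values, method="sd"):
    """Return (lower, upper) outlier thresholds for a list of numbers.

    Sketch only. "sd" uses mean +/- 3 standard deviations as described
    above. For "iqr" the conventional Tukey multiplier of 1.5 is an
    assumption; the workflow may use a different value.
    """
    if method == "sd":
        center = statistics.mean(values)
        spread = statistics.stdev(values)
        return center - 3 * spread, center + 3 * spread
    if method == "iqr":
        q1, _, q3 = statistics.quantiles(values, n=4)
        iqr = q3 - q1
        return q1 - 1.5 * iqr, q3 + 1.5 * iqr
    raise ValueError(f"unknown method: {method!r}")


counts = [100, 110, 120, 130, 140, 150, 160, 170, 180, 5000]
for method in ("sd", "iqr"):
    lower, upper = outlier_bounds(counts, method)
    kept = [v for v in counts if lower <= v <= upper]
    print(method, len(kept))
```

In this toy input the single extreme value inflates the standard deviation enough that it stays inside the "sd" bounds, while the "iqr" bounds exclude it.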

`meta_labels` field

list. Which metadata columns to use for filtering?

See Samples Table for more details about how metadata columns are detected. If a value is specified in this field, but is not present in the data, it will be disregarded during filtering.

`lower` field

dict. Key:value pairs of metadata column and associated lower limit for cutoff, below which cells are excluded. If specified, overrides the lower limit detected by the outlier method for the associated metadata columns in meta_labels. If null, the outlier method tries to remove cells automatically.

`upper` field

dict. Key:value pairs of metadata column and associated upper limit for cutoff, above which cells are excluded. If specified, overrides the upper limit detected by the outlier method for the associated metadata columns in meta_labels. If null, the outlier method tries to remove cells automatically.

Note

In the example below, 4 metadata columns are specified. 3 have hard lower cut-offs (nCount_Gene.Expression, nCount_Peaks, and TSS.enrichment), and 1 (percent.mt) has its lower outliers detected automatically.
10X Genomics ATAC and Multiome kits use nuclei, so reads will not map to mitochondria. However, the workflow imputes a value of 0 for percent.mt in these assays, since missing values are not generally allowed in the underlying packages. This will not affect downstream processes such as normalization, dimensional reduction, clustering, etc.

Example:

qc:
  remove_outliers: true
  rm_outliers_method: sd
  meta_labels:
    - nCount_Gene.Expression
    - nCount_Peaks
    - percent.mt
    - TSS.enrichment
  lower:
    nCount_Gene.Expression: 100
    nCount_Peaks: 1000
    TSS.enrichment: 2
  upper: null
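The interplay between meta_labels, lower and upper in this example can be sketched as follows. The helper resolve_cutoffs is hypothetical, and the string "auto" merely stands in for whatever threshold the outlier method would detect.

```python
def resolve_cutoffs(qc, side):
    """Map each metadata column to its hard cut-off or to "auto".

    Hypothetical helper: "auto" stands for the threshold that the
    outlier method (rm_outliers_method) would detect.
    """
    overrides = qc.get(side) or {}  # null in YAML becomes None
    return {label: overrides.get(label, "auto") for label in qc["meta_labels"]}


qc = {
    "remove_outliers": True,
    "rm_outliers_method": "sd",
    "meta_labels": [
        "nCount_Gene.Expression",
        "nCount_Peaks",
        "percent.mt",
        "TSS.enrichment",
    ],
    "lower": {
        "nCount_Gene.Expression": 100,
        "nCount_Peaks": 1000,
        "TSS.enrichment": 2,
    },
    "upper": None,
}
print(resolve_cutoffs(qc, "lower"))
print(resolve_cutoffs(qc, "upper"))
```

With upper set to null, every column falls back to automatic detection on the upper side, while only percent.mt does so on the lower side.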

MACS Peak Calling (`macs2` section)

MACS specific parameters.

`run` field

string of "Y" or "N", default "Y". Determine whether or not to run MACS. Set to "N" for RNA-seq. Set to "Y" for ATAC and Multiome requiring MACS peak calling. If you don’t run MACS, delete analysis groups where assay_name corresponds to MACS in the remaining sections/fields (e.g. unintegrated_2).

`group_fragments_by` field

string, default "genome". samples.tsv metadata column used to generate the fragments file. All labels in the specified column must have the same value. This forces generation of a single fragments file for MACS peak calling. Do not change this setting except under special circumstances.

`fasta` field

string. A path to the FASTA reference genome used to map sequencing reads. Cell Ranger users can specify fasta/genome.fa in the reference directory that was used to run cellranger-atac count or cellranger-arc count.

`chromsizes` field

string, default "../reference/multiome.chromsizes" (Multiome) or "../reference/atac.chromsizes" (ATAC). A path to the .chromsizes file created from the FASTA reference genome.

Example:

macs2:
  run: "Y"
  group_fragments_by: genome
  fasta: "../reference/genome.fa"
  chromsizes: "../reference/multiome.chromsizes"
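For orientation, a chromsizes file is commonly a two-column, tab-separated table of sequence name and length. The sketch below derives such a table from FASTA text in plain Python. It is not the workflow's chromsizes rule, and the function name chrom_sizes is hypothetical.

```python
import io


def chrom_sizes(fasta_handle):
    """Yield (sequence name, length) pairs from an open FASTA file."""
    name, length = None, 0
    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Keep only the identifier before the first whitespace.
            fields = line[1:].split()
            name, length = (fields[0] if fields else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


fasta = io.StringIO(">chr1 example\nACGT\nAC\n>chr2\nGGG\n")
for name, length in chrom_sizes(fasta):
    print(f"{name}\t{length}")
```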

Normalization (`normalize` section)

Normalization and Principal Component Analysis (PCA).

`split_by` field

string. A metadata column in samples.tsv or aggregates.tsv. Datasets will be normalized, and dimensionality reduction using PCA will be performed on each dataset, split by this column. Note that Seurat integration will be performed based on the metadata column specified by split_by here and in the integrate section.

`groups` field

dict. Each group on which to perform normalization. Group name (key) must be unique. Do not modify the prefix (e.g. unintegrated and integrated) except under special circumstances.

`assay_name` field

string of Gene.Expression, Multiplexing.Capture, Peaks, Gene.Activity, or MACS. Which Seurat assay to use. Note that Seurat assay names are “.” delimited.

`norm_method` field

string of log, sct, clr or lsi, default is the following:

Gene.Expression: sct

Peaks: lsi

MACS: lsi

Gene.Activity: log

protein: clr

Method to normalize the group’s assay. Normalize using Log (log), SCTransform (sct), Centered log ratio (clr) or latent semantic indexing (lsi). Typically, 5’ or 3’ Gene expression is normalized using Log or SCTransform methods, ATAC Peaks using LSI, and protein using CLR.

Example:

normalize:
  split_by: meta_geno
  groups:
    unintegrated_0:
      assay_name: Gene.Expression
      norm_method: sct
    unintegrated_1:
      assay_name: Peaks
      norm_method: lsi
    unintegrated_2:
      assay_name: MACS
      norm_method: lsi
    unintegrated_3:
      assay_name: Gene.Activity
      norm_method: log
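A quick sanity check of the groups against the allowed values listed above might look like the sketch below. The helper check_normalize_groups is hypothetical, and the allowed sets are copied from the field descriptions on this page; custom assays may legitimately use other names.

```python
# Allowed values as listed in the assay_name and norm_method fields.
VALID_ASSAYS = {
    "Gene.Expression",
    "Multiplexing.Capture",
    "Peaks",
    "Gene.Activity",
    "MACS",
}
VALID_NORM_METHODS = {"log", "sct", "clr", "lsi"}


def check_normalize_groups(groups):
    """Return a list of problems found in the normalize groups (sketch)."""
    problems = []
    for name, group in groups.items():
        assay = group.get("assay_name")
        method = group.get("norm_method")
        if assay not in VALID_ASSAYS:
            problems.append(f"{name}: unexpected assay_name {assay!r}")
        if method not in VALID_NORM_METHODS:
            problems.append(f"{name}: unexpected norm_method {method!r}")
    return problems


groups = {
    "unintegrated_0": {"assay_name": "Gene.Expression", "norm_method": "sct"},
    # Seurat assay names are "." delimited, so the underscore is flagged.
    "unintegrated_3": {"assay_name": "Gene_Activity", "norm_method": "log"},
}
print(check_normalize_groups(groups))
# ["unintegrated_3: unexpected assay_name 'Gene_Activity'"]
```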

Integration (`integrate` section)

Remove technical/batch effects using Seurat integration methods. Integration rule will create a new Seurat object for each integration performed.

`activate` field

boolean, default true. Specify whether or not to run integration.

`atac_integrate_embeddings` field

boolean, default true. If true, integrate low-dimensional cell embeddings (LSI coordinates) across the datasets. This is the best option for integrating multiple ATAC Peaks data sets. If false, integrate (transform) the ATAC Peaks counts matrix across datasets (not LSI coordinates). This may overfit and is kept mainly for legacy support.

`split_by` field

string. A metadata column in samples.tsv or aggregates.tsv. Datasets are integrated based on this column. Ensure the same column is specified as in the split_by field of the normalize section above.

See Samples Table for more details about how metadata columns are detected.

`groups` field

dict. Each group to perform integration. Group name (key) must be unique.

`assay_name` field

string of Gene.Expression, Multiplexing.Capture, Peaks, Gene.Activity, or MACS. Ensure the same values are specified as in the normalize section above.

`norm_method` field

string of log, sct, clr or lsi. Ensure the same values are specified as in the norm_method field of the normalize section above.

`integrate_method` field

string of CCAIntegration, RPCAIntegration, HarmonyIntegration, FastMNNIntegration, scVIIntegration, or rlsi. Method to integrate unimodal datasets. For any datasets where norm_method is set to log or sct, this string is passed into the method argument of the IntegrateLayers function of Seurat. If the norm_method is set to lsi, set the integrate_method to rlsi to call the IntegrateEmbeddings function, as provided in Signac.
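The pairing rule between norm_method and integrate_method can be written down as a small check. The sketch below is a hypothetical helper that encodes only what is stated above; it returns None for any norm_method whose pairing is not described on this page.

```python
# Methods passed to the method argument of IntegrateLayers, as listed above.
LAYER_METHODS = {
    "CCAIntegration",
    "RPCAIntegration",
    "HarmonyIntegration",
    "FastMNNIntegration",
    "scVIIntegration",
}


def integrate_method_matches(norm_method, integrate_method):
    """Check a norm_method / integrate_method pair against the rule above.

    Returns True or False for log, sct and lsi, and None for any
    norm_method whose pairing is not described on this page.
    """
    if norm_method == "lsi":
        return integrate_method == "rlsi"
    if norm_method in ("log", "sct"):
        return integrate_method in LAYER_METHODS
    return None


print(integrate_method_matches("sct", "CCAIntegration"))  # True
print(integrate_method_matches("lsi", "CCAIntegration"))  # False
print(integrate_method_matches("lsi", "rlsi"))            # True
```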

`integrate_dims` field

list of 2 integers, default [1, 30]. Range of dimensions to use for integration step.

Example:

integrate:
  activate: true
  atac_integrate_embeddings: true
  split_by: meta_geno # this has to match the column name of your metadata indicating datasets for integration
  groups:
    integrated_0:
      assay_name: Gene.Expression
      norm_method: sct
      integrate_method: CCAIntegration  # RPCAIntegration, HarmonyIntegration, FastMNNIntegration, scVIIntegration
      integrate_dims:
        - 1
        - 30
    integrated_1:
      assay_name: Peaks
      norm_method: lsi
      integrate_method: rlsi
      integrate_dims:
        - 1
        - 30
    integrated_3:
      assay_name: Gene.Activity
      norm_method: log
      integrate_method: CCAIntegration
      integrate_dims:
        - 1
        - 30

Utilization of Toy Dataset (`dataset_size` config section)

Configure the use of a toy dataset. Users can take advantage of this functionality for technical purposes such as debugging. If the dataset size is smaller than the default k values used in kNN computation during integration, Seurat throws an error.

`toydataset` field

boolean, default false. If true, the computation is adjusted to handle toy datasets. If false, the input datasets are treated as normal datasets.

`toy_k` field

integer, number of neighbors used when weighting anchors. This value is passed to the k.weight argument in the IntegrateLayers function during integration.

Example:

dataset_size:
  toydataset: false
  toy_k: 10

Cluster Optimization (`chooser` config section)

Users can optimize clustering modularity using ChooseR with pipeline-specific modifications. This functionality is enabled only if the resolution field in the cluster section is set to null.

`groups` field

dict. Each group to perform clustering parameter optimization. Group name (key) must be unique. All groups in the normalize and integrate (if applicable) sections can be assigned.

`npcs` field

integer, default values:

Gene.Expression: 25

Peaks, MACS, or Gene.Activity: 20

The maximum number of linear reduced dimensions, computed from LSI or PCA, that are used during clustering.

`resolutions` field

list of numbers, default [0.6, 0.8, 1, 1.2, 1.4]. Resolutions to use when bootstrapping cluster methods. It is best to provide a range spanning the target resolution.

Warning

Specifying 1.0 instead of 1 can cause an error.
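One plausible reason, stated here as an assumption rather than confirmed behavior, is that resolutions end up in file paths, where 1.0 and 1 render as different strings. The sketch below shows the mismatch and a hypothetical helper, tidy_resolution, that rewrites whole-number floats as integers before they are written to the config.

```python
def tidy_resolution(value):
    """Return whole-number floats as int so that 1.0 is written as 1."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


resolutions = [0.6, 0.8, 1.0, 1.2, 1.4]
print([str(r) for r in resolutions])
# ['0.6', '0.8', '1.0', '1.2', '1.4']
print([str(tidy_resolution(r)) for r in resolutions])
# ['0.6', '0.8', '1', '1.2', '1.4']
```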

`silhouette` field

list of strings, default silhouette, frequency_grouped, and silhouette_grouped. Values are used during path parameter expansion in rules executing chooseR. It is advisable not to alter them.

Note

All group names specified in the normalize and integrate (if applicable) config sections must have a group entry in the chooser config section.

Example:

chooser:
  groups:
    unintegrated_0:
      npcs: 25
    unintegrated_1:
      npcs: 20
    unintegrated_2:
      npcs: 20
    unintegrated_3:
      npcs: 20
    integrated_0:
      npcs: 25
    integrated_1:
      npcs: 20
    integrated_2:
      npcs: 20
    integrated_3:
      npcs: 20
  resolutions:
    - 0.6
    - 0.8
    - 1
    - 1.2
    - 1.4
  silhouette:
    - silhouette
    - frequency_grouped
    - silhouette_grouped
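The coverage requirement in the note above can be checked mechanically. The sketch below is a hypothetical helper, missing_chooser_groups, that compares group names across sections of a parsed config; it counts integrate groups only when integration is activated.

```python
def missing_chooser_groups(config):
    """Return groups from normalize/integrate that lack a chooser entry."""
    required = set(config["normalize"]["groups"])
    integrate = config.get("integrate", {})
    if integrate.get("activate"):
        required |= set(integrate.get("groups") or {})
    present = set(config.get("chooser", {}).get("groups") or {})
    return sorted(required - present)


config = {
    "normalize": {"groups": {"unintegrated_0": {}, "unintegrated_1": {}}},
    "integrate": {"activate": True, "groups": {"integrated_0": {}}},
    "chooser": {"groups": {"unintegrated_0": {"npcs": 25}}},
}
print(missing_chooser_groups(config))
# ['integrated_0', 'unintegrated_1']
```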

Clustering (`cluster` config section)

Users can set a specific resolution for clustering or rely on a dataset-optimized resolution computed using chooser.

`detection_method` field

integer, default 3. Algorithm used for community detection during unimodal clustering. Available options are:

1: original Louvain algorithm

2: Louvain algorithm with multilevel refinement

3: SLM algorithm

4: Leiden algorithm (requires the leidenalg Python package)

This value is passed to the algorithm argument of the FindClusters function. Refer to Seurat Cluster Determination for more details.

`resolution` field

float or null, default null. If null, clustering is performed using an optimized resolution computed by chooser.

Example:

cluster:
  detection_method: 3
  resolution: null

Weighted Nearest Neighbor (`weighted_nn` config section)

This section configures how to perform Weighted Nearest Neighbor (WNN) analysis. WNN is similar to shared nearest neighbor (SNN), which is commonly used to build graphs for multiple modalities. WNN uses a list of weights from each specified modality, and is useful for incorporating low dimensional embeddings from multiple single cell modalities into a global reduced dimensional space.

Note

All cells for specified assays/groups must have identical barcodes, meaning this rule is currently suitable ONLY for multimodal data. For example 3’ Gene Expression + CRISPR barcodes (Perturb-Seq), 3’ Gene Expression + Protein barcodes (CITE-Seq), 10X Genomics Multiome (Gene Expression + ATAC), etc.
Disable this functionality if the input dataset is not multimodal.

`activate` field

boolean, default true (Multiome) or false (RNA/ATAC). Specify whether or not to run the coembed rule.

`groups` field

dict. Each group to perform weighted nearest neighbor analysis. Group name (key) must be unique.

`input_groups` field

list of strings, default:

wnn_0: unintegrated_0 and unintegrated_1

wnn_1: integrated_0 and integrated_1

groups dictionary values from normalize and integrate config sections. Remember, unless performing multimodal integration, each group value corresponds to an assay. So in our example from the normalize config section, specifying unintegrated_0 and unintegrated_1 would combine the reduced dimensional weights of Gene.Expression and Peaks during WNN clustering.

`reduction` field

list of strings, default pca and lsi. Dimensionality reduction method used for a specified group. In our example from the normalize config section, specifying unintegrated_0 and unintegrated_1 would look for Gene.Expression reduced dimensions in the pca slot and Peaks reduced dimensions in the lsi slot during WNN clustering.

`umap_dims` field

list of lists of 2 integers, default [[1, 25], [2, 20]]. Dimensions to use for UMAP visualization for each specified group.

`resolution` field

float, default 0.6. Resolution to use during community detection for multimodal clustering.

`detection_method` field

integer, default 3. Algorithm used for community detection during multimodal clustering. Refer to the detection_method field in the cluster section above.

Example:

weighted_nn:
  activate: true
  groups:
    wnn_0:
      input_groups:
        - unintegrated_0 # corresponds to SCT
        - unintegrated_1 # corresponds to Peaks
      reduction:
        - pca
        - lsi
      umap_dims:
        - - 1
          - 25
        - - 1
          - 20
      resolution: 0.6
      detection_method: 3
    wnn_1:
      input_groups:
        - integrated_0 # corresponds to SCT
        - integrated_1 # corresponds to Peaks
      reduction:
        - integrated_pca
        - integrated_lsi
      umap_dims:
        - - 1
          - 25
        - - 1
          - 20
      resolution: 0.6
      detection_method: 3
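In the example, input_groups, reduction and umap_dims each have one entry per modality. The sketch below checks that alignment with hypothetical helpers; reading each umap_dims pair as an inclusive [first, last] range is an assumption based on the example values.

```python
def check_wnn_group(name, group):
    """Return a list of inconsistencies in one weighted_nn group (sketch)."""
    problems = []
    n_inputs = len(group["input_groups"])
    for field in ("reduction", "umap_dims"):
        if len(group[field]) != n_inputs:
            problems.append(
                f"{name}: {field} has {len(group[field])} entries, "
                f"expected {n_inputs}"
            )
    for dims in group["umap_dims"]:
        if len(dims) != 2 or dims[0] > dims[1]:
            problems.append(f"{name}: umap_dims entry {dims} is not [first, last]")
    return problems


def expand_dims(dims):
    """Turn a [first, last] pair into the full list of dimensions."""
    first, last = dims
    return list(range(first, last + 1))


wnn_0 = {
    "input_groups": ["unintegrated_0", "unintegrated_1"],
    "reduction": ["pca", "lsi"],
    "umap_dims": [[1, 25], [1, 20]],
}
print(check_wnn_group("wnn_0", wnn_0))  # []
print(expand_dims([1, 5]))              # [1, 2, 3, 4, 5]
```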

Differential Testing (`diff_analysis` config section)

This section configures differential testing (i.e. differential gene expression, chromatin accessibility, TF motifs) using the FindAllMarkers function in Seurat.

`activate` field

boolean, default true. Specify whether or not to run differential testing.

`groups` field

dict. Each group to perform differential testing. Group name (key) must be unique.

`cluster_idents` field

string, default seurat_clusters. Which Seurat metadata column to use as labels for differential testing. Equivalent to running Idents(obj) <- cluster_idents before FindAllMarkers(obj).

`assay` field

string, default null. Which assay to use for differential testing. This value is passed to the assay argument of the FindAllMarkers function.

`slot` field

string, default data. Which slot to pull data from. This value is passed to the slot argument of the FindAllMarkers function.

`min_pct` field

float or null, default null. Only test genes that are detected in a minimum fraction of cells in either of the two populations. If null, a default value of 0.01 is applied. This value is passed to the min.pct argument of the FindAllMarkers function.

`test_use` field

string, default null. Test used for differential testing. This value is passed to the test.use argument of the FindAllMarkers function. If null, the Wilcoxon Rank Sum test is used by default. Available methods are:

wilcox: Wilcoxon Rank Sum test

wilcox_limma: Limma implementation of the Wilcoxon Rank Sum test (Use this to reproduce results from Seurat v4)

bimod: Likelihood-ratio test

roc: ROC analysis

t: Student’s t-test

negbinom: Negative binomial generalized linear model

poisson: Poisson generalized linear model

LR: Logistic regression model

MAST: MAST framework.

DESeq2: DESeq2 framework (requires the DESeq2 package to be installed in R)

For more details, refer to the FindAllMarkers function in Seurat.

`latent_vars` field

string, default null. Variables to test, used only when test_use is one of LR, negbinom, poisson, or MAST. This value is passed to the latent.vars argument of the FindAllMarkers function.
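The mapping from config fields to FindAllMarkers argument names, including the null fall-backs described above, can be sketched as follows. The helper find_all_markers_args is hypothetical and only assembles a dictionary; it does not call Seurat, and the workflow's own handling may differ.

```python
# Tests for which latent_vars is used, as listed above.
LATENT_VAR_TESTS = {"LR", "negbinom", "poisson", "MAST"}


def find_all_markers_args(group):
    """Translate one diff_analysis group into FindAllMarkers argument names."""
    test_use = group.get("test_use") or "wilcox"  # null falls back to Wilcoxon
    min_pct = group.get("min_pct")
    args = {
        "assay": group.get("assay"),
        "slot": group.get("slot", "data"),
        "min.pct": 0.01 if min_pct is None else min_pct,
        "test.use": test_use,
    }
    # latent.vars only applies to the tests listed above.
    if test_use in LATENT_VAR_TESTS:
        args["latent.vars"] = group.get("latent_vars")
    return args


group = {
    "cluster_idents": "seurat_clusters",
    "assay": None,
    "slot": "data",
    "min_pct": 0.2,
    "test_use": None,
    "latent_vars": "nCount_Peaks",
    "alpha": 0.05,
}
print(find_all_markers_args(group))
```

With test_use left as null, the sketch drops latent_vars because the Wilcoxon test does not use it; setting test_use to LR, negbinom, poisson or MAST would pass it through.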

`alpha` field

float, default 0.05. False discovery rate (FDR) threshold to filter significant marker genes.

Note

Only include group names specified in the normalize, integrate (if applicable) and weighted_nn (if applicable) config sections.

Warning

For the current version of multiome-wf, LR has a bug where it grabs more nodes than allocated on a cluster node. Do not use LR on a cluster node.

Example:

diff_analysis:
  activate: true
  groups:
    unintegrated_0:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: null
      test_use: null
      latent_vars: null
      alpha: 0.05
    unintegrated_1:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: 0.2
      test_use: null
      latent_vars: 'nCount_Peaks'
      alpha: 0.05
    unintegrated_2:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: 0.2
      test_use: null
      latent_vars: 'nCount_MACS'
      alpha: 0.05
    unintegrated_3:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: 0.2
      test_use: null
      latent_vars: 'nCount_Gene.Activity'
      alpha: 0.05
    integrated_0:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: null
      test_use: null
      latent_vars: null
      alpha: 0.05
    integrated_1:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: 0.2
      test_use: null
      latent_vars: 'nCount_Peaks'
      alpha: 0.05
    integrated_3:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: 0.2
      test_use: null
      latent_vars: 'nCount_Gene.Activity'
      alpha: 0.05
    wnn_0:
      cluster_idents: seurat_clusters
      assay: SCT
      slot: data
      min_pct: null
      test_use: null
      latent_vars: null
      alpha: 0.05
    wnn_1:
      cluster_idents: seurat_clusters
      assay: SCT
      slot: data
      min_pct: null
      test_use: null
      latent_vars: null
      alpha: 0.05

Example

A basic example of a config.yaml file using 2 Multiome batches is provided below. The analysis will be performed on all samples with and without integration, followed by clustering and differential testing. This example also includes automated optimization of clustering parameters.

See Overview of workflows for more detailed examples of config files.

samples: config/multiome-config/samples.tsv

aggregates: config/multiome-config/aggregates.tsv

assays: config/multiome-config/assays.tsv

ANNOTATION: "EnsDb"
ANNO_FILE: "path/to/genes.gtf.gz"

qc:
  remove_outliers: true
  rm_outliers_method: sd
  meta_labels:
    - nCount_Gene.Expression
    - nCount_Peaks
    - percent.mt
    - TSS.enrichment
  lower:
    nCount_Gene.Expression: 100
    nCount_Peaks: 1000
    TSS.enrichment: 2
  upper: null

macs2:
  run: "Y"
  group_fragments_by: genome
  fasta: "../reference/genome.fa"
  chromsizes: "../reference/multiome.chromsizes"

normalize:
  split_by: meta_geno
  groups:
    unintegrated_0:
      assay_name: Gene.Expression
      norm_method: sct
    unintegrated_1:
      assay_name: Peaks
      norm_method: lsi
    unintegrated_2:
      assay_name: MACS
      norm_method: lsi
    unintegrated_3:
      assay_name: Gene.Activity
      norm_method: log

integrate:
  activate: true
  atac_integrate_embeddings: true
  split_by: meta_geno
  groups:
    integrated_0:
      assay_name: Gene.Expression
      norm_method: sct
      integrate_method: CCAIntegration
      integrate_dims:
        - 1
        - 30
    integrated_1:
      assay_name: Peaks
      norm_method: lsi
      integrate_method: rlsi
      integrate_dims:
        - 1
        - 30
    integrated_3:
      assay_name: Gene.Activity
      norm_method: log
      integrate_method: CCAIntegration
      integrate_dims:
        - 1
        - 30

dataset_size:
  toydataset: false
  toy_k: 10

chooser:
  groups:
    unintegrated_0:
      npcs: 25
    unintegrated_1:
      npcs: 20
    unintegrated_2:
      npcs: 20
    unintegrated_3:
      npcs: 20
    integrated_0:
      npcs: 25
    integrated_1:
      npcs: 20
    integrated_2:
      npcs: 20
    integrated_3:
      npcs: 20
  resolutions:
    - 0.6
    - 0.8
    - 1
    - 1.2
    - 1.4
  silhouette:
    - silhouette
    - frequency_grouped
    - silhouette_grouped

cluster:
  detection_method: 3
  resolution: null

weighted_nn:
  activate: true
  groups:
    wnn_0:
      input_groups:
        - unintegrated_0 # corresponds to SCT
        - unintegrated_1 # corresponds to Peaks
      reduction:
        - pca
        - lsi
      umap_dims:
        - - 1
          - 25
        - - 1
          - 20
      resolution: 0.6
      detection_method: 3
    wnn_1:
      input_groups:
        - integrated_0 # corresponds to SCT
        - integrated_1 # corresponds to Peaks
      reduction:
        - integrated_pca
        - integrated_lsi
      umap_dims:
        - - 1
          - 25
        - - 1
          - 20
      resolution: 0.6
      detection_method: 3

diff_analysis:
  activate: true
  groups:
    unintegrated_0:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: null
      test_use: null
      latent_vars: null
      alpha: 0.05
    unintegrated_1:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: 0.2
      test_use: null
      latent_vars: 'nCount_Peaks'
      alpha: 0.05
    unintegrated_2:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: 0.2
      test_use: null
      latent_vars: 'nCount_MACS'
      alpha: 0.05
    unintegrated_3:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: 0.2
      test_use: null
      latent_vars: 'nCount_Gene.Activity'
      alpha: 0.05
    integrated_0:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: null
      test_use: null
      latent_vars: null
      alpha: 0.05
    integrated_1:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: 0.2
      test_use: null
      latent_vars: 'nCount_Peaks'
      alpha: 0.05
    integrated_3:
      cluster_idents: seurat_clusters
      assay: null
      slot: data
      min_pct: 0.2
      test_use: null
      latent_vars: 'nCount_Gene.Activity'
      alpha: 0.05
    wnn_0:
      cluster_idents: seurat_clusters
      assay: SCT
      slot: data
      min_pct: null
      test_use: null
      latent_vars: null
      alpha: 0.05
    wnn_1:
      cluster_idents: seurat_clusters
      assay: SCT
      slot: data
      min_pct: null
      test_use: null
      latent_vars: null
      alpha: 0.05
