.. _config-yaml:

Config YAML
===========

This page details the various configuration options and describes how to configure a new workflow. Refer to the :ref:`config` section for general information about configuring `multiome-wf`.

While it is possible to use Snakemake mechanisms such as ``--config`` to override a particular config value and ``--configfile`` to update the config with a different file, it is easiest to edit the existing ``config.yaml`` in place. This has the additional benefit of reproducibility because all of the config information is stored in one place.

The config file uses YAML format, which can be conceptualized as a set of nested key:value pairs. When running the workflow, the YAML document is parsed into a Python dictionary.

By specifying values in various sections of the ``config.yaml``, the workflow automatically decides to run analysis variants suitable for scRNA-Seq, scATAC-Seq, or multi-modal experiments. There are 2 important points to keep in mind when creating a ``config.yaml``.

1. Activating Rules
~~~~~~~~~~~~~~~~~~~

The following rules are optional:

- ``chromsizes``/``merge_macs_prep``/``macs2``/``add_macs_peaks``/``bigwig_signal``/``bigwig_noise`` for MACS2 peak calling
- ``integrate`` for dataset integration using Seurat
- ``chooser_paral``/``chooser_aggr`` for computing optimal resolution in clustering using chooseR
- ``diff_analysis`` for marker gene computation
- ``weighted_nn`` for Weighted Nearest Neighbor Analysis using Seurat (cross-modality integration)

These rules have discrete sections in the ``config.yaml`` where users configure the execution of each rule. Use the following settings to activate or deactivate each rule:

.. code-block:: yaml

    # Activate MACS2 peak calling
    macs2:
      run: "Y"

    # Activate dataset integration
    integrate:
      activate: true

    # Activate chooseR
    cluster:
      resolution: null

    # Activate marker gene computation
    diff_analysis:
      activate: true

    # Activate Weighted Nearest Neighbor
    weighted_nn:
      activate: true

Note that the ``chooser_paral`` and ``chooser_aggr`` rules only run when no pre-defined ``resolution`` is provided by the user in the ``cluster`` section.

2. Analysis Groups
~~~~~~~~~~~~~~~~~~

Because of the myriad variants for single cell analysis and preprocessing, it is not possible to hard-code all the configuration options in the ``config.yaml`` file. Instead, we include analysis "group names" in many sections. These sections have a field named ``groups``, which must contain one nested dictionary per analysis variant. To configure these sections, the user must specify the top-level dictionary key (the group name). All other keys are hard-coded as options.

Using the ``normalize`` section as an example, the block below defines four analysis groups. The first group name, ``unintegrated_0``, is itself a dictionary key for this analysis variant (a modality for RNA in Multiome). This group's dictionary contains additional fields which together define the group's analysis options: ``assay_name`` and ``norm_method``.

.. code-block:: yaml

    groups:
      unintegrated_0:
        assay_name: Gene.Expression
        norm_method: sct
      unintegrated_1:
        assay_name: Peaks
        norm_method: lsi
      unintegrated_2:
        assay_name: MACS
        norm_method: lsi
      unintegrated_3:
        assay_name: Gene.Activity
        norm_method: log

.. note::

   It is possible to specify more analysis groups than the number of assays in your data. **Do not** specify analysis groups unless your experiment setup supports the condition.

   For example, in the example ``config.yaml`` file, the ``diff_analysis`` section contains group key names such as ``unintegrated_0`` and ``integrated_0``. If you are not performing Seurat integration (``activate`` set to ``false`` in the ``integrate`` section), delete the ``integrated_*`` groups from the remaining sections. If there are superfluous groups in the ``config.yaml``, Snakemake will add extra, unwanted rules/jobs when building the DAG.

Field descriptions
~~~~~~~~~~~~~~~~~~

Config Tables
-------------

``samples`` field
^^^^^^^^^^^^^^^^^

string, default ``samples.tsv``. Defines the path to the sample table. See :ref:`samples-table` for more.

Example:

.. code-block:: yaml

    samples: "config/multiome-config/samples.tsv"
    # OR
    # samples: "config/atac-config/samples.tsv" for scATAC-seq
    # samples: "config/rna-config/samples.tsv" for scRNA-seq

``aggregates`` field
^^^^^^^^^^^^^^^^^^^^

string, default ``aggregates.tsv``. Defines the path to the aggregates table. If you are using aggregated input of multiple samples created using ``cellranger-arc aggr`` (Multiome), ``cellranger-atac aggr`` (scATAC-seq), or ``cellranger aggr`` (scRNA-seq), specify the path to ``aggregates.tsv``. Otherwise, set to an empty string (``""``). See :ref:`aggregates-table` for more.

Example:

.. code-block:: yaml

    aggregates: "config/multiome-config/aggregates.tsv"
    # OR
    # aggregates: "config/atac-config/aggregates.tsv" for scATAC-seq
    # aggregates: "config/rna-config/aggregates.tsv" for scRNA-seq

``assays`` field
^^^^^^^^^^^^^^^^

string, default ``assays.tsv``. Defines the path to the assays table. If you are using custom counts matrices, specify the path to ``assays.tsv``. Otherwise, set to an empty string (``""``). See :ref:`assays-table` for more.

Example:

.. code-block:: yaml

    assays: "config/multiome-config/assays.tsv"
    # OR
    # assays: "config/atac-config/assays.tsv" for scATAC-seq
    # assays: "config/rna-config/assays.tsv" for scRNA-seq

Annotation
----------

.. _config-annotation:

``ANNOTATION`` field
^^^^^^^^^^^^^^^^^^^^

string of ``"EnsDb"`` or ``"GTF"``, default ``"EnsDb"``. Defines the method to build an annotation object (``GenomicRanges``) for scATAC-seq and Multiome analyses.

- ``"EnsDb"`` uses the ``EnsDb.Mmusculus.v79`` (mouse mm10) or ``EnsDb.Hsapiens.v86`` (human hg38) package in R
- ``"GTF"`` uses a user-provided annotation file

``ANNO_FILE`` field
^^^^^^^^^^^^^^^^^^^

string, default ``"path/to/genes.gtf.gz"``. If ``"GTF"`` is specified in the ``ANNOTATION`` field, provide the path to your annotation file (e.g. ``genes.gtf.gz``). This field is disregarded if the ``ANNOTATION`` field is set to ``"EnsDb"``.

Quality Control (``qc`` section)
--------------------------------

``remove_outliers`` field
^^^^^^^^^^^^^^^^^^^^^^^^^

boolean, default ``true``. Specify whether or not to run the ``qc`` rule.

``rm_outliers_method`` field
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

string of ``"sd"`` or ``"iqr"``, default ``"sd"``. Detect outliers using either standard deviation (``"sd"``) or Tukey's interquartile range (``"iqr"``). If set to ``"sd"``, the thresholds are determined based on +/- 3 standard deviations.

``meta_labels`` field
^^^^^^^^^^^^^^^^^^^^^

list. Metadata columns to use for filtering. See :ref:`samples-table` for more details about how metadata columns are detected. If a value is specified in this field but is not present in the data, it will be disregarded during filtering.

``lower`` field
^^^^^^^^^^^^^^^

dict. Key:value pairs of metadata column and associated lower limit for cutoff, **below** which cells are excluded. If specified, overrides the lower limit detected by the outlier method for associated metadata columns in ``meta_labels``. If ``null``, the outlier method tries to remove cells automatically.

``upper`` field
^^^^^^^^^^^^^^^

dict. Key:value pairs of metadata column and associated upper limit for cutoff, **above** which cells are excluded.
If specified, overrides the upper limit detected by the outlier method for associated metadata columns in ``meta_labels``. If ``null``, the outlier method tries to remove cells automatically.

.. note::

   - In the example below, 4 metadata columns are specified: 3 have hard lower cut-offs (``nCount_Gene.Expression``, ``nCount_Peaks``, and ``TSS.enrichment``), and 1 (``percent.mt``) relies on automatic outlier detection.
   - 10X Genomics ATAC and Multiome kits use nuclei, so reads will not map to mitochondria. However, the workflow imputes a value of 0 for ``percent.mt`` in these assays, since missing values are not generally allowed in the underlying packages. This will not affect downstream processes such as normalization, dimensional reduction, clustering, etc.

Example:

.. code-block:: yaml

    qc:
      remove_outliers: true
      rm_outliers_method: sd
      meta_labels:
        - nCount_Gene.Expression
        - nCount_Peaks
        - percent.mt
        - TSS.enrichment
      lower:
        nCount_Gene.Expression: 100
        nCount_Peaks: 1000
        TSS.enrichment: 2
      upper: null

.. _macs-peakcalling:

MACS Peak Calling (``macs2`` section)
-------------------------------------

MACS specific parameters.

``run`` field
^^^^^^^^^^^^^

string of ``"Y"`` or ``"N"``, default ``"Y"``. Determines whether or not to run MACS. Set to ``"N"`` for RNA-seq. Set to ``"Y"`` for ATAC and Multiome requiring MACS peak calling. If you do not run MACS, delete analysis groups where ``assay_name`` corresponds to ``MACS`` in the remaining sections/fields (e.g. ``unintegrated_2``).

``group_fragments_by`` field
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

string, default ``"genome"``. ``samples.tsv`` metadata column used to generate the fragments file. All labels in the specified column must have the same value. This forces generation of a single fragments file for MACS peak calling. Do not change this setting except under special circumstances.

``fasta`` field
^^^^^^^^^^^^^^^

string. A path to the FASTA reference genome used to map sequencing reads.
Cell Ranger users can specify ``fasta/genome.fa`` in the reference directory that was used to run ``cellranger-atac count`` or ``cellranger-arc count``.

``chromsizes`` field
^^^^^^^^^^^^^^^^^^^^

string, default ``"../reference/multiome.chromsizes"`` (Multiome) or ``"../reference/atac.chromsizes"`` (ATAC). A path to the ``.chromsizes`` file created from the FASTA reference genome.

Example:

.. code-block:: yaml

    macs2:
      run: "Y"
      group_fragments_by: genome
      fasta: "../reference/genome.fa"
      chromsizes: "../reference/multiome.chromsizes"

Normalization (``normalize`` section)
-------------------------------------

Normalization and Principal Component Analysis (PCA).

.. _split-by:

``split_by`` field
^^^^^^^^^^^^^^^^^^

string. A metadata column in ``samples.tsv`` or ``aggregates.tsv``. Datasets are split by this column, and normalization and PCA-based dimensionality reduction are performed on each resulting dataset. Note that Seurat integration will be performed based on the metadata column specified by ``split_by`` here and in the ``integrate`` section.

``groups`` field
^^^^^^^^^^^^^^^^

dict. One entry per group to normalize. Each group name (key) must be unique. Do not modify the prefix (e.g. ``unintegrated`` and ``integrated``) except under special circumstances.

``assay_name`` field
^^^^^^^^^^^^^^^^^^^^

string of ``Gene.Expression``, ``Multiplexing.Capture``, ``Peaks``, ``Gene.Activity``, or ``MACS``. Which Seurat assay to use. Note that Seurat assay names are "." delimited.

.. _norm-method:

``norm_method`` field
^^^^^^^^^^^^^^^^^^^^^

string of ``log``, ``sct``, ``clr`` or ``lsi``. The default depends on the assay:

- ``Gene.Expression``: ``sct``
- ``Peaks``: ``lsi``
- ``MACS``: ``lsi``
- ``Gene.Activity``: ``log``
- ``protein``: ``clr``

Method to normalize the group's assay. Normalize using Log (``log``), SCTransform (``sct``), Centered log ratio (``clr``) or latent semantic indexing (``lsi``).
Typically, 5' or 3' Gene expression is normalized using Log or SCTransform methods, ATAC Peaks using LSI, and protein using CLR.

Example:

.. code-block:: yaml

    normalize:
      split_by: meta_geno
      groups:
        unintegrated_0:
          assay_name: Gene.Expression
          norm_method: sct
        unintegrated_1:
          assay_name: Peaks
          norm_method: lsi
        unintegrated_2:
          assay_name: MACS
          norm_method: lsi
        unintegrated_3:
          assay_name: Gene.Activity
          norm_method: log

Integration (``integrate`` section)
-----------------------------------

Remove technical/batch effects using Seurat integration methods. The integration rule creates a new Seurat object for each integration performed.

``activate`` field
^^^^^^^^^^^^^^^^^^

boolean, default ``true``. Specify whether or not to run integration.

``atac_integrate_embeddings`` field
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

boolean, default ``true``. If ``true``, integrate low-dimensional cell embeddings (LSI coordinates) across the datasets. This is the best option for integrating multiple ATAC Peaks datasets. If ``false``, integrate (transform) the ATAC Peaks counts matrix across datasets (not LSI coordinates). This may overfit and is kept mainly for legacy support.

``split_by`` field
^^^^^^^^^^^^^^^^^^

string. A metadata column in ``samples.tsv`` or ``aggregates.tsv``. Datasets are integrated based on this column. Ensure the same column is specified as in the :ref:`split-by` of the ``normalize`` section above. See :ref:`samples-table` for more details about how metadata columns are detected.

``groups`` field
^^^^^^^^^^^^^^^^

dict. One entry per group to integrate. Each group name (key) must be unique.

``assay_name`` field
^^^^^^^^^^^^^^^^^^^^

string of ``Gene.Expression``, ``Multiplexing.Capture``, ``Peaks``, ``Gene.Activity``, or ``MACS``. Ensure the same values are specified as in the ``normalize`` section above.

``norm_method`` field
^^^^^^^^^^^^^^^^^^^^^

string of ``log``, ``sct``, ``clr`` or ``lsi``.
Ensure the same values are specified as in the :ref:`norm-method` of the ``normalize`` section above.

``integrate_method`` field
^^^^^^^^^^^^^^^^^^^^^^^^^^

string of ``CCAIntegration``, ``RPCAIntegration``, ``HarmonyIntegration``, ``FastMNNIntegration``, ``scVIIntegration``, or ``rlsi``. Method to integrate unimodal datasets. For any dataset where ``norm_method`` is set to ``log`` or ``sct``, this string is passed to the ``method`` argument of the ``IntegrateLayers`` function of Seurat. If ``norm_method`` is set to ``lsi``, set ``integrate_method`` to ``rlsi`` to call the ``IntegrateEmbeddings`` function, as provided in Signac.

``integrate_dims`` field
^^^^^^^^^^^^^^^^^^^^^^^^

list of 2 integers, default ``[1, 30]``. Range of dimensions to use for the integration step.

Example:

.. code-block:: yaml

    integrate:
      activate: true
      atac_integrate_embeddings: true
      split_by: meta_geno  # this has to match the column name of your metadata indicating datasets for integration
      groups:
        integrated_0:
          assay_name: Gene.Expression
          norm_method: sct
          integrate_method: CCAIntegration  # RPCAIntegration, HarmonyIntegration, FastMNNIntegration, scVIIntegration
          integrate_dims:
            - 1
            - 30
        integrated_1:
          assay_name: Peaks
          norm_method: lsi
          integrate_method: rlsi
          integrate_dims:
            - 1
            - 30
        integrated_3:
          assay_name: Gene.Activity
          norm_method: log
          integrate_method: CCAIntegration
          integrate_dims:
            - 1
            - 30

Utilization of Toy Dataset (``dataset_size`` config section)
------------------------------------------------------------

Configures the use of a toy dataset. Users can take advantage of this functionality for technical purposes such as debugging. If the dataset size is smaller than the default k values used in kNN computation during integration, Seurat throws an error.

``toydataset`` field
^^^^^^^^^^^^^^^^^^^^

boolean, default ``false``. If ``true``, the computation is adjusted to handle toy datasets. If ``false``, the input datasets are treated as normal datasets.
``toy_k`` field
^^^^^^^^^^^^^^^

integer, number of neighbors used when weighting anchors. This value is passed to the ``k.weight`` argument in the ``IntegrateLayers`` function during integration.

Example:

.. code-block:: yaml

    dataset_size:
      toydataset: false
      toy_k: 10

Cluster Optimization (``chooser`` config section)
-------------------------------------------------

Users can optimize clustering modularity using chooseR with pipeline-specific modifications. This functionality is enabled only if the ``resolution`` field in the ``cluster`` section is set to ``null``.

``groups`` field
^^^^^^^^^^^^^^^^

dict. One entry per group for which clustering parameters are optimized. Each group name (key) must be unique. All groups in the ``normalize`` and ``integrate`` (if applicable) sections can be assigned.

``npcs`` field
^^^^^^^^^^^^^^

integer, default values:

- ``Gene.Expression``: 25
- ``Peaks``, ``MACS``, or ``Gene.Activity``: 20

The maximum number of linear reduced dimensions, computed from LSI or PCA, that are used during clustering.

``resolutions`` field
^^^^^^^^^^^^^^^^^^^^^

list of numbers, default ``[0.6, 0.8, 1, 1.2, 1.4]``. Resolutions to use when bootstrapping cluster methods. It is best to choose a range that spans the target resolution.

.. warning::

   Specifying ``1.0`` instead of ``1`` can cause an error.

``silhouette`` field
^^^^^^^^^^^^^^^^^^^^

list of strings, default ``silhouette``, ``frequency_grouped``, and ``silhouette_grouped``. Values are used during path parameter expansion in rules executing chooseR. It is advisable not to alter them.

.. note::

   All ``groups`` values specified in the ``normalize`` and ``integrate`` (if applicable) config sections **must** have a group entry in the ``chooser`` config section.

Example:

.. code-block:: yaml

    chooser:
      groups:
        unintegrated_0:
          npcs: 25
        unintegrated_1:
          npcs: 20
        unintegrated_2:
          npcs: 20
        unintegrated_3:
          npcs: 20
        integrated_0:
          npcs: 25
        integrated_1:
          npcs: 20
        integrated_2:
          npcs: 20
        integrated_3:
          npcs: 20
      resolutions:
        - 0.6
        - 0.8
        - 1
        - 1.2
        - 1.4
      silhouette:
        - silhouette
        - frequency_grouped
        - silhouette_grouped

Clustering (``cluster`` config section)
---------------------------------------

Users can set a specific resolution for clustering or rely on a dataset-optimized resolution computed using ``chooser``.

.. _detection-method:

``detection_method`` field
^^^^^^^^^^^^^^^^^^^^^^^^^^

integer, default ``3``. Algorithm used for community detection during unimodal clustering. Available options are:

- ``1``: original Louvain algorithm
- ``2``: Louvain algorithm with multilevel refinement
- ``3``: SLM algorithm
- ``4``: Leiden algorithm (requires the ``leidenalg`` Python package)

This value is passed to the ``algorithm`` argument of the ``FindClusters`` function. Refer to the Seurat Cluster Determination documentation for more details.

``resolution`` field
^^^^^^^^^^^^^^^^^^^^

float or ``null``, default ``null``. If ``null``, clustering is performed using an optimized resolution computed by ``chooser``.

Example:

.. code-block:: yaml

    cluster:
      detection_method: 3
      resolution: null

Weighted Nearest Neighbor (``weighted_nn`` config section)
----------------------------------------------------------

This section configures how to perform Weighted Nearest Neighbor (WNN) analysis. WNN is similar to shared nearest neighbor (SNN), which is commonly used to build graphs for multiple modalities. WNN uses a list of weights from each specified modality, and is useful for incorporating low dimensional embeddings from multiple single cell modalities into a global reduced dimensional space.

.. note::

   - All cells for specified assays/groups **must have identical barcodes**, meaning this rule is currently suitable ONLY for multimodal data. Examples include 3' Gene Expression + CRISPR barcodes (Perturb-Seq), 3' Gene Expression + Protein barcodes (CITE-Seq), 10X Genomics Multiome (Gene Expression + ATAC), etc.
   - Disable this functionality if the input dataset is not multimodal.

``activate`` field
^^^^^^^^^^^^^^^^^^

boolean, default ``true`` (Multiome) or ``false`` (RNA/ATAC). Specify whether or not to run the coembed rule.

``groups`` field
^^^^^^^^^^^^^^^^

dict. One entry per weighted nearest neighbor analysis. Each group name (key) must be unique.

``input_groups`` field
^^^^^^^^^^^^^^^^^^^^^^

list of strings, default:

- ``wnn_0``: ``unintegrated_0`` and ``unintegrated_1``
- ``wnn_1``: ``integrated_0`` and ``integrated_1``

Group names (keys of the ``groups`` dictionaries) from the ``normalize`` and ``integrate`` config sections. Remember, unless performing multimodal integration, each group corresponds to an assay. So in our example from the ``normalize`` config section, specifying ``unintegrated_0`` and ``unintegrated_1`` would combine the reduced dimensional weights of ``Gene.Expression`` and ``Peaks`` during WNN clustering.

``reduction`` field
^^^^^^^^^^^^^^^^^^^

list of strings, default ``pca`` and ``lsi``. Dimensionality reduction method used for a specified group. In our example from the ``normalize`` config section, specifying ``unintegrated_0`` and ``unintegrated_1`` would look for ``Gene.Expression`` reduced dimensions in the ``pca`` slot and ``Peaks`` reduced dimensions in the ``lsi`` slot during WNN clustering.

``umap_dims`` field
^^^^^^^^^^^^^^^^^^^

list of 2-integer lists, default ``[[1, 25], [2, 20]]``. Dimensions to use for UMAP visualization for a specified group.

``resolution`` field
^^^^^^^^^^^^^^^^^^^^

float, default ``0.6``. Resolution to use during community detection for multimodal clustering.

``detection_method`` field
^^^^^^^^^^^^^^^^^^^^^^^^^^

integer, default ``3``. Algorithm used for community detection during multimodal clustering.
Refer to the :ref:`detection-method` in the ``cluster`` section above.

Example:

.. code-block:: yaml

    weighted_nn:
      activate: true
      groups:
        wnn_0:
          input_groups:
            - unintegrated_0  # corresponds to SCT
            - unintegrated_1  # corresponds to Peaks
          reduction:
            - pca
            - lsi
          umap_dims:
            - - 1
              - 25
            - - 1
              - 20
          resolution: 0.6
          detection_method: 3
        wnn_1:
          input_groups:
            - integrated_0  # corresponds to SCT
            - integrated_1  # corresponds to Peaks
          reduction:
            - integrated_pca
            - integrated_lsi
          umap_dims:
            - - 1
              - 25
            - - 1
              - 20
          resolution: 0.6
          detection_method: 3

Differential Testing (``diff_analysis`` config section)
-------------------------------------------------------

This section configures differential testing (i.e. differential gene expression, chromatin accessibility, TF motifs) using the ``FindAllMarkers`` function in Seurat.

``activate`` field
^^^^^^^^^^^^^^^^^^

boolean, default ``true``. Specify whether or not to run differential testing.

``groups`` field
^^^^^^^^^^^^^^^^

dict. One entry per group on which to run differential testing. Each group name (key) must be unique.

``cluster_idents`` field
^^^^^^^^^^^^^^^^^^^^^^^^

string, default ``seurat_clusters``. Which Seurat metadata column to use as labels for differential testing. Equivalent to running ``Idents(obj) <- cluster_idents`` before running ``FindAllMarkers(obj)``.

``assay`` field
^^^^^^^^^^^^^^^

string, default ``null``. Which assay to use for differential testing. This value is passed to the ``assay`` argument of the ``FindAllMarkers`` function.

``slot`` field
^^^^^^^^^^^^^^

string, default ``data``. Which slot to pull data from. This value is passed to the ``slot`` argument of the ``FindAllMarkers`` function.

``min_pct`` field
^^^^^^^^^^^^^^^^^

float or ``null``, default ``null``. Only test genes that are detected in a minimum fraction of cells in either of the two populations. If ``null``, a default value of 0.01 is applied. This value is passed to the ``min.pct`` argument of the ``FindAllMarkers`` function.
``test_use`` field
^^^^^^^^^^^^^^^^^^

string, default ``null``. Test used for differential testing. This value is passed to the ``test.use`` argument of the ``FindAllMarkers`` function. If ``null``, the Wilcoxon Rank Sum test is used by default. Available methods are:

- ``wilcox``: Wilcoxon Rank Sum test
- ``wilcox_limma``: Limma implementation of the Wilcoxon Rank Sum test (use this to reproduce results from Seurat v4)
- ``bimod``: Likelihood-ratio test
- ``roc``: ROC analysis
- ``t``: Student's t-test
- ``negbinom``: Negative binomial generalized linear model
- ``poisson``: Poisson generalized linear model
- ``LR``: Logistic regression model
- ``MAST``: MAST framework
- ``DESeq2``: DESeq2 framework (requires the DESeq2 package to be installed in R)

For more details, refer to the ``FindAllMarkers`` function in Seurat.

``latent_vars`` field
^^^^^^^^^^^^^^^^^^^^^

string, default ``null``. Variables to test, used only when ``test_use`` is one of ``LR``, ``negbinom``, ``poisson``, or ``MAST``. This value is passed to the ``latent.vars`` argument of the ``FindAllMarkers`` function.

``alpha`` field
^^^^^^^^^^^^^^^

float, default 0.05. False discovery rate (FDR) threshold to filter significant marker genes.

.. note::

   Only include ``groups`` values specified in the config sections: ``normalize``, ``integrate`` (if applicable) and ``weighted_nn`` (if applicable).

.. warning::

   For the current version of `multiome-wf`, ``LR`` has a bug where it grabs more nodes than allocated on a cluster node. Do not use ``LR`` on a cluster node.

Example:

.. code-block:: yaml

    diff_analysis:
      activate: true
      groups:
        unintegrated_0:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: null
          test_use: null
          latent_vars: null
          alpha: 0.05
        unintegrated_1:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: 0.2
          test_use: null
          latent_vars: 'nCount_Peaks'
          alpha: 0.05
        unintegrated_2:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: 0.2
          test_use: null
          latent_vars: 'nCount_MACS'
          alpha: 0.05
        unintegrated_3:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: 0.2
          test_use: null
          latent_vars: 'nCount_Gene.Activity'
          alpha: 0.05
        integrated_0:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: null
          test_use: null
          latent_vars: null
          alpha: 0.05
        integrated_1:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: 0.2
          test_use: null
          latent_vars: 'nCount_Peaks'
          alpha: 0.05
        integrated_3:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: 0.2
          test_use: null
          latent_vars: 'nCount_Gene.Activity'
          alpha: 0.05
        wnn_0:
          cluster_idents: seurat_clusters
          assay: SCT
          slot: data
          min_pct: null
          test_use: null
          latent_vars: null
          alpha: 0.05
        wnn_1:
          cluster_idents: seurat_clusters
          assay: SCT
          slot: data
          min_pct: null
          test_use: null
          latent_vars: null
          alpha: 0.05

Example
~~~~~~~

A **basic** example of a ``config.yaml`` file using 2 Multiome batches is provided below. The analysis will be performed on all samples with and without integration, followed by clustering and differential testing. This example also includes automated optimization of clustering parameters. See :ref:`overview-wf` for more detailed examples of config files.

.. code-block:: yaml

    samples: config/multiome-config/samples.tsv
    aggregates: config/multiome-config/aggregates.tsv
    assays: config/multiome-config/assays.tsv

    ANNOTATION: "EnsDb"
    ANNO_FILE: "path/to/genes.gtf.gz"

    qc:
      remove_outliers: true
      rm_outliers_method: sd
      meta_labels:
        - nCount_Gene.Expression
        - nCount_Peaks
        - percent.mt
        - TSS.enrichment
      lower:
        nCount_Gene.Expression: 100
        nCount_Peaks: 1000
        TSS.enrichment: 2
      upper: null

    macs2:
      run: "Y"
      group_fragments_by: genome
      fasta: "../reference/genome.fa"
      chromsizes: "../reference/multiome.chromsizes"

    normalize:
      split_by: meta_geno
      groups:
        unintegrated_0:
          assay_name: Gene.Expression
          norm_method: sct
        unintegrated_1:
          assay_name: Peaks
          norm_method: lsi
        unintegrated_2:
          assay_name: MACS
          norm_method: lsi
        unintegrated_3:
          assay_name: Gene.Activity
          norm_method: log

    integrate:
      activate: true
      atac_integrate_embeddings: true
      split_by: meta_geno
      groups:
        integrated_0:
          assay_name: Gene.Expression
          norm_method: sct
          integrate_method: CCAIntegration
          integrate_dims:
            - 1
            - 30
        integrated_1:
          assay_name: Peaks
          norm_method: lsi
          integrate_method: rlsi
          integrate_dims:
            - 1
            - 30
        integrated_3:
          assay_name: Gene.Activity
          norm_method: log
          integrate_method: CCAIntegration
          integrate_dims:
            - 1
            - 30

    dataset_size:
      toydataset: false
      toy_k: 10

    chooser:
      groups:
        unintegrated_0:
          npcs: 25
        unintegrated_1:
          npcs: 20
        unintegrated_2:
          npcs: 20
        unintegrated_3:
          npcs: 20
        integrated_0:
          npcs: 25
        integrated_1:
          npcs: 20
        integrated_2:
          npcs: 20
        integrated_3:
          npcs: 20
      resolutions:
        - 0.6
        - 0.8
        - 1
        - 1.2
        - 1.4
      silhouette:
        - silhouette
        - frequency_grouped
        - silhouette_grouped

    cluster:
      detection_method: 3
      resolution: null

    weighted_nn:
      activate: true
      groups:
        wnn_0:
          input_groups:
            - unintegrated_0  # corresponds to SCT
            - unintegrated_1  # corresponds to Peaks
          reduction:
            - pca
            - lsi
          umap_dims:
            - - 1
              - 25
            - - 1
              - 20
          resolution: 0.6
          detection_method: 3
        wnn_1:
          input_groups:
            - integrated_0  # corresponds to SCT
            - integrated_1  # corresponds to Peaks
          reduction:
            - integrated_pca
            - integrated_lsi
          umap_dims:
            - - 1
              - 25
            - - 1
              - 20
          resolution: 0.6
          detection_method: 3

    diff_analysis:
      activate: true
      groups:
        unintegrated_0:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: null
          test_use: null
          latent_vars: null
          alpha: 0.05
        unintegrated_1:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: 0.2
          test_use: null
          latent_vars: 'nCount_Peaks'
          alpha: 0.05
        unintegrated_2:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: 0.2
          test_use: null
          latent_vars: 'nCount_MACS'
          alpha: 0.05
        unintegrated_3:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: 0.2
          test_use: null
          latent_vars: 'nCount_Gene.Activity'
          alpha: 0.05
        integrated_0:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: null
          test_use: null
          latent_vars: null
          alpha: 0.05
        integrated_1:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: 0.2
          test_use: null
          latent_vars: 'nCount_Peaks'
          alpha: 0.05
        integrated_3:
          cluster_idents: seurat_clusters
          assay: null
          slot: data
          min_pct: 0.2
          test_use: null
          latent_vars: 'nCount_Gene.Activity'
          alpha: 0.05
        wnn_0:
          cluster_idents: seurat_clusters
          assay: SCT
          slot: data
          min_pct: null
          test_use: null
          latent_vars: null
          alpha: 0.05
        wnn_1:
          cluster_idents: seurat_clusters
          assay: SCT
          slot: data
          min_pct: null
          test_use: null
          latent_vars: null
          alpha: 0.05
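
The group-naming rules described on this page can be checked before launching Snakemake. The following is a minimal sketch and is not part of `multiome-wf`: the helper names ``section_groups``, ``is_active`` and ``check_groups`` are hypothetical, and the code assumes the config has already been parsed into a Python dictionary with the section layout shown above. It reports unimodal groups that lack a ``chooser`` entry, ``weighted_nn`` inputs that refer to undefined groups, and ``diff_analysis`` groups that are not defined upstream.

.. code-block:: python

    def section_groups(config, section):
        """Return the set of group names defined in one config section."""
        return set((config.get(section) or {}).get("groups") or {})


    def is_active(config, section):
        """Return True when the section's ``activate`` key is truthy."""
        return bool((config.get(section) or {}).get("activate", False))


    def check_groups(config):
        """Return a list of messages describing inconsistent group names."""
        problems = []

        # Groups produced by normalization and, if active, integration.
        unimodal = section_groups(config, "normalize")
        if is_active(config, "integrate"):
            unimodal |= section_groups(config, "integrate")

        # Every unimodal group needs an entry in the chooser section.
        for name in sorted(unimodal - section_groups(config, "chooser")):
            problems.append(f"chooser: missing entry for group '{name}'")

        # WNN inputs must refer to existing unimodal groups.
        known = set(unimodal)
        if is_active(config, "weighted_nn"):
            wnn = (config.get("weighted_nn") or {}).get("groups") or {}
            for wnn_name, wnn_cfg in wnn.items():
                for name in wnn_cfg.get("input_groups", []):
                    if name not in unimodal:
                        problems.append(
                            f"weighted_nn: '{wnn_name}' uses unknown group '{name}'"
                        )
            known |= set(wnn)

        # diff_analysis may only list groups defined upstream.
        for name in sorted(section_groups(config, "diff_analysis") - known):
            problems.append(f"diff_analysis: group '{name}' is not defined upstream")

        return problems


    # Illustrative config fragment (not a complete config.yaml).
    example = {
        "normalize": {"groups": {"unintegrated_0": {}, "unintegrated_1": {}}},
        "integrate": {"activate": False, "groups": {"integrated_0": {}}},
        "chooser": {"groups": {"unintegrated_0": {"npcs": 25}}},
        "weighted_nn": {"activate": False},
        "diff_analysis": {"groups": {"unintegrated_0": {}, "integrated_0": {}}},
    }

    for message in check_groups(example):
        print(message)

With integration deactivated in this fragment, the sketch flags the missing ``chooser`` entry for ``unintegrated_1`` and the leftover ``integrated_0`` group in ``diff_analysis``. The ``chooser`` check is applied unconditionally here; a stricter variant could apply it only when ``resolution`` in the ``cluster`` section is ``null``.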