Overview of input files

multiome-wf was built around 10X Genomics data structures, but has been generalized to support most single-cell sequencing experiments. Here we review how matrices generated by 10X Genomics pipelines are organized. The subtle differences between 10X Genomics pipelines have important implications for setting up config files, especially samples tables and aggregates tables.

10X Genomics

All 10X Genomics products utilize droplet encapsulation of cells. Within the droplet, two identifying oligo sequences are added to the cDNA or gDNA fragments: 1) a unique oligo for each cell, called a cell barcode or 10x barcode; and 2) a unique oligo for each library, called a sample index.

As an example, consider cells 1, 2 and 3 from sample A. The cell barcodes and the sample index map as follows:

cell	10x barcode
1	AAAAAATTTTTT
2	TTTTTTAAAAAA
3	ATATATATATAT

library	sample index
A	1

Combining the cell and library dictionary values, we get the following unique IDs for cells 1, 2 and 3.

cell	ID
1	AAAAAATTTTTT-1
2	TTTTTTAAAAAA-1
3	ATATATATATAT-1
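
The combination above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not code from multiome-wf or cellranger; the dictionary names and the `make_cell_id` helper are hypothetical.

```python
# Hypothetical lookup tables mirroring the example tables above.
cell_barcodes = {1: "AAAAAATTTTTT", 2: "TTTTTTAAAAAA", 3: "ATATATATATAT"}
sample_index = {"A": 1}


def make_cell_id(barcode: str, index: int) -> str:
    """Join a 10x barcode and a sample index with a hyphen."""
    return f"{barcode}-{index}"


# Build the ID of every cell in library A.
cell_ids = {
    cell: make_cell_id(barcode, sample_index["A"])
    for cell, barcode in cell_barcodes.items()
}

for cell, cell_id in cell_ids.items():
    print(cell, cell_id)
```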

What happens if we combine more libraries in a single experiment? There is a finite number of unique 10x barcodes that can be used for cells, so 10X Genomics recycles the same 10x barcodes for each library, but uses a different sample index.

As an example, consider cells 1, 2 and 3 from sample A and cells 1, 2 and 3 from sample B. Both libraries reuse the same 10x barcodes, but each library has its own sample index:

Sample A:

cell	10x barcode
1	AAAAAATTTTTT
2	TTTTTTAAAAAA
3	ATATATATATAT

library	sample index
A	1

Sample B:

cell	10x barcode
1	AAAAAATTTTTT
2	TTTTTTAAAAAA
3	ATATATATATAT

library	sample index
B	2

Combining the cell and library dictionary values, we get the following unique IDs for cells 1, 2 and 3 of each library.

cell	ID
1	AAAAAATTTTTT-1
2	TTTTTTAAAAAA-1
3	ATATATATATAT-1
1	AAAAAATTTTTT-2
2	TTTTTTAAAAAA-2
3	ATATATATATAT-2

There is now a unique mapping of transcripts (or other counts) to cell-library tuples, even though the individual 10x barcodes are not unique.
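
A small check makes this concrete: the bare barcodes collide across libraries, while the suffixed IDs do not. The sketch below assumes the two-library example above; the variable names are illustrative only.

```python
from collections import Counter

# The same three 10x barcodes are reused in both libraries.
barcodes = ["AAAAAATTTTTT", "TTTTTTAAAAAA", "ATATATATATAT"]
library_suffix = {"A": 1, "B": 2}  # library id -> sample index / library count

# Every (library, barcode) pair observed in the experiment.
observations = [(lib, bc) for lib in library_suffix for bc in barcodes]

# Barcodes alone are ambiguous: each one occurs once per library.
barcode_counts = Counter(bc for _, bc in observations)

# Appending the library suffix removes the ambiguity.
full_ids = [f"{bc}-{library_suffix[lib]}" for lib, bc in observations]

print(dict(barcode_counts))
print(len(full_ids) == len(set(full_ids)))
```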

The approaches above are how cellranger pipelines label count data. In the case of cellranger count, all the cell-library strings end with “1”. In the case of cellranger aggr, each cell-library string ends with the library count that was assigned when the libraries were combined. The library count is simply the 1-based row position at which a library id is listed in the CSV file passed to cellranger aggr --csv=aggregates.csv <more_parameters>.

$ cat aggregates.csv
sample_id,molecule_h5
WT,WT/outs/molecule_info.h5
Het,Het/outs/molecule_info.h5
KO,KO/outs/molecule_info.h5
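
With this file, cells from WT end in "-1", cells from Het in "-2" and cells from KO in "-3". The sketch below recovers that mapping from the CSV and uses it to look up the sample of a given cell ID. It is a minimal illustration using only the Python standard library; the helper names `suffix_to_sample` and `sample_for_cell` are hypothetical and are not part of multiome-wf or cellranger.

```python
import csv
import io

# Contents of the aggregates.csv shown above.
AGGREGATES_CSV = """\
sample_id,molecule_h5
WT,WT/outs/molecule_info.h5
Het,Het/outs/molecule_info.h5
KO,KO/outs/molecule_info.h5
"""


def suffix_to_sample(csv_text: str) -> dict[int, str]:
    """Map the 1-based data row number to the sample_id on that row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {n: row["sample_id"] for n, row in enumerate(reader, start=1)}


def sample_for_cell(cell_id: str, lookup: dict[int, str]) -> str:
    """Return the sample a cell ID such as 'AAAAAATTTTTT-2' belongs to."""
    _barcode, _, suffix = cell_id.rpartition("-")
    return lookup[int(suffix)]


lookup = suffix_to_sample(AGGREGATES_CSV)
print(lookup)
print(sample_for_cell("AAAAAATTTTTT-2", lookup))
```

Because the suffix depends only on row order, reordering the rows of the CSV changes which suffix each library receives.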

Replicates

Biological replicates

Biological replicates are generated by sequencing independent animals sampled under the same experimental conditions, such as genotype, tissue, and age. Biological replicates make it possible to capture true biological variability.

Technical replicates

Technical replicates are generated from multiple sequencing runs on the same biological replicate. This can involve multiple flowcells, multiple chips, or multiple libraries prepared from a single biological replicate. Technical replicates are used to assess technical artifacts arising from sequencing steps and/or to increase sequencing coverage.

Non-10X Genomics

Non-10X Genomics datasets are also compatible with multiome-wf. Here’s an example downloaded from GSE239808, which was sequenced using Smart-seq2:

$ tree smartseq2/
smartseq2/
├── rep1
│   ├── GSM7674039_HFD_1_barcodes.tsv.gz
│   ├── GSM7674039_HFD_1_features.tsv.gz
│   └── GSM7674039_HFD_1_matrix.mtx.gz
└── rep2
    ├── GSM7674040_HFD_2_barcodes.tsv.gz
    ├── GSM7674040_HFD_2_features.tsv.gz
    └── GSM7674040_HFD_2_matrix.mtx.gz

Note that multiome-wf requires the following files to be in the same directory:

barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz
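
Before pointing a config at a directory, it can help to confirm that all three files are present. The sketch below is a hypothetical pre-flight check, not part of multiome-wf. It matches files by suffix so that prefixed names such as GSM7674039_HFD_1_matrix.mtx.gz are recognized; whether multiome-wf itself accepts prefixed file names is an assumption that should be verified against its documentation.

```python
import tempfile
from pathlib import Path

REQUIRED_SUFFIXES = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")


def missing_matrix_files(sample_dir: Path) -> list[str]:
    """Return the required suffixes that no file in sample_dir ends with."""
    names = [p.name for p in sample_dir.iterdir() if p.is_file()]
    return [
        suffix
        for suffix in REQUIRED_SUFFIXES
        if not any(name.endswith(suffix) for name in names)
    ]


# Demonstration on a temporary directory with empty placeholder files;
# matrix.mtx.gz is deliberately left out.
with tempfile.TemporaryDirectory() as tmp:
    rep1 = Path(tmp) / "rep1"
    rep1.mkdir()
    for suffix in REQUIRED_SUFFIXES[:2]:
        (rep1 / f"GSM7674039_HFD_1_{suffix}").touch()
    demo_missing = missing_matrix_files(rep1)

print(demo_missing)
```

An empty list means the directory contains all three files.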