Overview of input files
multiome-wf was built around 10X Genomics data structures, but has been generalized to support most single cell sequencing experiments. Here we review how matrices generated by 10X Genomics pipelines are organized. The subtle differences between 10X Genomics pipelines have important implications for setting up config files, especially samples tables and aggregates tables.
10X Genomics
All 10X Genomics products utilize droplet encapsulation of cells. Within the droplet, two ID oligo sequences are added to the cDNA or gDNA fragments: 1) a unique oligo for each cell, called a cell barcode or 10x barcode; and 2) a unique oligo for each library, called a sample index.
If we consider an example: cells 1, 2 and 3 from sample A, then we can map the cell and sample indexes as follows:
| cell | 10x barcode |
|---|---|
| 1 | AAAAAATTTTTT |
| 2 | TTTTTTAAAAAA |
| 3 | ATATATATATAT |

| library | sample index |
|---|---|
| A | 1 |
Combining the cell and library dictionary values, we get the following unique IDs for cells 1, 2 and 3.
| cell | ID |
|---|---|
| 1 | AAAAAATTTTTT-1 |
| 2 | TTTTTTAAAAAA-1 |
| 3 | ATATATATATAT-1 |
What happens if we combine more libraries in a single experiment? There is a finite number of unique 10x barcodes that can be used for cells, so 10X Genomics recycles the same 10x barcodes for each library, but uses a different sample index.
If we consider an example: cells 1, 2 and 3 from sample A and cells 1, 2 and 3 from sample B, then we can map the cell and sample indexes as follows:
| cell | 10x barcode |
|---|---|
| 1 | AAAAAATTTTTT |
| 2 | TTTTTTAAAAAA |
| 3 | ATATATATATAT |
library id: sample index {A: GGGCCC}
| library | sample index |
|---|---|
| A | 1 |
| cell | 10x barcode |
|---|---|
| 1 | AAAAAATTTTTT |
| 2 | TTTTTTAAAAAA |
| 3 | ATATATATATAT |
library id: sample index {B: GGGCCC}
library id: library count {B: 2}
Combining the cell and library dictionary values, we get the following unique IDs for cells 1, 2 and 3 of each library.
| cell | ID |
|---|---|
| 1 | AAAAAATTTTTT-1 |
| 2 | TTTTTTAAAAAA-1 |
| 3 | ATATATATATAT-1 |
| 1 | AAAAAATTTTTT-2 |
| 2 | TTTTTTAAAAAA-2 |
| 3 | ATATATATATAT-2 |
There is now a unique mapping of transcripts (or other counts) to cell-library tuples, even though the individual 10x barcodes are not unique.
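The uniqueness argument can be checked with a minimal Python sketch. The barcodes and library counts are taken from the tables above; the helper name `make_cell_ids` is hypothetical and is not part of multiome-wf or cellranger.

```python
from collections import Counter

# 10x barcodes are reused across libraries (values from the tables above).
barcodes = ["AAAAAATTTTTT", "TTTTTTAAAAAA", "ATATATATATAT"]

# Library counts as used in the example: library A -> 1, library B -> 2.
library_counts = {"A": 1, "B": 2}


def make_cell_ids(barcodes, library_count):
    """Hypothetical helper: append the library count to each barcode."""
    return [f"{barcode}-{library_count}" for barcode in barcodes]


cell_ids = []
for library, count in library_counts.items():
    cell_ids.extend(make_cell_ids(barcodes, count))

# Bare barcodes collide across the two libraries ...
bare_barcodes = barcodes * len(library_counts)
assert max(Counter(bare_barcodes).values()) == 2

# ... but the barcode-count strings are all distinct.
assert len(set(cell_ids)) == len(cell_ids) == 6
print(cell_ids)
```

The suffix is what makes each ID unique, so any downstream step that strips it would merge cells from different libraries.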
The approaches above are how cellranger pipelines label count data. In the case of cellranger count, all the cell-library strings end with “1”. In the case of cellranger aggr, each cell-library string ends with the library count that was assigned while combining libraries. The library count is simply a 1-based running count of the row in which a library ID is entered in the CSV file passed to cellranger aggr --csv=aggregates.csv <more_parameters>. In the example below, cells from WT, Het and KO end with “-1”, “-2” and “-3”, respectively.
$ cat aggregates.csv
sample_id,molecule_h5
WT,WT/outs/molecule_info.h5
Het,Het/outs/molecule_info.h5
KO,KO/outs/molecule_info.h5
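As an illustration of the row-order rule, the following Python sketch derives the library count for each row of the aggregates.csv shown above. The function name `library_suffixes` is hypothetical; the sketch only mirrors the 1-based counting described in the text and does not call cellranger.

```python
import csv
import io

# Contents of the aggregates.csv shown above.
AGGREGATES_CSV = """\
sample_id,molecule_h5
WT,WT/outs/molecule_info.h5
Het,Het/outs/molecule_info.h5
KO,KO/outs/molecule_info.h5
"""


def library_suffixes(csv_text):
    """Hypothetical helper: map each sample_id to its 1-based row number."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["sample_id"]: i for i, row in enumerate(reader, start=1)}


suffixes = library_suffixes(AGGREGATES_CSV)
print(suffixes)  # {'WT': 1, 'Het': 2, 'KO': 3}
```

Because the count depends only on row order, reordering the rows of the CSV changes which suffix each library receives.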
Replicates
Biological replicates
Biological replicates are generated by sequencing independent animals sampled under the same experimental conditions, such as genotype, tissue, and age. Biological replicates make it possible to capture true biological variability.
Technical replicates
Technical replicates are generated from multiple sequencing runs on the same biological replicate. This can involve multiple flowcells, multiple chips, or multiple libraries prepared from a single biological replicate. Technical replicates are used to assess technical artifacts arising from sequencing steps and/or to increase sequencing coverage.
Non-10X Genomics
Non-10X Genomics datasets are also compatible with multiome-wf. Here’s an example downloaded from GSE239808, which was sequenced using Smart-seq2:
$ tree smartseq2/
smartseq2/
├── rep1
│   ├── GSM7674039_HFD_1_barcodes.tsv.gz
│   ├── GSM7674039_HFD_1_features.tsv.gz
│   └── GSM7674039_HFD_1_matrix.mtx.gz
└── rep2
    ├── GSM7674040_HFD_2_barcodes.tsv.gz
    ├── GSM7674040_HFD_2_features.tsv.gz
    └── GSM7674040_HFD_2_matrix.mtx.gz
Note that multiome-wf requires the following files to be in the same directory:
- barcodes.tsv.gz
- features.tsv.gz
- matrix.mtx.gz
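
Before running the workflow, it can help to confirm that each replicate directory holds all three files. The Python sketch below is one way to do that; the function `missing_matrix_files` is hypothetical, and matching by filename suffix (so that prefixed names such as those in the GEO example count) is an assumption, not documented multiome-wf behavior.

```python
import tempfile
from pathlib import Path

# The three files listed above; matched as filename suffixes (assumption).
REQUIRED_SUFFIXES = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")


def missing_matrix_files(directory):
    """Hypothetical helper: return the required suffixes with no matching file."""
    names = [p.name for p in Path(directory).iterdir() if p.is_file()]
    return [
        suffix
        for suffix in REQUIRED_SUFFIXES
        if not any(name.endswith(suffix) for name in names)
    ]


# Demo on a temporary directory that lacks the matrix file.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "GSM7674039_HFD_1_barcodes.tsv.gz",
        "GSM7674039_HFD_1_features.tsv.gz",
    ):
        (Path(tmp) / name).touch()
    print(missing_matrix_files(tmp))  # ['matrix.mtx.gz']
```

An empty list means the directory contains all three files.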