Bulk RNA-seq analysis
RNA-seq measures which genes are transcribed and at what levels. We typically distinguish between bulk RNA-seq (measuring transcripts from many cells mixed together) and single-cell RNA-seq (keeping track of which cell each transcript measurement came from). This page is about bulk RNA-seq.
BSPC’s adaptation of Harvard Bioinformatics Core training materials
NICHD BSPC’s RNA-seq training is an NIH-specific adaptation of the original RNA-seq workshops from the Harvard Bioinformatics Core training materials.
These lessons walk you through using bash on NIH’s Biowulf cluster to do your own basic RNA-seq analysis.
After completing these materials, you will be able to run arbitrary RNA-seq analyses, and you may be ready to learn Snakemake and either write your own workflows or use something like lcdb-wf.
Skills for RNA-seq
This gives a very rough sense of what beginner/intermediate RNA-seq skills might look like:
Level 1
- Be able to describe the lines and fields of the following formats: FASTQ, FASTA, BAM, GTF 
- Be able to compare and contrast the same formats (what information is included/missing; how are they created; when you would use them) 
- Locate and prepare reference data (including aligner/pseudoaligner indexes) 
- Run FastQC on FASTQ and BAM files 
- Align reads with STAR, HISAT2, or another aligner and quantify reads with featureCounts 
- Quantify reads with Kallisto or Salmon 
- Import a counts table into R 
- Run basic 2-condition DESeq2 differential expression analysis 
- Find the most highly-changed genes 
- Plot an MA plot or volcano plot 
- Export results to TSV or Excel 
- Do functional enrichment analysis to find pathways enriched in changed genes 
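As an illustration of the first skill in this list, here is a minimal Python sketch of what "the lines and fields" of a FASTQ record are. It assumes the common unwrapped layout of four lines per record (header, sequence, separator, quality) and Phred+33 quality encoding. The function names read_fastq and phred_scores are made up for this example and do not come from any particular library.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from an unwrapped FASTQ file.

    Assumes exactly four lines per record:
      1. header starting with '@'
      2. the read sequence
      3. a separator starting with '+'
      4. one quality character per base
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


# Toy example with a single in-memory record
example = io.StringIO("@read1\nACGT\n+\nIIII\n")
for name, seq, qual in read_fastq(example):
    print(name, seq, phred_scores(qual))
```

For real data, an established parser is a better choice than hand-rolled code; the point here is only to show which information each line carries.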
Level 2
- Run QC using multiple tools and checks (preseq, rRNA contamination, Picard, RSeQC, MultiQC) 
- Interpret QC reports to make suggestions for future bench work 
- Interpret raw p-value histograms 
- Be able to describe what fold-change shrinkage is doing 
- Be able to describe how DESeq2 handles low counts 
- Work with more complex experimental designs (batch effects, interaction terms) in DESeq2 and explain the results 
- Visualize data (BAM, bigWig) in a genome browser 
- Use automated, reproducible workflows (Snakemake, Nextflow, etc.) 
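For the "raw p-value histograms" skill above, the following Python sketch bins a list of raw p-values and prints a text histogram. The general idea is that p-values from genes with no true effect should be roughly uniform between 0 and 1, so a well-behaved test tends to show a flat histogram with an extra peak near zero. The function name pvalue_histogram and the toy values are assumptions made for illustration; in practice the p-values would come from your differential expression results.

```python
def pvalue_histogram(pvalues, n_bins=20):
    """Count p-values in n_bins equal-width bins covering [0, 1].

    Missing values (None or NaN) are skipped.
    """
    counts = [0] * n_bins
    for p in pvalues:
        if p is None or p != p:  # p != p is True only for NaN
            continue
        # A p-value of exactly 1.0 goes in the last bin
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


# Toy data: a few small p-values plus values spread across [0, 1]
toy_pvalues = [0.001, 0.003, 0.01, 0.02, 0.04, 0.15, 0.35, 0.5, 0.72, 0.9, 1.0]
n_bins = 10
for i, count in enumerate(pvalue_histogram(toy_pvalues, n_bins=n_bins)):
    lower = i / n_bins
    print(f"{lower:.1f}-{lower + 1 / n_bins:.1f} {'#' * count}")
```

Shapes other than "flat with a peak near zero" (for example, a peak near one or a U shape) are usually worth investigating before trusting the adjusted p-values.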
Other resources
The remainder of this page goes into some more detail on various aspects of RNA-seq analysis, to be used as supplemental material.
- The introductory paper Hitchhiker’s guide to expression analysis is co-authored by many big names in the field and gives a great overview and history. 
- The DESeq2 paper, while going over the details of the popular differential expression algorithm, is very approachable even for readers without much math, statistics, or algorithms background. 
- Especially when doing in vitro research with cell lines, it’s important to think about what a replicate really is. This blog post is a good discussion of the difference between technical replicates and biological replicates in vitro. 
- A library may be stranded, reverse stranded, or unstranded depending on what kit was used for the library prep. These figures help visualize the different strand-specific protocols. If you’re unsure, RSeQC’s infer_experiment.py can help you figure it out given a BAM and a BED file of genes. The Griffith lab’s post on strandedness also includes which library prep kits result in which kinds of libraries. 
- We base part of our RNA-seq template on the Bioconductor RNA-seq workflow, which shows all the steps of RNA-seq from within R. 
- In BSPC, we develop and maintain lcdb-wf, which automates much of RNA-seq, and provides an extensive RMarkdown file for downstream analysis. 
- The DESeq2 vignette is the authoritative source on how to use DESeq2. 
- The DESeq2 paper is very well written, and describes how DESeq2 actually works. 
- A nice treatment of interaction terms, along with plots to help understand what’s being tested. 
- For complex experimental designs, this tutorial shows an elegant, general method for creating the proper contrasts. 
- You may have heard the terms RPKM, FPKM, RPM, and TPM. Which to use? Short answer: use TPM. Longer answer, with figures, is here: https://ro-che.info/articles/2016-11-28-rna-seq-normalization. The accompanying slides are useful for discussion.
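To make the difference between these units concrete, here is a small Python sketch that computes RPKM and TPM using their usual definitions, from toy read counts and gene lengths. The function names and the numbers are invented for illustration; quantifiers such as Salmon or kallisto report TPM directly, so this is only meant to show the arithmetic.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene per million mapped reads.

    Normalizes each count by gene length (in kb) and by the total
    number of reads in the sample (in millions).
    """
    total_reads = sum(counts)
    if total_reads == 0:
        raise ValueError("no reads to normalize")
    return [
        count / (length / 1e3) / (total_reads / 1e6)
        for count, length in zip(counts, lengths_bp)
    ]


def tpm(counts, lengths_bp):
    """Transcripts per million.

    First divides each count by gene length, then rescales so the
    values in one sample sum to one million.
    """
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    if total_rate == 0:
        raise ValueError("no reads to normalize")
    return [rate / total_rate * 1e6 for rate in rates]


# Toy sample: three genes with different lengths (hypothetical numbers)
counts = [100, 200, 700]
lengths_bp = [1000, 4000, 2000]

print("RPKM:", [round(x, 1) for x in rpkm(counts, lengths_bp)])
print("TPM: ", [round(x, 1) for x in tpm(counts, lengths_bp)])
print("TPM sum:", round(sum(tpm(counts, lengths_bp))))
```

Because TPM values sum to the same total in every sample, each value can be read as a share of that sample's transcripts, whereas the sum of RPKM values differs from sample to sample.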