Bulk RNA-seq analysis

RNA-seq measures which genes are transcribed. We typically differentiate between bulk RNA-seq (measuring transcripts from many cells mixed together) and single-cell RNA-seq (keeping track of which transcript measurements came from which cell). This page is about bulk RNA-seq.

BSPC’s adaptation of Harvard Bioinformatics Core training materials

NICHD BSPC’s RNA-seq training is an NIH-specific adaptation of the original RNA-seq workshops from the Harvard Bioinformatics Core training materials.

These lessons walk you through using bash on NIH’s Biowulf cluster to do your own basic RNA-seq analysis.

After completing these materials, you should be able to run your own RNA-seq analyses, and you may be ready to learn Snakemake and either write your own workflows or use an existing one such as lcdb-wf.

Skills for RNA-seq

This gives a very rough sense of what beginner/intermediate RNA-seq skills might look like:

Level 1

  • Be able to describe the lines and fields of the following formats: FASTQ, FASTA, BAM, GTF

  • Be able to compare and contrast the same formats (what information is included/missing; how are they created; when you would use them)

  • Locate and prepare reference data (including aligner/pseudoaligner indexes)

  • Run FastQC on FASTQ and BAM files

  • Align reads with STAR, HISAT2, or another aligner, then quantify reads with featureCounts

  • Quantify reads with Kallisto or Salmon

  • Import a counts table into R

  • Run basic 2-condition DESeq2 differential expression analysis

  • Find the most highly changed genes

  • Plot an MA plot or volcano plot

  • Export results to TSV or Excel

  • Do functional enrichment analysis to find pathways enriched in changed genes
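
As a small illustration of the first Level 1 item, the sketch below reads FASTQ records in Python. It assumes the common four-line layout (a header starting with "@", the sequence, a separator starting with "+", and a quality string of the same length as the sequence) and Phred+33 quality encoding. The names FastqRecord and parse_fastq and the two example reads are made up for this illustration; real analyses would normally use an established parser.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str              # header line without the leading "@"
    sequence: str          # base calls
    qualities: List[int]   # per-base Phred scores (assumes Phred+33)


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in {header!r}")
        # Phred+33: subtract 33 from each character's ASCII code.
        scores = [ord(char) - 33 for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Two invented reads, held in memory so the example is self-contained.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n"
records = list(parse_fastq(io.StringIO(example)))
for record in records:
    print(record.name, record.sequence, record.qualities)
```

Comparing this with FASTA (no quality line) or BAM (alignment position, flags, and mapping quality in addition to sequence and base qualities) is a useful way to practice the "compare and contrast" item.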

Level 2

  • Run QC using multiple tools (preseq, rRNA contamination, Picard, RSeQC, MultiQC)

  • Interpret QC reports to make suggestions for future bench work

  • Interpret raw p-value histograms

  • Be able to describe what fold-change shrinkage is doing

  • Be able to describe how DESeq2 handles low counts

  • Work with more complex experimental designs (batch effects, interaction terms) in DESeq2 and explain the results

  • Visualize data (BAM, bigWig) in a genome browser

  • Use automated, reproducible workflows (Snakemake, Nextflow, etc.)
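
To make the p-value histogram item more concrete, here is a minimal Python sketch. It relies on the fact that p-values from true null hypotheses are uniformly distributed, so a well-behaved histogram looks flat with an extra peak near zero from genes that really change. The p-values below are invented and deterministic (80 evenly spread "null" values plus 20 very small ones), and the helper names are hypothetical. The estimate of the null fraction uses the flat right-hand half of the histogram, in the style of Storey's estimator with a cutoff of 0.5.

```python
from typing import List


def histogram(pvalues: List[float], n_bins: int = 10) -> List[int]:
    """Count p-values in equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[index] += 1
    return counts


def estimate_null_fraction(pvalues: List[float], cutoff: float = 0.5) -> float:
    """Estimate the fraction of true nulls from p-values above the cutoff."""
    above = sum(1 for p in pvalues if p > cutoff)
    return above / (len(pvalues) * (1.0 - cutoff))


# Invented data: 80 evenly spread values (stand-in for nulls) plus 20 small ones.
null_like = [(i + 0.5) / 80 for i in range(80)]
signal_like = [0.001] * 20
pvalues = null_like + signal_like

bin_counts = histogram(pvalues)
null_fraction = estimate_null_fraction(pvalues)

for i, count in enumerate(bin_counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f} {'#' * count}")
print("estimated fraction of nulls:", null_fraction)
```

A histogram that is not flat on the right (for example, one that rises toward 1, or has a hump in the middle) is a hint that something about the model or the data deserves a closer look.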

Other resources

The remainder of this page goes into some more detail on various aspects of RNA-seq analysis, to be used as supplemental material.

  • The introductory paper Hitchhiker’s guide to expression analysis is co-authored by many big names in the field and gives a great overview and history.

  • The DESeq2 paper, while going over the details of the popular differential expression algorithm, is very approachable even for readers without much background in math, statistics, or algorithms.

  • Especially when doing in vitro research with cell lines, it’s important to think about what a replicate really is. This blog post is a good discussion of the difference between technical replicates and biological replicates in vitro.

  • A library may be stranded, reverse stranded, or unstranded depending on what kit was used for the library prep. These figures help visualize the different strand-specific protocols. If you’re unsure, RSeQC’s infer_experiment.py can help you figure it out given a BAM and a BED file of genes. The Griffith lab’s post on strandedness also includes which library prep kits result in which kinds of libraries.

  • We base part of our RNA-seq template on the Bioconductor RNA-seq workflow, which shows all the steps of RNA-seq from within R.

  • In BSPC, we develop and maintain lcdb-wf, which automates much of RNA-seq, and provides an extensive RMarkdown file for downstream analysis.

  • The DESeq2 vignette is the authoritative source on how to use DESeq2.

  • The DESeq2 paper is very well written and describes how DESeq2 actually works.

  • A nice treatment of interaction terms, along with plots to help understand what’s being tested.

  • For complex experimental designs, this tutorial shows an elegant, general method for creating the proper contrasts.

  • You may have heard the terms RPKM, FPKM, RPM, and TPM. Which to use? Short answer: use TPM. Longer answer, with figures, is here: https://ro-che.info/articles/2016-11-28-rna-seq-normalization. The accompanying slides are useful for discussion.
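
The difference between these units can be sketched with a little arithmetic. The Python example below uses the standard definitions: RPKM divides a gene's read count by its length in kilobases and by the total reads in millions, while TPM first divides each count by gene length and then rescales those rates to sum to one million. The gene names, lengths, and counts are invented for illustration, and the functions are a sketch rather than the implementation used by any particular tool (which typically use effective lengths).

```python
from typing import Dict


def rpkm(counts: Dict[str, float], lengths_kb: Dict[str, float]) -> Dict[str, float]:
    """Reads per kilobase of gene per million reads in the sample."""
    millions = sum(counts.values()) / 1e6
    return {gene: counts[gene] / (lengths_kb[gene] * millions) for gene in counts}


def tpm(counts: Dict[str, float], lengths_kb: Dict[str, float]) -> Dict[str, float]:
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


# Invented toy data: three genes, two samples.
lengths_kb = {"geneA": 2.0, "geneB": 4.0, "geneC": 1.0}
sample1 = {"geneA": 10, "geneB": 20, "geneC": 10}
sample2 = {"geneA": 10, "geneB": 20, "geneC": 70}

tpm1, tpm2 = tpm(sample1, lengths_kb), tpm(sample2, lengths_kb)
rpkm1, rpkm2 = rpkm(sample1, lengths_kb), rpkm(sample2, lengths_kb)

# TPM values sum to the same total in every sample; RPKM totals can differ.
print("TPM totals: ", sum(tpm1.values()), sum(tpm2.values()))
print("RPKM totals:", sum(rpkm1.values()), sum(rpkm2.values()))
```

Because TPM values sum to the same total in every sample, a gene's TPM can be read as its share of the sample's transcripts. RPKM totals vary from sample to sample, which makes RPKM values harder to compare across samples.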