edgeR

edgeR: empirical analysis of DGE in R

cite: Mark D. Robinson, Davis J. McCarthy, Gordon K. Smyth, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics, Volume 26, Issue 1, 1 January 2010, Pages 139–140, https://doi.org/10.1093/bioinformatics/btp616

  • An overdispersed Poisson model is used to account for both biological and technical variability.
  • Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference.
  • The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated.

Why edgeR

  • For microarrays, the abundance of a particular transcript is measured as a fluorescence intensity, which is effectively a continuous response.
  • For digital gene expression (DGE) data, the abundance is observed as a count.
  • Therefore, procedures that are successful for microarray data are not directly applicable to DGE data.
  • edgeR is designed for the analysis of replicated count-based expression data and is an implementation of methodology developed by Robinson and Smyth[1][2].
  • It was initially developed for serial analysis of gene expression (SAGE), but the methodology applies to count data in general.
    As a result, edgeR may also be useful in other experiments that generate counts, such as ChIP-seq, proteomics experiments where spectral counts are used to summarize peptide abundance[3], or barcoding experiments where several species are counted[4].

Digital gene expression: Digital gene expression (DGE) is a sequence-based approach to gene expression analysis that generates a digital output at an unparalleled level of sensitivity[5].

Serial analysis of gene expression (SAGE): Serial analysis of gene expression, or SAGE, is an experimental technique designed to gain a direct and quantitative measure of gene expression. The SAGE method is based on the isolation of unique sequence tags (9-10 bp in length) from individual mRNAs and concatenation of tags serially into long DNA molecules for a lump-sum sequencing[6].


Method

The same strategy of sharing information across genes is applied in limma (Smyth, 2004), where an empirical Bayes model is used to moderate the probe-wise variances.
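
As a rough illustration of what "moderating" a variance means, limma's moderated variance is a weighted average of the per-probe sample variance and a prior variance, with the degrees of freedom as weights. The sketch below assumes the prior variance and prior degrees of freedom are already known (limma estimates them from all probes); the function name and all numbers are made up for illustration.

```python
def moderated_variance(s2, d, s0_2, d0):
    """Shrink a per-gene sample variance s2 (d residual degrees of freedom)
    toward a prior variance s0_2 (d0 prior degrees of freedom)."""
    return (d0 * s0_2 + d * s2) / (d0 + d)

# Hypothetical prior values; in limma these are estimated from the data.
prior_var, prior_df = 0.10, 4.0

# Hypothetical per-gene sample variances, each with 2 residual degrees of freedom.
gene_vars = [0.01, 0.10, 0.90]
residual_df = 2

moderated = [
    moderated_variance(s2, residual_df, prior_var, prior_df) for s2 in gene_vars
]
for raw, mod in zip(gene_vars, moderated):
    print(f"raw variance {raw:.2f} -> moderated variance {mod:.3f}")
```

Very small and very large variances are both pulled toward the prior value, which is what stabilizes inference when there are few replicates.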

In edgeR:

  • We assume the data can be summarized into a table of counts.
  • We model the data as negative binomial (NB) distributed:

$$
Y_{gi} \sim \mathrm{NB}(M_i p_{gj}, \phi_g)
$$

For gene $g$ and sample $i$:

  • $M_i$ is the library size (total number of reads),
  • $\phi_g$ is the dispersion,
  • $p_{gj}$ is the relative abundance of gene $g$ in the experimental group $j$ to which sample $i$ belongs.
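
To make the notation concrete, here is a minimal sketch that computes library sizes, a simple pooled estimate of the relative abundance, and the resulting expected counts for a toy table. The count values, gene names, and group labels are invented, and the pooled estimate is only one simple choice, not necessarily how edgeR estimates $p_{gj}$ internally.

```python
# Toy count table: rows are genes, columns are samples (values are made up).
counts = {
    "geneA": [10, 12, 30, 33],
    "geneB": [90, 108, 70, 77],
}
groups = ["ctrl", "ctrl", "treat", "treat"]  # group j of each sample i

n_samples = len(groups)

# Library size M_i: total count in sample i.
lib_sizes = [sum(row[i] for row in counts.values()) for i in range(n_samples)]

def relative_abundance(row, group):
    """Pooled estimate of p_gj: the gene's counts in the group divided by
    the summed library sizes of that group."""
    idx = [i for i, g in enumerate(groups) if g == group]
    return sum(row[i] for i in idx) / sum(lib_sizes[i] for i in idx)

# Expected count mu_gi = M_i * p_gj for each gene and sample.
expected = {
    gene: [lib_sizes[i] * relative_abundance(row, groups[i]) for i in range(n_samples)]
    for gene, row in counts.items()
}

print("library sizes:", lib_sizes)
for gene, mus in expected.items():
    print(gene, [round(m, 2) for m in mus])
```

Samples in the same group share one $p_{gj}$, so their expected counts differ only through the library size.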

We use the NB parameterization where:

  • the mean is $\mu_{gi} = M_i p_{gj}$
  • the variance is $\mu_{gi}(1 + \mu_{gi} \phi_g)$
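
The mean and variance above can be checked numerically. The sketch below writes the NB probability mass function in terms of the mean and the dispersion (using size $1/\phi$) and sums over a truncated range of counts; the values of the mean and dispersion are arbitrary, and the helper name is hypothetical.

```python
import math

def nb_pmf(y, mu, phi):
    """P(Y = y) for a negative binomial with mean mu and dispersion phi > 0."""
    r = 1.0 / phi  # NB "size" parameter
    log_p = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )
    return math.exp(log_p)

mu, phi = 20.0, 0.1  # arbitrary illustrative values

# Truncated support; the probability mass beyond this range is negligible here.
ys = range(0, 2000)
probs = [nb_pmf(y, mu, phi) for y in ys]

mean = sum(y * p for y, p in zip(ys, probs))
var = sum((y - mean) ** 2 * p for y, p in zip(ys, probs))

print("numerical mean:    ", mean)
print("numerical variance:", var)
print("mu * (1 + mu*phi): ", mu * (1 + mu * phi))
```

The numerical variance should agree with $\mu(1 + \mu\phi)$ up to floating-point error.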

For differential expression analysis:

  • the parameters of interest are $p_ {gj}$.

The NB distribution reduces to the Poisson distribution when $\phi_g = 0$.
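
A small numerical sketch of this limit: as the dispersion shrinks toward zero, the NB probabilities approach the Poisson probabilities with the same mean. The mean, the dispersion values, and the helper names are chosen only for illustration.

```python
import math

def nb_pmf(y, mu, phi):
    """P(Y = y) for a negative binomial with mean mu and dispersion phi > 0."""
    r = 1.0 / phi
    log_p = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )
    return math.exp(log_p)

def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson distribution with mean mu."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

mu = 5.0  # arbitrary illustrative mean

def max_abs_diff(phi):
    """Largest pointwise gap between the NB and Poisson pmfs over y = 0..59."""
    return max(abs(nb_pmf(y, mu, phi) - poisson_pmf(y, mu)) for y in range(60))

phis = (1.0, 0.1, 0.01, 0.001)
diffs = [max_abs_diff(phi) for phi in phis]
for phi, d in zip(phis, diffs):
    print(f"phi = {phi:<6} max |NB - Poisson| = {d:.2e}")
```

The gap shrinks steadily as the dispersion decreases, which is the sense in which the Poisson model is the zero-dispersion special case.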

In some DGE applications, technical variation can be treated as Poisson.
In general, $\phi_g$ represents the squared coefficient of variation of biological variation between the samples. In this way, the model is able to separate biological from technical variation.
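
One way to see this separation is to divide the NB variance by the squared mean, which gives the squared coefficient of variation of the counts:

$$
\mathrm{CV}^2(Y_{gi}) = \frac{\mu_{gi}(1 + \mu_{gi}\phi_g)}{\mu_{gi}^2} = \frac{1}{\mu_{gi}} + \phi_g
$$

The first term is the Poisson (technical) contribution, which shrinks as the expected count grows. The second term, $\phi_g$, does not depend on sequencing depth, so $\sqrt{\phi_g}$ can be read as the biological coefficient of variation.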

Workflow in edgeR: estimate dispersions -> topTags: tabulate the top differentially expressed genes -> plotSmear: draw an MA plot

More

There are a few terms and algorithms I do not understand. So, I’ll update this page later.


  1. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance, Bioinformatics, 2007, vol. 23 (pg. 2881-2887) ↩︎

  2. Robinson MD, Smyth GK. Small sample estimation of negative binomial dispersion, with applications to SAGE data, Biostatistics, 2008, vol. 9 (pg. 321-332) ↩︎

  3. Wong JWH, et al. Computational methods for the comparative quantification of proteins in label-free LCn-MS experiments, Brief. Bioinform., 2008, vol. 9 (pg. 156-165) ↩︎

  4. Andersson AF, et al. Comparative analysis of human gut microbiota by barcoded pyrosequencing, PLoS ONE, 2008, vol. 3 pg. e2836 ↩︎

  5. Rodríguez-Esteban, G., González-Sastre, A., Rojo-Laguna, J.I. et al. Digital gene expression approach over multiple RNA-Seq data sets to detect neoblast transcriptional changes in Schmidtea mediterranea . BMC Genomics 16, 361 (2015). https://doi.org/10.1186/s12864-015-1533-1 ↩︎

  6. Yamamoto M, Wakatsuki T, Hada A, Ryo A. Use of serial analysis of gene expression (SAGE) technology. J Immunol Methods. 2001 Apr;250(1-2):45-66. doi: 10.1016/s0022-1759(01)00305-2. PMID: 11251221. ↩︎

Author

Karobben

Posted on

2021-04-07

Updated on

2024-01-11
