Posted 2021-04-07 · Updated 2024-01-11 · Notes · 5 minutes read (About 689 words)

edgeR

edgeR: empirical analysis of DGE in R

cite: Mark D. Robinson, Davis J. McCarthy, Gordon K. Smyth, edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics, Volume 26, Issue 1, 1 January 2010, Pages 139–140, https://doi.org/10.1093/bioinformatics/btp616

An overdispersed Poisson model is used to account for both biological and technical variability.

Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference.

The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated.

Why edgeR

For microarrays, the abundance of a particular transcript is measured as a fluorescence intensity, effectively a continuous response.

For digital gene expression (DGE) data, the abundance is observed as a count.

Therefore, procedures that are successful for microarray data are not directly applicable to DGE data.

edgeR is designed for the analysis of replicated count-based expression data and is an implementation of methodology developed by Robinson and Smyth^[1]^[2].

It was initially developed for serial analysis of gene expression (SAGE), but the underlying model only requires counts.
As a result, edgeR may also be useful in other experiments that generate counts, such as ChIP-seq, in proteomics experiments where spectral counts are used to summarize the peptide abundance^[3] or in barcoding experiments where several species are counted^[4].

Digital gene expression: Digital gene expression (DGE) is a sequence-based approach for gene expression analyses, that generates a digital output at an unparalleled level of sensitivity^[5].

Serial analysis of gene expression (SAGE): Serial analysis of gene expression, or SAGE, is an experimental technique designed to gain a direct and quantitative measure of gene expression. The SAGE method is based on the isolation of unique sequence tags (9-10 bp in length) from individual mRNAs and concatenation of tags serially into long DNA molecules for a lump-sum sequencing^[6].


Method

In limma (Smyth, 2004), an empirical Bayes model is used to moderate the probe-wise variances. edgeR follows a similar idea for count data.

In edgeR:
We assume the data can be summarized into a table of counts.
We model the data as negative binomial (NB) distributed:
$$
Y_{gi} \sim \mathrm{NB}(M_i p_{gj}, \phi_g)
$$

For gene $g$ and sample $i$:
$M_i$: the library size (total number of reads),
$\phi_g$: the dispersion,
$p_{gj}$: the relative abundance of gene $g$ in experimental group $j$ to which sample $i$ belongs.
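As a small illustration of these quantities, the sketch below builds a toy count table and computes the library sizes $M_i$ plus a naive pooled estimate of $p_{gj}$ (group count total divided by group library-size total). The gene names, group labels, counts, and the helper `pooled_abundance` are all hypothetical, and the pooled ratio is a simple Poisson-style estimate chosen for illustration, not the estimator edgeR itself uses.

```python
# Hypothetical count table: one row per gene, one column per sample.
counts = {
    "geneA": [10, 12, 30, 28],
    "geneB": [100, 90, 95, 110],
    "geneC": [0, 1, 5, 4],
}
# Experimental group j of each sample i (hypothetical labels).
groups = ["ctrl", "ctrl", "treat", "treat"]

n_samples = len(groups)

# M_i: library size of sample i, here the column total of the toy table.
lib_sizes = [sum(row[i] for row in counts.values()) for i in range(n_samples)]


def pooled_abundance(row, group):
    """Naive estimate of p_gj: total count in the group / total library size in the group."""
    idx = [i for i, g in enumerate(groups) if g == group]
    return sum(row[i] for i in idx) / sum(lib_sizes[i] for i in idx)


# p_hat[gene][group] approximates the relative abundance p_gj.
p_hat = {
    gene: {g: pooled_abundance(row, g) for g in sorted(set(groups))}
    for gene, row in counts.items()
}

for gene, by_group in p_hat.items():
    print(gene, {g: round(p, 4) for g, p in by_group.items()})
```

Because the toy library sizes are column totals, the estimated abundances within each group sum to 1 across genes.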

We use the NB parameterization where:

the mean is $\mu_{gi} = M_i p_{gj}$
the variance is $\mu_{gi}(1 + \mu_{gi} \phi_g)$
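A quick numerical check of this parameterization, as a minimal sketch: the NB probability mass function is written in terms of the mean and dispersion (with "size" $r = 1/\phi_g$, a symbol introduced here for convenience), and the mean and variance are recovered by summing over counts. The values of `M_i`, `p_gj`, and `phi_g` are made up for the example.

```python
import math


def nb_log_pmf(y, mu, phi):
    """Log-probability of count y under an NB with mean mu and dispersion phi > 0."""
    r = 1.0 / phi  # "size" parameter, so that variance = mu + phi * mu**2
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def nb_moments(mu, phi, y_max=5000):
    """Recover mean and variance by summing the pmf up to y_max (tail is negligible here)."""
    probs = [math.exp(nb_log_pmf(y, mu, phi)) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var


# Hypothetical values: library size, relative abundance, dispersion.
M_i, p_gj, phi_g = 1_000_000, 5e-5, 0.1
mu_gi = M_i * p_gj                           # mean, about 50
expected_var = mu_gi * (1 + mu_gi * phi_g)   # about 50 * (1 + 5) = 300

mean, var = nb_moments(mu_gi, phi_g)
print(f"mean: {mean:.4f} (expected {mu_gi:.4f})")
print(f"variance: {var:.4f} (expected {expected_var:.4f})")
```

The summed moments agree with $\mu_{gi}$ and $\mu_{gi}(1 + \mu_{gi}\phi_g)$ up to floating-point error.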

For differential expression analysis:

the parameters of interest are $p_{gj}$.

The NB distribution reduces to Poisson when $\phi_g = 0$.
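A short derivation sketch of this limit, dropping the indices and writing $r = 1/\phi_g$ (a symbol introduced here, not in the original notes). The NB probability of observing count $y$ with mean $\mu$ is

$$
P(Y = y) = \frac{\Gamma(y + r)}{y!\,\Gamma(r)} \left(\frac{r}{r + \mu}\right)^{r} \left(\frac{\mu}{r + \mu}\right)^{y}
$$

As $\phi_g \to 0$, that is $r \to \infty$:

$$
\frac{\Gamma(y + r)}{\Gamma(r)\,(r + \mu)^{y}} \to 1, \qquad \left(\frac{r}{r + \mu}\right)^{r} = \left(1 + \frac{\mu}{r}\right)^{-r} \to e^{-\mu}
$$

so that

$$
P(Y = y) \to \frac{e^{-\mu} \mu^{y}}{y!}
$$

which is the Poisson probability with mean $\mu$. The variance $\mu(1 + \mu \phi_g)$ likewise collapses to $\mu$.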

In some DGE applications, technical variation can be treated as Poisson.
In general, $\phi_g$ represents the squared coefficient of variation of the biological variation between the samples (its square root is the biological coefficient of variation). In this way, the model is able to separate biological from technical variation.
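One way to see this separation is to simulate the NB as a gamma-Poisson mixture: a gamma-distributed "true" expression level (biological layer) followed by Poisson counting (technical layer). Under this model the squared coefficient of variation of a count is $1/\mu + \phi$, where $1/\mu$ is the Poisson part. The sketch below is an illustration under assumed values of `mu` and `phi`; the helper names are hypothetical and the Poisson sampler is Knuth's simple multiplication method, which is only suitable for moderate means.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_counts(mu, phi, n, rng):
    """Draw n counts from a gamma-Poisson mixture (NB with mean mu, dispersion phi)."""
    counts = []
    for _ in range(n):
        if phi > 0:
            # Biological layer: expression level with mean mu and squared CV phi.
            lam = rng.gammavariate(1.0 / phi, mu * phi)
        else:
            lam = mu  # no biological variation: pure Poisson
        # Technical layer: Poisson counting noise around the expression level.
        counts.append(poisson_sample(lam, rng))
    return counts


def squared_cv(xs):
    """Sample variance divided by the squared sample mean."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return v / m ** 2


rng = random.Random(1)   # fixed seed for reproducibility
mu, phi = 50.0, 0.1      # hypothetical mean and dispersion

cv2_nb = squared_cv(simulate_counts(mu, phi, 20000, rng))
cv2_pois = squared_cv(simulate_counts(mu, 0.0, 20000, rng))

print(f"NB squared CV: {cv2_nb:.4f} (theory {1 / mu + phi:.4f})")
print(f"Poisson squared CV: {cv2_pois:.4f} (theory {1 / mu:.4f})")
```

The difference between the two simulated squared CVs should be close to `phi`, which is the biological component.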

edgeR workflow: dispersion estimates -> topTags: tabulate the top differentially expressed genes -> plotSmear: MA plot

There are a few terms and algorithms I do not understand. So, I’ll update this page later.

Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance, Bioinformatics, 2007, vol. 23 (pg. 2881-2887) ↩︎
Robinson MD, Smyth GK. Small sample estimation of negative binomial dispersion, with applications to SAGE data, Biostatistics, 2008, vol. 9 (pg. 321-332) ↩︎
Wong JWH, et al. Computational methods for the comparative quantification of proteins in label-free LCn-MS experiments, Brief. Bioinform., 2008, vol. 9 (pg. 156-165) ↩︎
Andersson AF, et al. Comparative analysis of human gut microbiota by barcoded pyrosequencing, PLoS ONE, 2008, vol. 3 (pg. e2836) ↩︎
Rodríguez-Esteban G, González-Sastre A, Rojo-Laguna JI, et al. Digital gene expression approach over multiple RNA-Seq data sets to detect neoblast transcriptional changes in Schmidtea mediterranea, BMC Genomics, 2015, vol. 16 (pg. 361). https://doi.org/10.1186/s12864-015-1533-1 ↩︎
Yamamoto M, Wakatsuki T, Hada A, Ryo A. Use of serial analysis of gene expression (SAGE) technology, J Immunol Methods, 2001, vol. 250(1-2) (pg. 45-66). doi: 10.1016/s0022-1759(01)00305-2. PMID: 11251221. ↩︎