0 Posted 2022-10-03Updated 2024-01-11Biology / Bioinformatics / Protocol / RNA-Seq10 minutes read (About 1500 words)

Gene Set Enrichment Analysis (GESA) in R

How it works and why it matters

Here is an example explained by ChatGPT

Suppose we have a gene expression dataset consisting of 6 genes (A, B, C, D, E, and F) and 3 samples (S1, S2, and S3), and we want to test whether a pre-defined gene set called “MySet” is enriched in this dataset. Here’s the dataset:

Gene S1 S2 S3

A 1 2 1

B 2 4 1

C 3 6 2

D 4 8 3

E 1 2 1

F 2 4 1

The first step in GSEA is to rank the genes based on their correlation with the phenotype of interest. In this example, we will use the sum of expression values across samples as the phenotype. The ranked list of genes based on this phenotype is:

Gene Rank

D 15

C 11

B 7

F 7

A 4

E 4

The second step is to calculate an enrichment score for the gene set “MySet” by walking down the ranked list of genes, and computing a running sum of the signed correlation between each gene and the phenotype. If a gene is in “MySet”, its correlation is positive; otherwise, it is negative. The enrichment score is the maximum deviation from zero of the running sum. Here’s the calculation:

Gene Rank In MySet? Correlation Running Sum

D 15 No -1 -1

C 11 No -1 -2

B 7 Yes +1 -1

F 7 Yes +1 0

A 4 Yes +1 1

E 4 No -1 0

The maximum deviation from zero of the running sum is 1, which occurs at gene A. Therefore, the enrichment score for “MySet” is 1.

This example demonstrates the basic steps of GSEA, but in practice, GSEA is often applied to much larger datasets and gene sets, and the significance of the enrichment score is assessed using statistical tests such as permutation tests or gene set permutation tests.

Gene	S1	S2	S3
A	1	2	1
B	2	4	1
C	3	6	2
D	4	8	3
E	1	2	1
F	2	4	1

Gene	Rank
D	15
C	11
B	7
F	7
A	4
E	4

Gene	Rank	In MySet?	Correlation	Running Sum
D	15	No	-1	-1
C	11	No	-1	-2
B	7	Yes	+1	-1
F	7	Yes	+1	0
A	4	Yes	+1	1
E	4	No	-1	0

GESA

In clusterProfiler, there are functions designed to do GESA based on GO and KEGG gene sets. Here I am trying to show you how to do GESA with your customized data set.

There are two main functions which are GSEA from clusterProfiler and fgsea from fgsea. A quick enrichment analysis could be done by example data.

library(clusterProfiler)
library(plyr)
library(fgsea)

# loading example data set
data(examplePathways)
data(exampleRanks)

fgseaRes <- fgsea(examplePathways, exampleRanks, nperm=1000, maxSize=500, minSize = 10))

TERM2GENE = ldply(examplePathways, data.frame)
gse <- GSEA(gene = rev.default(exampleRanks) ,  TERM2GENE = TERM2GENE)
#gseaplot(gse, by = "all", title = gse$Description[1], geneSetID = 1)

Default Parameters for Two Functions

`fgsea`	`GSEA`
`fgsea( pathways, stats, nperm, minSize = 1, maxSize = Inf, nproc = 0, gseaParam = 1, BPPARAM = NULL)`	`GSEA( geneList, exponent = 1, nPerm = 1000, minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, pAdjustMethod = "BH", TERM2GENE, TERM2NAME = NA, verbose = TRUE, seed = FALSE, by = "fgsea")`

Check the Results from Tow Functions

List <- gse$Description[c(1,30,100)]
gse[gse$Description %in% List]
fgseaRes[fgseaRes$pathway %in% List,]

Description	setSize	enrichmentScore	NES	pvalue	p.adjust	qvalues	rank	leading_edge
5991611_Processive_synthesis_on_the_C-strand_of_the_telomere	11	0.747047513123146	1.92204468456492	0.00374531835205993	0.0280713464604887	0.0209222578082764	2110	tags=82%, list=18%, signal=67%
5990978_M_G1_Transition	63	0.578846390178477	2.27421327922627	0.00154320987654321	0.0161091249574396	0.012006522946795	1970	tags=48%, list=16%, signal=40%
5991851_Mitotic_Prometaphase	82	0.725326964773323	2.96349021486606	0.0015527950310559	0.0161091249574396	0.012006522946795	1042	tags=54%, list=9%, signal=49%

pathway	pval	padj	ES	NES	size
5991611_Processive_synthesis_on_the_C-strand_of_the_telomere	0.001865672	0.0307185	0.7470475	1.919031	11
5990978_M_G1_Transition	0.001545595	0.0307185	0.5788464	2.236994	63
5991851_Mitotic_Prometaphase	0.001492537	0.0307185	0.7253270	2.976492	82

According to the comparison from above, their p-values are similar. But GSEA has a smaller p.adjust. Other wise, Enrichment score and NES are the same.

Number of results

length(examplePathways)
nrow(fgseaRes)
nrow(GSEA(gene = rev.default(exampleRanks) ,  TERM2GENE = TERM2GENE, pvalueCutoff = 1)@result)

[1] 1457
[1] 757
[1] 757

In this result, we can find that the number of results is the same. Some of the gene sets are filtered out because of their size.

Result visualization

Both two packages have functions for Visualizing their results. Function plotEnrichment is more friendly to customized data. What a surprise is it has plotGseaTable function which from fgsea could show multiple results in a table-like panel. Though this graphic is not fancy but could be very helpful in some situations.

gseaplot(gse, geneSetID = "5991851_Mitotic_Prometaphase")
plotEnrichment(examplePathways[['5991851_Mitotic_Prometaphase']], exampleRanks)

# Example of using gse result
# plotEnrichment(gse@geneSets[['5991851_Mitotic_Prometaphase']], exampleRanks)

plotGseaTable(examplePathways[List], exampleRanks, fgseaRes,  
                  gseaParam=0.5)


`gseaplot` from `clusterProfiler`	`plotEnrichment` from `fgsea`

Check the Data Formats and Fabric a group of GeneList & GeneSets

exampleRanks:

  170942    109711     18124     12775     72148     16010
  -63.33703 -49.74779 -43.63878 -41.51889 -33.26039 -32.77626

examplePathways:

  $`186589_Late_stage_branching_morphogenesis_pancreatic_bud_precursor_cells`
  [1] "11925"  "15205"  "21410"  "246086"

The above two lists show a quick view of example data. exampleRanks is an increasing/decreasing numeric vector. examplePathways is a list that contains the name of each set and genes under each set.

An example of generating your own data:

set.seed(1)
GeneList <- sample(seq(-100,100,0.01), 100)
names(GeneList) <- paste("Gene", c(1:100), sep="_")
GeneSet <- append(list(names(GeneList)[1:50]), list(names(GeneList)[40:100]))
names(GeneSet) <- c("Set1", "Set2")

# List table for clusterProfiler
TERM2GENE_exp = ldply(GeneSet, data.frame)
# Sort the GeneList
GeneList <- sort(GeneList, decreasing =  T)
fgseaRes <- fgsea(GeneSet, GeneList, nperm=1000, maxSize=500, minSize = 10)

gse <- GSEA(gene = GeneList ,  TERM2GENE = TERM2GENE_exp, pvalueCutoff = 1)

pathway	pval	padj	ES	NES	nMoreExtreme	size
Set1	0.845140032948929	0.980519480519481	0.20117826706011	0.764831637359588	512	50
Set2	0.980519480519481	0.980519480519481	0.156182133808093	0.582211034023132	603	61

Description	setSize	enrichmentScore	NES	pvalue	p.adjust	qvalues	rank	leading_edge
Set1	50	0.20117826706011	0.750929659683889	0.834633385335413	0.971742543171115	0.971742543171115	34	tags=38%, list=34%, signal=50%
Set2	61	0.156182133808093	0.586558400598972	0.971742543171115	0.971742543171115	0.971742543171115	18	tags=18%, list=18%, signal=38%

Enrichment Plot

library(enrichplot)

gseaplot2(gse, geneSetID = 1, title = gse$Description[1], color = 'salmon', pvalue_table = TRUE)
gseaplot2(gse, geneSetID = 1:7, title = "GSEA", color = 'salmon', pvalue_table = TRUE)


`geneSetID = 1`	`geneSetID = 1:7`

What is nPerm

The nominal p value estimates the statistical significance of the enrichment score for a single gene set. However, when you are evaluating multiple gene sets, you must correct for gene set size and multiple hypothesis testing. Because the p value is not adjusted for either, it is of limited value when comparing gene sets.

The FDR is adjusted for gene set size and multiple hypotheses testing while the p value is not. When a top gene set has a small nominal p value and a high FDR value, it generally indicates that it is not as significant when compared with other gene sets in the empirical null distribution. This could be because you do not have enough samples, the biological signal is subtle, or the gene sets do not represent the biology in question very well. On the other hand, the FDR is based on two distributions of all gene sets; if only one of many gene sets is enriched, that gene set is likely to have a high FDR. Finally, a top gene set with a high nominal p value and a low FDR value, generally indicates a negative result: the gene set itself is not significant and other sets are weaker.

In the GSEA report, a p value of zero (0.0) indicates an actual p value of less than 1/number-of-permutations. For example, if the analysis performed 100 permutations, a reported p value of 0.0 indicates an actual p value of less than 0.01. For a more accurate p value, increase the number of permutations performed by the analysis. Typically, you will want to perform 1000 permutations (phenotype or gene_set). (If you attempt to perform significantly more than 1000 permutations, GSEA may run out of memory.)

More FAQ could be found at: software.broadinstitute.org

Other things you need to know

ES (enrichment score): reflects the degree to which a gene-set is overrepresented at the top or bottom of a ranked list of genes.

NES (normalized enrichment score): NES corrects for differences in ES between gene-sets due to differences in gene-set sizes. It enables to compare the scores of the different tested gene-sets with each other.
NES = actual ES / mean of all ESs obtained from all random permutations for the single gene-set that is being tested

nom p-value: The nominal p value estimates the statistical significance of the enrichment score for a single gene set. The p-value is calculated from the null distribution.
Using gene-set permutation, the null distribution is created by generating, for each permutation, a random gene set the same size as your specified gene set by selecting that number of genes from all of the genes in your expression data set (or pre-ranked list), and then calculating the enrichment score for that randomly selected gene set. The distribution of those enrichment scores across all of the permutations constitutes the null distribution.

FDR: corrects for multiple hypothesis testing and enable a more correct comparison of the different tested gene-sets with each other.

note: for a given gene-set S and observed NES, called NES*, FDR is [% of all NES (including permutations) >= NES*] / [% of all observed NES (=NES for all tested gene-sets) >= NES*]

relationships between ES, pvalue , NES and FDR:

pvalue is calculated from ES

FDR is calculated from NES

the higher the ES or NES and the lowest the FDR or pvalue

NES above 1.4 will usually give significant results

© BADER LAB, 2016