Gene set enrichment analysis

Gene sets

A gene set is an unordered collection of genes that are functionally related. Commonly used sources of gene sets include:

  • Gene Ontology (GO)
  • Kyoto Encyclopedia of Genes and Genomes (KEGG)
  • Disease Ontology (DO)
  • Molecular Signatures Database (MSigDB)
  • CellMarker

Over Representation Analysis

The goal of Over Representation Analysis (ORA) is to determine whether the genes of a known biological function or process are over-represented (enriched) in an experimentally derived gene list L. The significance of the overlap is typically assessed with one of the following tests:

  • Pearson's chi-squared test
  • Fisher's exact test
  • Hypergeometric test
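
As a minimal sketch of the hypergeometric and Fisher's exact tests listed above, the base R snippet below asks whether the overlap between a gene set and the list L is larger than expected by chance. All counts (N, K, n, k) are hypothetical values chosen only for illustration; they do not come from any real experiment.

```r
# Hypothetical counts, chosen only for illustration.
N <- 20000  # genes in the background (universe)
K <- 200    # background genes annotated to the gene set
n <- 500    # genes in the experimentally derived list L
k <- 15     # genes in L that are also in the gene set

# Hypergeometric test: P(X >= k) when drawing n genes without replacement.
# phyper() returns P(X <= q), so use q = k - 1 with lower.tail = FALSE.
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# The same question posed as a one-sided Fisher's exact test on the 2x2 table.
tab <- matrix(c(k, K - k, n - k, N - K - n + k), nrow = 2,
              dimnames = list(c("in_L", "not_in_L"),
                              c("in_set", "not_in_set")))
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

With these numbers about 5 overlapping genes would be expected by chance (n * K / N), so an overlap of 15 yields a small p-value. The one-sided Fisher's exact test and the hypergeometric upper tail give the same p-value, because both are computed from the same distribution.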

Gene set enrichment analysis

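The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of a ranked gene list L, in which case the gene set is correlated with the phenotypic class distinction. Unlike the list used in ORA, L here contains all measured genes, ordered by their association with the phenotype.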

The enrichment score is a Kolmogorov-Smirnov-like running-sum statistic computed by walking down the ranked list L.
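
The running-sum idea behind a Kolmogorov-Smirnov-style enrichment score can be sketched in base R as follows. This is the unweighted form only: the GSEA method additionally weights each hit by the gene's ranking metric, which is omitted here. The ranked list and the gene set are hypothetical toy data.

```r
# Hypothetical ranked list L (most to least associated with the phenotype).
ranked <- c("g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10")
S <- c("g1", "g2", "g4")  # hypothetical gene set

# Step up at every member of S, step down at every non-member.
# The step sizes are chosen so that the running sum ends at zero.
hit <- ranked %in% S
step <- ifelse(hit, 1 / sum(hit), -1 / sum(!hit))
running <- cumsum(step)

# Enrichment score: the largest deviation of the running sum from zero.
es <- running[which.max(abs(running))]
```

Here the members of S sit near the top of the list, so the running sum peaks early (at the fourth gene, with a value of 6/7). A gene set concentrated at the bottom of the list would instead produce a negative score.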

Gene set variation analysis

Gene set variation analysis (GSVA) is a particular type of gene set enrichment method that works on single samples. It enables pathway-centric analyses of molecular data by performing a conceptually simple but powerful change in the functional unit of analysis, from gene to gene set.

GSVA provides an estimate of pathway activity by transforming an input gene-by-sample expression data matrix into a corresponding gene-set-by-sample expression data matrix.

library(GSVA)  # X: gene-by-sample expression matrix; gs: list of gene sets
gsva.es <- gsva(X, gs, verbose=FALSE, method="gsva")

The method argument selects the method used to estimate gene-set enrichment scores per sample. By default it is set to gsva (Hänzelmann, Castelo, and Guinney 2013); the other options are:

  • ssgsea (Barbie et al. 2009)
  • zscore (Lee et al. 2008)
  • plage (Tomfohr, Lu, and Kepler 2005)
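
To make the gene-by-sample to gene-set-by-sample transformation concrete, here is a minimal base R sketch of the combined z-score idea behind the zscore option: each gene is standardized across samples, and the z-scores of a set's member genes are summed and divided by the square root of the number of genes. This is an illustration under stated assumptions, not the GSVA package's own code; the matrix X, the gene sets in gs, and the gene names are hypothetical.

```r
set.seed(1)
# Hypothetical gene-by-sample expression matrix: 6 genes x 4 samples.
X <- matrix(rnorm(24), nrow = 6,
            dimnames = list(paste0("g", 1:6), paste0("s", 1:4)))
# Hypothetical gene sets; "g99" is deliberately absent from X.
gs <- list(setA = c("g1", "g2", "g3"),
           setB = c("g4", "g5", "g6", "g99"))

# Standardize each gene across samples (scale() works on columns, hence t()).
Z <- t(scale(t(X)))

# Per sample, combine the z-scores of the member genes:
# sum of z-scores divided by sqrt(number of genes actually used).
es <- t(sapply(gs, function(genes) {
  idx <- intersect(genes, rownames(Z))  # ignore genes missing from X
  colSums(Z[idx, , drop = FALSE]) / sqrt(length(idx))
}))
```

The result es has one row per gene set and one column per sample, so the downstream analysis can treat gene sets as the functional unit in place of individual genes.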

The only requirement for analyzing RNA-seq integer count data is to set the argument kcdf="Poisson"; the default is "Gaussian".

If the RNA-seq-derived expression levels are continuous, such as log-CPMs, log-RPKMs or log-TPMs, the kcdf argument should be left at its default value.