RNA-seq: filtering, quality control and visualisation
Library sizes and distribution plots
Plot the library sizes as a bar plot to see whether there are any major discrepancies between the samples.
Examine the distributions of the raw counts.
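As a rough illustration of the first step, the library size of a sample is simply the sum of its counts over all genes. The sketch below uses a small made-up count matrix (the gene names, sample names and values are hypothetical) and prints a text bar per sample; in practice the bars would be drawn with a plotting function.

```python
# Hypothetical count matrix: one row per gene, one column per sample.
samples = ["s1", "s2", "s3"]
counts = {
    "geneA": [10, 12, 3],
    "geneB": [200, 180, 40],
    "geneC": [0, 1, 0],
    "geneD": [1500, 1600, 300],
}


def library_sizes(counts, samples):
    """Sum the counts in each column to get one library size per sample."""
    sizes = [0] * len(samples)
    for row in counts.values():
        for i, value in enumerate(row):
            sizes[i] += value
    return dict(zip(samples, sizes))


def text_barplot(sizes, width=40):
    """Return one line per sample, with a bar scaled to the largest library."""
    largest = max(sizes.values()) or 1  # avoid dividing by zero
    lines = []
    for name, size in sizes.items():
        bar = "#" * round(width * size / largest)
        lines.append(f"{name:>4} {bar} {size}")
    return lines


sizes = library_sizes(counts, samples)
bars = text_barplot(sizes)
for line in bars:
    print(line)
```

In this toy data the third sample has a much shorter bar than the other two, which is the kind of discrepancy the bar plot is meant to expose.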
Count data are not normally distributed, so to examine the distributions of the raw counts we first need to log-transform them.
We can use the vst function from DESeq2 to apply a variance-stabilising transformation.
The effect is to remove the dependence of the variance on the mean, particularly the high variance of the logarithm of count data when the mean is low.
The resulting counts have also been normalized with respect to library size or other normalization factors. If a sample's distribution lies far above or below the blue horizontal line in the plot, we may need to investigate that sample further.
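The points above can be sketched with a plain log2 transformation. This is not the vst function from DESeq2, only a simplified stand-in: it uses log2(count + 1), with the pseudocount of 1 being an assumption made here to avoid taking the logarithm of zero. All counts are made up. The first part shows why the logarithm alone is not enough: a gene with a low mean has a much larger variance on the log scale than a gene with a high mean. The second part compares each sample's median log count with the overall median, assuming (the text does not say so) that the horizontal reference line marks that overall median.

```python
import math
from statistics import median, variance


def log2_counts(values, pseudocount=1):
    """Plain log2 transform; the pseudocount is an assumption to handle zeros."""
    return [math.log2(v + pseudocount) for v in values]


# Hypothetical replicate counts for a lowly and a highly expressed gene.
low_gene = [2, 5, 9]
high_gene = [2450, 2500, 2560]
var_low = variance(log2_counts(low_gene))
var_high = variance(log2_counts(high_gene))
print(f"log-scale variance, low mean:  {var_low:.3f}")
print(f"log-scale variance, high mean: {var_high:.5f}")


def sample_median_offsets(counts_by_sample):
    """Difference between each sample's median log count and the overall median."""
    logged = {s: log2_counts(v) for s, v in counts_by_sample.items()}
    overall = median(x for values in logged.values() for x in values)
    return {s: median(values) - overall for s, values in logged.items()}


# Hypothetical per-sample counts (same genes in the same order).
counts_by_sample = {
    "s1": [10, 200, 0, 1500],
    "s2": [12, 180, 1, 1600],
    "s3": [3, 40, 0, 300],
}
offsets = sample_median_offsets(counts_by_sample)
for sample, offset in offsets.items():
    print(f"{sample}: median offset {offset:+.2f}")
```

A sample whose offset is strongly negative or positive relative to the others, like the third sample here, would be a candidate for closer inspection.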
TMM normalization and voom transformation
Running the gdcVoomNormalization() function normalizes the raw count data with the TMM method implemented in edgeR (Robinson, McCarthy, and Smyth 2010) and then transforms them with the voom method provided in limma (Ritchie et al. 2015).
Lowly expressed genes (logCPM < 1 in more than half of the samples) are filtered out by default.
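The filtering rule can be sketched as follows. This is a simplified illustration, not the code behind gdcVoomNormalization(): it uses plain library sizes instead of TMM-adjusted ones, and it computes log2 counts per million with the offsets used by voom, log2((count + 0.5) / (library size + 1) * 1e6). The function name filter_low_expression and the count matrix are hypothetical.

```python
import math


def filter_low_expression(counts, threshold=1.0):
    """Drop genes whose logCPM is below the threshold in more than half of the samples.

    counts maps each gene name to a list of raw counts, one per sample.
    Library sizes are plain column sums (no TMM scaling in this sketch).
    """
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[i] for row in counts.values()) for i in range(n_samples)]
    kept = {}
    for gene, row in counts.items():
        logcpm = [
            math.log2((c + 0.5) / (lib_sizes[i] + 1) * 1e6)
            for i, c in enumerate(row)
        ]
        n_low = sum(value < threshold for value in logcpm)
        if n_low <= n_samples / 2:  # keep unless low in MORE than half
            kept[gene] = row
    return kept


# Hypothetical counts for four samples with roughly one million reads each.
counts = {
    "geneHigh": [999000, 998000, 999500, 997000],
    "geneMid": [900, 1500, 400, 2500],
    "geneLow": [1, 0, 3, 1],   # low in three of four samples
    "geneEdge": [1, 5, 4, 0],  # low in exactly half of the samples
}
kept = filter_low_expression(counts)
print(sorted(kept))
```

Note that a gene that is low in exactly half of the samples is kept, because the rule only removes genes that are low in more than half.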