Knowledge base

How remove batch effects of RNA sequencing without control genes in R

Introduction

RUVseq can conduct a differential expression (DE) analysis that controls for “unwanted variation”, e.g., batch, library preparation, and other nuisance effects, using the between-sample normalization methods proposed. They call this approach RUVSeq for remove unwanted variation from RNA-Seq data
If no genes are known a priori not to be influenced by the covariates of interest, one can obtain a set of “in-silico empirical” negative controls, e.g., least significantly DE genes based on a first-pass DE analysis performed prior to RUVg normalization.

Code exmaple

Installation of RUVseq and edgeR from bioconductor

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# The following initializes usage of Bioc devel
BiocManager::install(version='devel')

BiocManager::install("RUVSeq")
BiocManager::install("edgeR")

Load needed packge and data

library(RUVSeq)
library(zebrafishRNASeq)
data(zfGenes)

zfGenes data: one gene in a row, one sample in a column
Filter out non-expressed genes, by requiring more than 5 reads in at least two samples for each gene.

filter <- apply(zfGenes, 1, function(x) length(x[x>5])>=2)
filtered <- zfGenes[filter,]

Defined group and saved data in SeqExpressionSet

x <- as.factor(rep(c("Ctl", "Trt"), each=3))
set <- newSeqExpressionSet(as.matrix(filtered),
phenoData = data.frame(x, row.names=colnames(filtered)))

Firstly we run DEG, then we get empirical control gene by considering all but the top 5000 genes as ranked by edgeR p-values.

design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
top <- topTags(lrt, n=nrow(set))$table
empirical <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:5000]))]

Now we can estimate the factors of unwanted variation

set2 <- RUVg(set, empirical, k=1)
pData(set2)
## x W_1
## Ctl1 Ctl -0.10879677
## Ctl3 Ctl 0.23066424
## Ctl5 Ctl 0.19926266
## Trt9 Trt 0.07672121
## Trt11 Trt -0.83540924
## Trt13 Trt 0.43755790

Now , We can put W_1 as the factors of unwanted variation combining with group factor into the negative binomial GLM model of edgeR. Then the DE genes without batch effect came out.

design <- model.matrix(~x + W_1, data=pData(set1))
y <- DGEList(counts=counts(set1), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
topTags(lrt)  # DE gene order by pvalue

If someone wants to use DESeq2 instead of edgeR, the codes are as follows.

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = counts(set1),
colData = pData(set1),
design = ~ W_1 + x)
dds <- DESeq(dds)
res <- results(dds)
res

References

http://www.bioconductor.org/packages/devel/bioc/vignettes/RUVSeq/inst/doc/RUVSeq.pdf

Donation

[paypal-donation]

One thought on “How remove batch effects of RNA sequencing without control genes in R

Leave a Reply

Your email address will not be published. Required fields are marked *