Dealing with batch effect / donor variation in RNA-seq data
10.4 years ago
sbuschow ▴ 10

Hi all,

I am struggling with a statistical question related to RNA seq data.

I have collected RNA from 4 different cell types isolated from the same person, and I have such sets (batches) of cells from 3 individuals. Each batch of 4 cell types was prepared for RNA sequencing separately.

My goals are 1) to identify the relation between these cell types (which ones are most comparable and which ones are more different) and 2) to find genes differentially expressed between the cell types.

After receiving the sequencing data I noticed a clear batch/donor effect between the 3 sets of samples.

Standard normalisation procedures were not effective, because the batch effect differed between genes with high read counts and genes with low read counts. Samples clustered by donor, not by cell type.

What does work pretty nicely is to divide the expression value of each gene in each sample by that gene's mean expression over all 4 cell types from the same donor (i.e. scale out the batch difference per gene), then log-transform the scaled values and use them as input for clustering. After doing this for each donor separately I get a nice clustering according to cell type.
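For readers who want to see the scaling step spelled out, here is a minimal sketch in plain Python. The data layout, the function name `scale_by_donor_mean` and all numbers are made up for illustration; it also assumes every value is positive (real count data with zeros would need a pseudocount before the log).

```python
import math


def scale_by_donor_mean(expr):
    """expr: {donor: {cell_type: [value per gene]}}.

    Returns the same structure holding log2(value / donor mean), where the
    donor mean is taken per gene over all cell types of that donor.
    """
    scaled = {}
    for donor, samples in expr.items():
        n_genes = len(next(iter(samples.values())))
        # Per-gene mean over all cell types of this donor.
        gene_means = [
            sum(values[g] for values in samples.values()) / len(samples)
            for g in range(n_genes)
        ]
        scaled[donor] = {
            cell: [math.log2(values[g] / gene_means[g]) for g in range(n_genes)]
            for cell, values in samples.items()
        }
    return scaled


# Toy data: two cell types, two donors, and a donor effect that multiplies
# each gene by a different factor (all numbers are invented).
profiles = {"A": [10.0, 20.0, 40.0, 80.0], "B": [80.0, 40.0, 20.0, 10.0]}
donor_effect = {"d1": [1.0, 1.0, 1.0, 1.0], "d2": [4.0, 0.5, 2.0, 0.25]}
expr = {
    donor: {
        cell: [p * e for p, e in zip(profile, effect)]
        for cell, profile in profiles.items()
    }
    for donor, effect in donor_effect.items()
}

scaled = scale_by_donor_mean(expr)
```

In this toy case the donor effect is purely multiplicative per gene, so it cancels and the scaled profiles of cell type A are the same for both donors. That is the situation in which the trick works cleanly; it says nothing yet about whether the scaled values are valid input for a statistical test.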

My question now is whether this is a legitimate thing to do. I cannot find any literature on a similar case.

Secondly, I would like to know whether I can use the resulting values in a statistical test such as ANOVA to find DEGs. I realize that by centering on the mean I have made the samples from each donor interdependent, and that I am removing variation in gene expression levels, so it does not feel completely right, but because the clustering performs so well I am tempted to continue. Genes that are consistently higher in one cell type than in another would also be interesting to me.
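One concrete way to see the interdependence, as a sketch: assume the centering is done on the log scale (subtracting each donor's per-gene mean, a slight variation on dividing by the raw mean). A plain one-way ANOVA on the centered values then has the same sums of squares as a model with donor as a blocking factor, but it counts too many residual degrees of freedom, because it ignores the one mean per donor that was already estimated. The values below are invented log-expression values for a single gene, and the function names are illustrative only.

```python
def blocked_f(y):
    """F statistic for cell type with donor as a block; y[donor][cell] is one gene."""
    donors = list(y)
    cells = list(y[donors[0]])
    n_d, n_c = len(donors), len(cells)
    grand = sum(y[d][c] for d in donors for c in cells) / (n_d * n_c)
    cell_mean = {c: sum(y[d][c] for d in donors) / n_d for c in cells}
    donor_mean = {d: sum(y[d][c] for c in cells) / n_c for d in donors}
    ss_cell = n_d * sum((cell_mean[c] - grand) ** 2 for c in cells)
    ss_resid = sum(
        (y[d][c] - cell_mean[c] - donor_mean[d] + grand) ** 2
        for d in donors
        for c in cells
    )
    df_resid = (n_c - 1) * (n_d - 1)  # donor means cost n_d - 1 extra df
    return (ss_cell / (n_c - 1)) / (ss_resid / df_resid)


def one_way_f_after_centering(y):
    """One-way ANOVA F for cell type on donor-centered values, donor then ignored."""
    donors = list(y)
    cells = list(y[donors[0]])
    n_d, n_c = len(donors), len(cells)
    centered = {
        d: {c: y[d][c] - sum(y[d].values()) / n_c for c in cells} for d in donors
    }
    cell_mean = {c: sum(centered[d][c] for d in donors) / n_d for c in cells}
    # After centering, the grand mean is zero (up to rounding).
    ss_cell = n_d * sum(cell_mean[c] ** 2 for c in cells)
    ss_within = sum(
        (centered[d][c] - cell_mean[c]) ** 2 for d in donors for c in cells
    )
    df_within = n_c * (n_d - 1)  # treats all samples as independent
    return (ss_cell / (n_c - 1)) / (ss_within / df_within)


# Invented log2 expression of one gene: 3 donors x 4 cell types.
y = {
    "d1": {"A": 5.0, "B": 6.1, "C": 7.0, "D": 8.2},
    "d2": {"A": 7.1, "B": 7.9, "C": 9.2, "D": 10.0},
    "d3": {"A": 3.9, "B": 5.0, "C": 6.1, "D": 6.8},
}
f_blocked = blocked_f(y)
f_naive = one_way_f_after_centering(y)
```

With 4 cell types and 3 donors the naive test uses 8 residual degrees of freedom where only 6 are available, so `f_naive` is `f_blocked` inflated by a factor of 8/6, and it would also be compared against the wrong F distribution. Under these assumptions the centered values are fine for clustering, but testing on them as if the samples were independent overstates significance.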

Any feedback on possible mistakes I am introducing and/or alternative methods I can use is very much appreciated!

Thank you!

Sonja

statistics RNA-Seq ANOVA • 5.6k views
10.4 years ago
Ming Tommy Tang ★ 4.5k

Use the Bioconductor package sva.

9.7 years ago
kangyueapril ▴ 80

SVA is more suitable for microarray data. For RNA-seq, you can leave the batch difference in place when you normalize. When you look for DEGs, however, build your model with both condition and batch as factors, then test for DEGs on the condition factor.
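To illustrate why including batch in the model matters, here is a rough sketch of the simplest case: one gene, two cell types, donor as the batch. With only two conditions, a model with condition and donor as factors amounts to a paired t-test on the within-donor differences. The numbers are invented log2 expression values and the function names are hypothetical; a real analysis would more likely use a count-based package such as DESeq2, edgeR or limma-voom with a design containing both donor and cell type.

```python
import math
from statistics import mean, stdev, variance


def paired_t(log_a, log_b):
    """t statistic on within-donor differences (donor effect cancels out)."""
    diffs = [a - b for a, b in zip(log_a, log_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def unpaired_t(log_a, log_b):
    """Pooled-variance two-sample t statistic that ignores the donor."""
    n_a, n_b = len(log_a), len(log_b)
    pooled_var = (
        (n_a - 1) * variance(log_a) + (n_b - 1) * variance(log_b)
    ) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    return (mean(log_a) - mean(log_b)) / se


# Invented log2 expression of one gene; position i is donor i in both lists.
cell_a = [5.0, 8.0, 3.0]
cell_b = [6.0, 9.2, 3.9]

t_with_donor = paired_t(cell_a, cell_b)
t_without_donor = unpaired_t(cell_a, cell_b)
```

In this toy example the gene is about 1 log2 unit lower in cell type A in every donor, but the donors differ from each other by much more than that. Ignoring the donor buries the effect in between-donor variance, while accounting for it gives a large t statistic on the consistent within-donor difference.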
