Removing/Accounting for batch effects in unbalanced group RNAseq data
1
0
Entering edit mode
2.8 years ago
claysandel • 0

Hello,

First post, extremely new to R. Please forgive my naivete, and any mistakes in making my question clear.

I have RNA seq data from whole blood from patients and healthy subjects, and unfortunately I made the mistake of sequencing the samples before understanding the dramatic problem of batch effects, made even worse with the unbalanced groupings.

Batch 1 total 41 samples= 19 healthy + 22 patients

Batch 2 total 23 samples = 2 healthy + 21 patients

As you can see, there is a very large discrepancy in the number of healthy in each batch
I have tried including batch as a part of the design,

dds <- DESeqDataSetFromMatrix(countData = cts,  colData = coldata2,
                   design= ~Batch + Class) #i have used the term class to differentiate between patient and healthy

This approach yielded a suspiciously large number of DE genes (~20,000 of the ~25,000 total genes measured)

I read a related post suggesting to use

(design = ~Batch * Class)

I tried that and got a more "reasonable" number of DE genes (~3000), but I do not understand how the use of the * operator affects the design. Can anyone explain or link something that can help me understand?

rna-seq batch-effect deseq • 1.8k views
ADD COMMENT
1
Entering edit mode

Batch * Class is a short form of Batch + Class + Batch*Class. It means that there is also an interaction between Batch and Class. These are linear models, and it will help if you check some details on how linear modeling works, eg. http://www.jkarreth.net/files/RPOS517_Day11_Interact.html

ADD REPLY
0
Entering edit mode

Hi Thank you for your response!

I'm reading that tutorial you linked and trying to understand linear modeling as it applies to rna seq.

I'm understanding that if i use

Batch * Class

in the design matrix it is saying there is an interaction. Does it make sense for me to use that approach? what I mean is that obviously there is no biological interaction between batches and classes of subject, the large variation between batches is creating a difference so that could be viewed as an interaction

ADD REPLY
0
Entering edit mode

hello,I recently encountered the same problem as you. How did you finally solve it. As for the "batch * class" you said, I think this Angle is very new. You said that you saw it in a post, could you share the source of this post?

In addition, my problem may be more complicated. My batch1 total 30sample= 30 disease A, batch2 90 total sample=40 healthy, 50 disease B. Do you or other friends have any suggestions? I wanted to compare disease A and disease B but I'm not comparing the two groups directly. What I did was to compare them with the healthy and take their specific differential genes

ADD REPLY
0
Entering edit mode
2.8 years ago
nn ▴ 30

Hi,

I think in a first step you should try to get a feeling for the extent of variance that is due to the (presumed) batch effect. For this, a top level view of the data, i.e. PCA or MDS plot would be of great value. Additionally, hierachical clustering or (dis)similarity analysis. If you've identified the batch to in fact influence your dataset, you could then visually infer the impact of different correction strategies on the data. Using limma::removebatcheffect(), combat (or even sva) you get the corrected datamatrix which you can again plug into your EDA for comparison to the original dataset.

From what you describe, I think a sensible way to construct the design matrix would be to include an intercept and the batch and the covariate of interest such as

design = ~ 0 + batch + class 

then you'd design a contrast matrix as

healthy - diseased

and get your DE genes via a lmFit() followed by a contrast.fit() and eBayes-moderation.

ADD COMMENT
0
Entering edit mode

Thank you for your response! I did do a pca first off when i got both datasets to visualize the data and that's how i became aware of the batch effect. I posted the pca plot below. It is quite large. From this it looks like the vast majority of the variance is due to the batch effect.

When i used DEseq to find DE genes without any effort to compensate for the batch effect I got a large number of genes when i compare healthy v patient. As a test I did DEseq to find DE genes between just batch 1 and batch 2, and virtually all of those DE genes overlapped. This was a simple novice like approach but it told me my intuition is right that the batch effect is a large confounder.

I have two questions about what you suggested I'm very new to linear modeling in general and my present struggles with the rna seq data is my first foray into this world, so please forgive my amateur questions

with the design matrix, what led you to suggest including an intercept?

second, can i still use the DEseq functions to carry out the de analysis with that design matrix, or are the limma functions better suited?

PCA plot

ADD REPLY

Login before adding your answer.

Traffic: 2317 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6