Hello,
First post, extremely new to R. Please forgive my naivete, and any mistakes in making my question clear.
I have RNA-seq data from whole blood of patients and healthy subjects. Unfortunately, I sequenced the samples before understanding how serious batch effects can be, and the problem is made even worse by the unbalanced groupings.
Batch 1: 41 samples total = 19 healthy + 22 patients
Batch 2: 23 samples total = 2 healthy + 21 patients
As you can see, there is a very large discrepancy in the number of healthy subjects between the batches.
I have tried including batch as part of the design:
library(DESeq2)
# "Class" distinguishes patient from healthy
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata2,
                              design = ~ Batch + Class)
This approach yielded a suspiciously large number of DE genes (~20,000 of the ~25,000 total genes measured).
I read a related post suggesting to use
(design = ~Batch * Class)
I tried that and got a more "reasonable" number of DE genes (~3000), but I do not understand how the use of the * operator affects the design. Can anyone explain, or link something that can help me understand?
Batch * Class is a short form of Batch + Class + Batch:Class. It means that, besides the two main effects, there is also an interaction term between Batch and Class. These are linear models, and it will help if you check some details on how linear modeling works, e.g. http://www.jkarreth.net/files/RPOS517_Day11_Interact.html
Hi, thank you for your response!
I'm reading the tutorial you linked and trying to understand linear modeling as it applies to RNA-seq.
I understand that if I use Batch * Class in the design, it is saying there is an interaction. Does it make sense for me to use that approach? What I mean is that there is obviously no biological interaction between batches and classes of subject, but the large variation between batches is creating a difference, so that could be viewed as an interaction.
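To make the formula expansion discussed above concrete, here is a minimal base-R sketch. It rebuilds a toy sample sheet from the sample counts given in the question and compares the design matrices produced by the different formulas. The object names (coldata, mm_star, and so on) and the level labels "healthy" and "patient" are assumptions made for illustration, not taken from the original data, and no DESeq2 call is involved.

```r
# Toy sample sheet mirroring the counts in the question (hypothetical names):
# Batch 1 = 19 healthy + 22 patients, Batch 2 = 2 healthy + 21 patients.
coldata <- data.frame(
  Batch = factor(rep(c("1", "2"), times = c(41, 23))),
  Class = factor(
    c(rep(c("healthy", "patient"), times = c(19, 22)),
      rep(c("healthy", "patient"), times = c(2, 21))),
    levels = c("healthy", "patient")  # healthy is the reference level
  )
)

# Number of samples in each Batch x Class cell
cell_counts <- table(coldata$Batch, coldata$Class)
print(cell_counts)

# Additive design: one batch coefficient and one class coefficient
mm_add <- model.matrix(~ Batch + Class, data = coldata)

# Interaction design, written with * and written out in full with :
mm_star <- model.matrix(~ Batch * Class, data = coldata)
mm_long <- model.matrix(~ Batch + Class + Batch:Class, data = coldata)

print(colnames(mm_add))
print(colnames(mm_star))

# The two ways of writing the interaction design give the same columns
print(all(mm_star == mm_long))
```

The extra column in the interaction design is 1 only for patients in batch 2, so its coefficient lets the patient versus healthy difference differ between batches. With only 2 healthy samples in batch 2, that batch-specific difference rests on very few samples.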
Hello, I recently encountered the same problem as you. How did you finally solve it? As for the "Batch * Class" design you mentioned, this angle is very new to me. You said that you saw it in a post; could you share the source?
In addition, my problem may be more complicated. My batch 1 has 30 samples in total (30 disease A), and batch 2 has 90 samples in total (40 healthy, 50 disease B). Do you or others have any suggestions? I want to compare disease A and disease B, but I am not comparing the two groups directly. What I did was compare each of them with the healthy group and take the differential genes specific to each.
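A quick way to check whether batch can be adjusted for at all in a layout like this is to look at the rank of the design matrix. The base-R sketch below uses the sample counts from the post above. The column names Batch and Condition and the level labels are hypothetical, chosen only for illustration.

```r
# Toy sample sheet for the second layout (hypothetical names):
# batch 1 = 30 disease A; batch 2 = 40 healthy + 50 disease B.
coldata3 <- data.frame(
  Batch = factor(rep(c("1", "2"), times = c(30, 90))),
  Condition = factor(
    rep(c("diseaseA", "healthy", "diseaseB"), times = c(30, 40, 50)),
    levels = c("healthy", "diseaseA", "diseaseB")
  )
)

# Disease A appears only in batch 1, and batch 1 contains nothing else
print(table(coldata3$Batch, coldata3$Condition))

# Design matrix with both batch and condition
mm <- model.matrix(~ Batch + Condition, data = coldata3)

n_coef  <- ncol(mm)     # number of coefficients the model asks for
rank_mm <- qr(mm)$rank  # number that can actually be estimated

print(c(coefficients = n_coef, rank = rank_mm))
```

Here the rank is lower than the number of coefficients, because the batch 2 indicator and the disease A indicator always sum to the intercept column. In other words, batch and disease A are completely confounded: any difference between disease A and the other groups could equally be a batch difference, and no design formula can separate the two. DESeq2 generally refuses a design whose model matrix is not full rank.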