I've been looking into a SARS-CoV-2 dataset, GSE152075. To put simply, they took RNA from infection positive and infection negative people. My aim is to look at the gene expression of a dozen genes I'm interested in to see if it is different based on infection status.
To find the DEG between positive vs negative using DESeq2, the authors used the design, sequencing_batch + infection_status. See summary below for batch numbers and infection status splits.
Batches; A: 19, B: 15, C: 17, D: 14, E: 23, F: 34, G: 16, H: 21, I: 15, J: 26, K: 10, L: 43, M: 19, N: 13, O: 28, P: 54, Q: 57, R: 14, S: 20, T: 3, U: 16
Infection status; negative = 54, positive = 413
Using these parameters, the vst normalized PCA looks like this (below):
I don't particularly see any significant clustering of the groups except for blue/purple being somewhat separated from pink on PC1 and brown colors (negative infection) from the main group (positive infection) on PC2. Is the correct interpretation that that the batch variable is very weakly linked to variance in the dataset?
Intuitively, to me having 21 batches that are uneven is probably going to make finding any differences significantly harder. However, it is also intuitive that batch effects should be considered. If my interpretation is true, is this a justification for me to exclude sequencing batch from the design?
Or should I be using the LRT function to check the difference in adj. pvals between a design of 'sequence_batch + infection_status' vs 'infection_status' for my genes of interest and deciding based on that?
Thanks!
Thanks for taking the time to answer. This raises several new questions but I don't want to deviate too much from the original question so I'll break this up into two sections:
Original topic:
As far as I can tell, in the series matrix they provided, it appears that all samples were sequenced on the 75bp machine even though as you said, in their methods they wrote they used two different ones. Indeed, age and sex likely factor into the differences in gene expression but to stay on topic, let's say as a hypothetical, there are only batch and infection status considered here.
My question is more in the interpretation of the PCA plot.
1) If you saw such a result, would you interpret it as there is not a very clear batch effect? Or put another way, the primary source of variation on PC1 is not batch related. As for PC2, the only somewhat clear separation can only been seen in batches that are exclusively associated with negative infected samples, thus any batch-related effects are actually infection status related effects.
2) And if indeed, there is no clear batch effect, does this mean that there is really not much benefit to adding it to the design model? Particularly given that negative infected samples are exclusively in only a subset of batches without any corresponding positive infected samples? Should batch ALWAYS be corrected for regardless of whether or not there is batch clustering?
3) As for checking adj. pvals I was referring to the comparison between full and reduced models. From the DESeq2 vignette: "The LRT examines two models for the counts, a full model with a certain number of terms and a reduced model, in which some of the terms of the full model are removed. The test determines if the increased likelihood of the data using the extra terms in the full model is more than expected if those extra terms are truly zero." So if a full model design of: ~batch + infection_status vs a reduced model design of: ~infection_status produces p = 1e-10 for Gene X, I think that it is indicating that Gene X is different by infection status accounting for batch. Thus, batch appears to be important and the full model should be used.
Extended topic:
As a clarification, for infection time course, we can ignore that as it refers to some other data they have. For GSE152075, they do not have this data. They do have viral load as approximated by qPCR CT values of the N1 protein from the virus, however, I wouldn't necessarily say a higher viral load = further along in infection as for example, living with an infected roommate would be different to being infected by a random passer-by.
They did indeed look into age and sex DEG as well but they did it in 3 separate analyses. That is to say, they used a design of:
For infected vs non-infected DEG
Design 1: batch + infection status
For older vs younger DEG
Design 2: age(categorical 60+ or less) + infection_status + age:infection status
For male vs female DEG
Design 3: infection status + sex + infection_status:sex
I think I agree with you that even in the infected vs non-infected DEG analysis that the ideal initial design should probably have been something like: batch + age + sex + viral_load + infection status + age:infection status
These are the factors that intuitively should affect gene expression.
However, with my perhaps naive logic, I think there is also justification to check first if there is a difference in age or gender between infected and non-infected populations. If there is not much difference, a simpler design model such as design 1 would be sufficient. Downstream of DESeq2, plotting of the transformed counts by viral load grouping would then give you an idea of if any of the DEG genes are changed by viral load.
All batches (except batch C) fail to include both positive and negative samples. So it is not really possible to distinguish batch effect from virus effect anyway.
Thanks Carlo, I thought this was the case too.
Your welcome, always happy to discuss :)
1) From this plot, I personally would think, that PC1 is indeed batch related, because positives from batch pink are far right and positives from batch blue seem to concentrate to the left (on a PC1 axis that describes 37% of the variance which is not low). I agree that PC2 (11%) seems to separate more between positives and negatives.
2) Let me cite the developer Mr Love on this from a question on the Bioconductor support. “Putting batch in the design formula is our recommendation when batches are known and when they are not confounded with the condition.”. So I would take this as, yes always include the batch effect in the design, unless it gives the same sample separation as the condition (so all infected are batch1 and all uninfected are batch2). Then there is no additional benefit.
3) Ok, I never worked with the LRT, but I think you are right with what the test means. I do not know at what number of genes with adj. pvalue lower e.g. 0.05 you would say, you have to include e.g. the factor sex in your design. I would not use it on the batch effect though, since this is a technical factor, because you might have different primer batches etc.
Extendet:
Yes you are right, I was mistaken. Viral load is probably also has a greater impact on gene expression than time course of the infection (at least that’s what we saw with HIV in primary cells, where the viral load upon infection did not increase uniformly in all samples, but was phased).
Concering their setting of factors in the design for the different analysis, I would have done it like you said with all factors included and just other interaction terms at the end. That way you account for every variance inducing factor in every comparison, which would be more correct I think. They also only include the batch effect in one of their comparisons.
Yeah, if I understood correctly, you can test with the LRT if you should include sex or viral load in the design for the Wald test. But again, I never worked with it, so better read further into that.
Thanks caggtaagtat, I think to summarize:
Original topic
1) Should we include batch in the design based on this PCA?
No. Well in fact, the PCA doesn't even matter in this case because the negative infected samples are almost all in their own separate batches compared to positives. Thus, you can't delineate batch effects from virus infection effects.
2) But what about the separation we see between blue/purple batches and pinky colors on PC1?
There is batch-related variation on PC1 but is it significant enough to warrant its inclusion at the great cost to power? In my opinion, most are centrally clustered at 0. Blue is more left clustered and only a few pink are right clustered. Perhaps a good compromise would be to take only the samples clustered around -25 to 25 and exclude batch from design.
3) Should we base our decision on LRT comparisons of full and reduced models?
Based on the answer to (1), in this case no.
Extended topic
I actually meant for example just a simple boxplot for y = age and x = infection status and doing a t-test. If no difference, it suggests age does not impact on infection status and thus is not necessary to add to design?