Hello,
I am doing an RNA-seq analysis of virally infected cells. My sample table looks something like this:
Sample | Dose | hours |
---|---|---|
1 | 0 | 24 |
2 | 0.1 | 24 |
3 | 5 | 24 |
1 | 0 | 48 |
2 | 0.1 | 48 |
3 | 5 | 48 |
My question is about the effect of the virus on the analysis itself. In the dose 0 and dose 0.1 samples there are few to no viral transcripts, so almost all of the sequencing depth went to human genes. In the dose 5 samples, by contrast, 50% or more of the reads map to viral genes. I am not interested in the viral genes themselves, only in the effect of the virus on the host cell.
I used STAR to align to the human genome, so most of the transcripts that map should be human, except possibly for some viral sequences with similarity to human genes.
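As an aside on handling the viral reads: one option (not something the setup above requires) is to align to a combined human+viral reference so viral reads are assigned explicitly, then drop the viral rows from the count matrix before differential expression. A minimal pandas sketch, where the `VIRUS_` gene-ID prefix and the sample/gene names are purely illustrative assumptions:

```python
import pandas as pd

# Toy count matrix: rows = genes, columns = samples.
# Gene and sample names are made up; "VIRUS_" marks viral genes
# under an assumed naming convention from a combined reference.
counts = pd.DataFrame(
    {"s1": [120, 80, 0, 0], "s3": [60, 40, 500, 450]},
    index=["ENSG0001", "ENSG0002", "VIRUS_gene1", "VIRUS_gene2"],
)

# Keep host genes only before normalization / DE testing
host_counts = counts[~counts.index.str.startswith("VIRUS_")]
print(host_counts.index.tolist())  # ['ENSG0001', 'ENSG0002']
```

Filtering before normalization matters: if the viral rows stay in the matrix, they contribute to library-size estimates and can dominate the apparent composition of the high-dose samples.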
But because half of the transcripts in the dose 5 samples are viral, I'm worried that the effective difference in sequencing depth on the human genome will cause an artificial difference in differential expression, even after normalizing based on estimated size factors. On a PCA plot, the samples with more viral transcripts (a property which, depending on the viral species, can be independent of the initial viral dose) clearly separate out.
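For intuition on what the size factors should do here, below is a toy simulation of the depth skew: host expression is biologically identical across four samples, but the last two lose half their host depth to viral reads. A median-of-ratios size factor in the style of DESeq2's `estimateSizeFactors` (this is a NumPy sketch, not the actual DESeq2 code) computed on host genes absorbs that depth difference:

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors (counts: genes x samples),
    in the style of DESeq2's default estimator."""
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts.astype(float))
    log_geo_mean = log_counts.mean(axis=1)   # -inf for genes with any zero count
    use = np.isfinite(log_geo_mean)          # such genes are excluded
    ratios = log_counts[use] - log_geo_mean[use, None]
    return np.exp(np.median(ratios, axis=0))

rng = np.random.default_rng(1)
mu = rng.gamma(2.0, 50.0, size=2000)         # true host expression levels
depth = np.array([1.0, 1.0, 0.5, 0.5])       # high-dose samples lose half their
                                             # host depth to viral reads
counts = rng.poisson(mu[:, None] * depth[None, :])

sf = size_factors(counts)
print(sf)  # ratios between samples reflect the ~2x host-depth difference
norm = counts / sf  # normalized counts are comparable across samples again
```

So in this idealized case the size factors fully correct the depth difference; the residual worry is compositional or biological, not raw depth.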
When I run a basic contrast between the dose 5 and dose 0 samples, I get almost 30,000 genes called as differentially expressed. That would be most protein-coding genes, or roughly half of all annotated genes.
How can I be sure the difference is a biological one and not an artifact?