Hello,
I am doing an RNA-seq analysis of virally infected cells. My sample table looks something like this:
Sample | Dose | hours |
---|---|---|
1 | 0 | 24 |
2 | 0.1 | 24 |
3 | 5 | 24 |
1 | 0 | 48 |
2 | 0.1 | 48 |
3 | 5 | 48 |
My question is about the effect of the virus on the analysis itself. In the dose 0 and dose 0.1 samples there are few to no viral transcripts, so almost all of the sequencing depth went to human genes. In the dose 5 samples, by contrast, 50% or more of the reads map to viral genes. I am not interested in the viral genes themselves, only in the effect of the virus on the host cell.
I used STAR to align to the human genome, so most of the transcripts that map should be human, except possibly for some viral sequences with similarity to human genes.
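As an aside on handling the viral reads: one option (not something the setup above requires) is to align to a combined human+viral reference so viral reads are assigned explicitly, then drop the viral rows from the count matrix before differential expression. A minimal pandas sketch, where the `VIRUS_` gene-ID prefix and the sample/gene names are purely illustrative assumptions:

```python
import pandas as pd

# Toy count matrix: rows = genes, columns = samples.
# Gene and sample names are made up; "VIRUS_" marks viral genes
# under an assumed naming convention from a combined reference.
counts = pd.DataFrame(
    {"s1": [120, 80, 0, 0], "s3": [60, 40, 500, 450]},
    index=["ENSG0001", "ENSG0002", "VIRUS_gene1", "VIRUS_gene2"],
)

# Keep host genes only before normalization / DE testing
host_counts = counts[~counts.index.str.startswith("VIRUS_")]
print(host_counts.index.tolist())  # ['ENSG0001', 'ENSG0002']
```

Filtering before normalization matters: if the viral rows stay in the matrix, they contribute to library-size estimates and can dominate the apparent composition of the high-dose samples.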
But because half of the transcripts in the dose 5 samples are viral, I'm worried that the effective difference in sequencing depth on the human genome will cause an artificial difference in differential expression, even after normalizing based on estimated size factors. On a PCA plot, the samples with more viral transcripts (a property which, depending on the viral species, can be independent of the initial viral dose) clearly separate out.
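For intuition on what the size factors should do here, below is a toy simulation of the depth skew: host expression is biologically identical across four samples, but the last two lose half their host depth to viral reads. A median-of-ratios size factor in the style of DESeq2's `estimateSizeFactors` (this is a NumPy sketch, not the actual DESeq2 code) computed on host genes absorbs that depth difference:

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors (counts: genes x samples),
    in the style of DESeq2's default estimator."""
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts.astype(float))
    log_geo_mean = log_counts.mean(axis=1)   # -inf for genes with any zero count
    use = np.isfinite(log_geo_mean)          # such genes are excluded
    ratios = log_counts[use] - log_geo_mean[use, None]
    return np.exp(np.median(ratios, axis=0))

rng = np.random.default_rng(1)
mu = rng.gamma(2.0, 50.0, size=2000)         # true host expression levels
depth = np.array([1.0, 1.0, 0.5, 0.5])       # high-dose samples lose half their
                                             # host depth to viral reads
counts = rng.poisson(mu[:, None] * depth[None, :])

sf = size_factors(counts)
print(sf)  # ratios between samples reflect the ~2x host-depth difference
norm = counts / sf  # normalized counts are comparable across samples again
```

So in this idealized case the size factors fully correct the depth difference; the residual worry is compositional or biological, not raw depth.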
When I run a basic contrast between the dose 5 and dose 0 samples, I get almost 30,000 genes called as differentially expressed. That would be most protein-coding genes, or roughly half of all annotated genes.
How can I be sure the difference is a biological one and not an artifact?