Normalization of virally infected human cells
0
0
Entering edit mode
5.0 years ago
rrdavis ▴ 60

Hi,

I have been given some RNA-seq data from a human cell line infected with a virus that can be activated. Upon activation, we see a large drop of reads mapping to the human genome (it goes from from ~75% to %40 upon activation). There is a large shift from human to virus expression upon activation. We decided to map the data to a combined genome (human + virus) and get the counts from both.

I am stuck deciding how to do the DE analysis. We are looking at the DE in host and viral genome. We have triplicates of each condition. Since we have a global shift in expression from host to virus what is the best way to normalize these? My initial thought was to import the counts into DESeq2 following the recommendation from Devon Ryan's response to this Biostars question

Would we be violating the assumption that there is no global change?

There is another question similar to this one addressed here by Carlo Yague. It looks like the best normalization would be using spike-ins in this case. Unfortunately, this data does not have spike-in data.

What is the best way to proceed? What are your thoughts?

Thanks!

RNA-Seq • 1.0k views
ADD COMMENT
0
Entering edit mode

we see a large drop of reads mapping to the human genome

Is this because you are keeping total number of sequenced read similar for all samples or % of viral reads is significantly going up?

ADD REPLY
0
Entering edit mode

Every sample was sequenced to the same depth. The drop I refer to is because the % of viral reads goes significantly up!

ADD REPLY
0
Entering edit mode

I would probably map against the combined genomes but then only use the expression values from the host to plug it into DESeq2. Given that the counts are still high enough that you do not see a dropout of lowly-expressed genes I think you should be able to use the default normalization. You can explore this using MA-plots where (if everything goes right) you should see the bulk of datapoints centered at zero. I think what you have is a global change in RNA composition of the library (because the virus is flooding it with transcripts) but not a global expression change in the host (the latter actually being the relevant assumption for DESeq2 from what I understand). If the latter is true and you do not see a global change in host expression the host-only expression values should still obey the underlying assumption.

ADD REPLY
0
Entering edit mode

Thank you for your response. I will generate the MA-plots in DESeq2 for the host data and attach those here. If the MA-plots look typical, how should I proceed with the viral DGE? There are close to 100 genes in the viral genome. If the host data (MA plots look normal), Should I import the viral + host counts and perform the DE on both?

ADD REPLY
0
Entering edit mode

Is it meaningful to do DGE on viral genes which are entirely absent/inactive in one condition?

ADD REPLY
0
Entering edit mode

True. I see little counts before activation so like you said before and after activation differences on the viral side might not make sense. In an effort to keep things simple, I failed to say that I have one three conditions: one before activation and 2 conditions post activation ( these two conditions are treated with two different drugs)

ADD REPLY
0
Entering edit mode

A statistician would probably whip me for saying this but imho statistics are good (and of course necessary) to make decisions when data are not bimodal and by this easy to separate, but probably not necessary here for the viral genes. If you show a plot of your viral counts (non-activated) and this is always like zero and after induction it is skyrocketing, I do not see why one would need fancy statistics to make a claim that the virus is active. Depends of course on how your data look but if you have a clear and strong separation between the counts between the induction states maybe you can even skip differential analysis and only perform DEG for the host.

ADD REPLY
0
Entering edit mode

Here is a summary of the results table for one of the comparisons between activated and non-activated on read counts from human only.

 out of 27788 with nonzero total read count
    adjusted p-value < 0.05
    LFC > 0 (up)       : 4519, 16%
    LFC < 0 (down)     : 3800, 14%
    outliers [1]       : 0, 0%
    low counts [2]     : 8589, 31%
    (mean count < 1)
    [1] see 'cooksCutoff' argument of ?results
    [2] see 'independentFiltering' argument of ?results

Here is the MA plot MA plot

While there are more changes than I am used too, @ATpoint is correct that this is more of a change in composition of the RNA library than global expression change.

ADD REPLY

Login before adding your answer.

Traffic: 1072 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6