Question

How to detect mutations from viral RNA-seq data

6.7 years ago

babasaraki ▴ 50

Hi All,

I have established a persistent infection (PI) in cancer cells using NDV, and we then sequenced these PI cells, which carry the NDV genome. Since the viral genome is present in the PI cells, is it possible to detect mutations specifically in the viral genome from their RNA-Seq data? Any ideas on how to approach this? Thank you for your suggestions.

RNA-Seq sequencing SNP R • 3.3k views

ADD COMMENT • link 6.7 years ago by babasaraki ▴ 50

This is similar to a dual RNA-seq experimental setup. Let me first clarify a couple of "prerequisites":

Are you sure that viral RNA is present in your RNA-seq data? If, for example, you did poly-A selection and your viral transcripts are not polyadenylated, it might not be possible to capture the viral reads.
Do you have any replicates?
What is the sequencing depth of your samples? Roughly how many reads do you have?

ADD REPLY • link 6.7 years ago by grant.hovhannisyan ★ 2.6k

Hi Sir. Thank you for your questions and for offering to help.

Yes, I am confident that the RNA-seq data contain the viral genome: after developing the PI cells, I ran RT-PCR to confirm the presence of the NP, HN and F genes of the virus before submitting the RNA for sequencing.
Yes, I ran triplicates for each sample, including the control (non-PI) cells.
The sequencing depth is 30M reads.

ADD REPLY • link updated 6.7 years ago by Ram 45k • written 6.7 years ago by babasaraki ▴ 50

Please see my answer below.

ADD REPLY • link 6.7 years ago by grant.hovhannisyan ★ 2.6k

My understanding is that TCGA was able to use RNA-seq to detect HPV infection in their papers (most likely the head and neck cancer paper), so I guess it is possible to detect a viral genome from RNA-seq data. If the NDV genome is well conserved, you might as well first align the reads to the NCBI viral reference and see if you hit anything. If the NDV genome isn't too long, you can just scroll through the alignments in IGV and see if there are any obvious variants.

ADD REPLY • link 6.7 years ago by btsui ▴ 300

Thank you, Sir, for adding to this discussion. Could you please share the paper in which they used RNA-seq data to detect HPV infections? Thank you once again.

ADD REPLY • link 6.7 years ago by babasaraki ▴ 50

Hi Dr. Hovhannisyan. Thank you for this thorough, stepwise pipeline for the problem I described above. I really appreciate your time, contribution and commitment.

I am actually very new to NGS data analysis, although I have attended quite a number of workshops and seminars on NGS, particularly RNA-seq, and am currently enrolled in BioStar online classes to learn more. I understand all the steps you mentioned here and will try to do as you recommended. I will get back to you if I get stuck along the way.

Indeed, this will be very helpful to many. I copied your guide into MS Word and saved it on my system as reference material. Glad to have you here on this platform.

Best regards

ADD REPLY • link 6.7 years ago by babasaraki ▴ 50

score 2 · Accepted Answer · 2018-08-22

I don't know if you have expertise in NGS data analysis, so I will just briefly describe the steps of the pipeline. Feel free to ask for details if you need them. You can detect mutations (SNVs and indels) using the following steps:

1. Map the reads against the reference genomes.

Since your viral and human reads are in the same file, you will need to separate them. To do this, concatenate the corresponding reference genomes (human + viral) and map your reads against this concatenated reference. Since the viral and human genomes are (obviously) very different, each read will map to its genome of origin, which separates the human and viral reads. Suppose you have done this step and obtained the BAM files.
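As a minimal sketch of the concatenation step, assuming both references are plain (uncompressed) FASTA files: the function names and the `human_` / `NDV_` prefixes below are illustrative and not part of the original answer. Prefixing the sequence names makes it easy to tell later which genome a read mapped to.

```python
def prefix_fasta_headers(lines, prefix):
    """Yield FASTA lines with `prefix` added to every sequence name."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            yield ">" + prefix + line[1:]
        else:
            yield line


def concatenate_references(sources, out_path):
    """Write several FASTA files into one combined reference.

    `sources` is a list of (fasta_path, prefix) pairs.
    Returns the sequence names written, in order.
    """
    names = []
    with open(out_path, "w") as out:
        for fasta_path, prefix in sources:
            with open(fasta_path) as handle:
                for line in prefix_fasta_headers(handle, prefix):
                    if line.startswith(">"):
                        fields = line[1:].split()
                        names.append(fields[0] if fields else "")
                    out.write(line + "\n")
    return names
```

For example, `concatenate_references([("human.fa", "human_"), ("ndv.fa", "NDV_")], "combined.fa")` (hypothetical file names) would produce a single reference to index with the aligner of your choice.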

From here on you can do 2 things:

2a. As mentioned in the comments, if the viral genome is very small, you can use IGV to visualize the BAM files and manually inspect the viral alignments for mutations. If you see that there are a lot of them, or the viral genome is large and manual checks are not feasible, proceed with the next step.

2b. In this step you call variants from your BAM files using GATK, which has best-practices documentation for RNA-seq data. The easiest way is to perform variant calling on the entire BAM files, which will give you information about human variants as well. If you really need to analyse only the viral data, you will need to subset your BAM files.
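One possible way to subset a BAM file to the viral contig is `samtools view` with a region argument, which needs a coordinate-sorted and indexed BAM. The sketch below only builds (and optionally runs) that command. It assumes `samtools` is installed and on the PATH; the contig name and file names are placeholders, not values from this thread.

```python
import subprocess


def viral_subset_command(bam_path, viral_contig, out_path):
    """Build a samtools command that keeps only reads aligned to one contig."""
    # -b writes BAM output, -o names the output file; the trailing
    # argument restricts the output to reads overlapping that contig.
    return ["samtools", "view", "-b", "-o", out_path, bam_path, viral_contig]


def subset_bam(bam_path, viral_contig, out_path):
    """Run the command; raises CalledProcessError if samtools fails."""
    subprocess.run(viral_subset_command(bam_path, viral_contig, out_path), check=True)
    return out_path
```

The contig name must match the sequence name in the concatenated reference exactly, for example `subset_bam("sample.bam", "NDV", "sample.viral.bam")` with hypothetical names.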

After applying GATK, you will get a VCF file, containing the variants.
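If variants were called on the whole BAM file, the viral records can also be pulled out of the VCF afterwards. This is a rough sketch using only the standard library, assuming an uncompressed VCF and a viral contig named `NDV` (a placeholder); it reads just the CHROM, POS, REF and ALT columns and ignores genotype fields.

```python
def viral_variants(vcf_lines, viral_contig):
    """Yield (position, ref, alt, kind) for each variant allele on one contig."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        if chrom != viral_contig:
            continue
        for alt in alts.split(","):
            if alt in (".", "*") or alt.startswith("<"):
                continue  # skip missing, spanning-deletion and symbolic alleles
            if len(ref) == 1 and len(alt) == 1:
                kind = "SNV"
            elif len(ref) != len(alt):
                kind = "indel"
            else:
                kind = "MNV"  # same-length, multi-base substitution
            yield pos, ref, alt, kind
```

With a file on disk this could be used as `with open("sample.vcf") as fh: hits = list(viral_variants(fh, "NDV"))`, where the file name is again hypothetical.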

NOTE: You mentioned there are 30 million reads per sample. With this sequencing depth you can "recover" highly and moderately expressed human genes, but lowly expressed genes probably would not get many reads. I don't know to what extent the viral genome is expressed, so after step 1 just check how many reads mapped to the viral genes. If the numbers are very low, it wouldn't make much sense to continue the analysis.
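The read-count check in the note above could be sketched as follows, assuming the per-contig counts come from `samtools idxstats` (tab-separated: name, length, mapped, unmapped). The contig name `NDV` and the read length are placeholders, and the mean depth is only a rough estimate (mapped reads × read length / contig length) that ignores clipping and uneven coverage.

```python
def viral_read_summary(idxstats_text, viral_contig, read_length):
    """Summarise how many mapped reads landed on the viral contig."""
    viral_reads = 0
    viral_length = 0
    total_mapped = 0
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, _unmapped = line.split("\t")
        total_mapped += int(mapped)
        if name == viral_contig:
            viral_reads = int(mapped)
            viral_length = int(length)
    fraction = viral_reads / total_mapped if total_mapped else 0.0
    mean_depth = viral_reads * read_length / viral_length if viral_length else 0.0
    return {
        "viral_reads": viral_reads,
        "fraction_of_mapped": fraction,
        "approx_mean_depth": mean_depth,
    }
```

The input text can be captured from `samtools idxstats sample.bam` run on an indexed BAM file. Whether a given depth is "enough" for variant calling is a judgment call this sketch does not make.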

Hope this helps