8.3 years ago
ebrahimiet
Hi all,
I am performing de novo genome assembly of NDV (Newcastle disease virus) from paired-end 150 bp Illumina reads. NDV has a negative-stranded, linear RNA genome of about 15 kb. When I compare the final contig with the full genome deposited at NCBI, I see that the beginning (leader/promoter) and the end of the RNA genome are not present in the final de novo assembled contig. How can I enrich the contigs for the beginning and end of the viral genome?
many thanks
Esmaeil
RNA-Seq tends to have poor performance in the ends of viral genomes (for a range of reasons). There's nothing you can do to enrich for reads in those regions because they don't exist.
If you want the complete ends you'll have to use RACE or something.
@joe: Don't want to hijack this thread but do you have/know of references that show the poor performance of viral RNAseq?
It's not poor performance overall; it works very well. The genomic termini just tend to be a pain in the ass, and you usually have to go after them with RACE or something. This may not apply to all types of viruses, but it does seem to be the case for (+/-)ssRNA viruses (except maybe Deltaviruses). The explanation I've always gotten is that the typically large amount of secondary/tertiary structure in these regions leads to issues during cDNA generation.
I've certainly seen it in every +/-ssRNA virus we've sequenced. I can get 10000-150000x coverage inside the genome, but when I get to the 5' or 3' ends the coverage drops rapidly and I'm usually missing the first/last 50-200bp. It does seem to be a function of depth: the deeper the coverage, the more the termini tend to be covered.
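To put a number on the terminal drop-off described above, here is a minimal sketch that counts how many bases at each end of a reference fall below a depth threshold. It assumes per-base depth in the three-column, tab-separated form written by `samtools depth -a` (chrom, pos, depth) for a single reference sequence; the function names and the `min_depth` cutoff of 10 are my own illustrative choices, not anything from the posts above.

```python
from typing import Iterable, List, Tuple


def parse_depth(lines: Iterable[str]) -> List[int]:
    """Parse chrom<TAB>pos<TAB>depth lines into a list of depths.

    Assumes one reference sequence, with every position present
    (i.e. the file was produced with `samtools depth -a`).
    """
    depths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _chrom, _pos, depth = line.split("\t")[:3]
        depths.append(int(depth))
    return depths


def uncovered_termini(depths: List[int], min_depth: int = 10) -> Tuple[int, int]:
    """Return (left, right): bases at each end with depth below min_depth.

    "left" and "right" are in reference coordinates; which one is the
    5' or 3' genomic end depends on how the reference is oriented.
    """
    n = len(depths)
    left = 0
    while left < n and depths[left] < min_depth:
        left += 1
    right = 0
    # Stop before crossing into the region already counted from the left.
    while right < n - left and depths[n - 1 - right] < min_depth:
        right += 1
    return left, right


if __name__ == "__main__":
    # Hypothetical toy data, not real coverage values.
    toy = ["ref\t%d\t%d" % (i + 1, d)
           for i, d in enumerate([0, 0, 3, 50, 1000, 1000, 40, 2, 0])]
    print(uncovered_termini(parse_depth(toy)))
```

Running it on mappings of the reads against the NCBI reference at a few different thresholds would show whether the termini are truly absent from the data or just too thinly covered for the assembler to extend into.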
There's not much on why, but genome sequences aren't usually considered complete without RACE to obtain the termini.
http://mbio.asm.org/content/5/3/e01360-14.full
http://msystems.asm.org/content/1/3/e00039-15
http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-444
http://www.sciencedirect.com/science/article/pii/S0042682210002321
I am currently working with someone trying to define the ends of some transcripts for a virus (not at the beginning/end of the genome), and the RNAseq data has been partially inconclusive (there is no smoking-gun 5' start, though things are better at the 3' end). Viruses are so gene-rich that it is difficult to tease transcripts apart. I suspect that something like RACE may have to be done to nail the starts down, since RNAseq alone does not seem to cut it.
Thanks for the papers and your answer.
It doesn't surprise me that the viral transcripts are giving you problems. If your virus polyadenylates its mRNAs, you might have an easier time doing RACE for the 3' end.