Has anyone tried calling variants from RNA-seq data and comparing those with WGS/Exome sequencing variant calls in coding regions?
I was curious to know if the same variant callers can be used on RNA-seq alignments (say, TopHat alignments).
Also, are there tools that can predict RNA-editing or similar events?
If you're interested in inferring RNA-editing from RNA-seq, be sure to read the responses to the recent Li et al. paper in Science on that topic, along with the commentary on the Genomes Unzipped blog.
From Broad Institute on their RNA-seq variant calling pipeline:
Finally, we know that the current recommended pipeline is producing
both false positives (wrong variant calls) and false negatives (missed
variants) errors. While some of those errors are inevitable in any
pipeline, others are errors that we can and will address in future
versions of the pipeline.
Another benchmark, from studies that came out since my colleagues posted their answers (below):
The situation appears even more alarming when one reads anecdotal and
published evidence of people who have compared RNA-seq variant calls
to whole exome seq (WES) variant calls. Scattered across the WWW, I've
seen that RNA-seq variant calling can only detect between ~30% and
~70% of the variant calls that WES detects, and I assume that these
people have obviously filtered the WES data to only include variants
in exons in their comparisons.
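For anyone who wants to compute that kind of recovery figure on their own data, here is a minimal Python sketch. It assumes both call sets are plain-text VCF lines already restricted to the same exonic regions, and it matches variants by exact chromosome, position, REF and ALT, with no normalization of indel representation. The function names are made up for illustration and do not come from any published pipeline.

```python
def parse_vcf_lines(lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines.

    Multi-allelic records are split so each ALT allele is its own key.
    """
    keys = set()
    for line in lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alts = fields[:5]
        for alt in alts.split(","):
            keys.add((chrom, int(pos), ref.upper(), alt.upper()))
    return keys


def recovered_fraction(truth, test):
    """Fraction of the truth set (e.g. WES calls) also present in the test set."""
    if not truth:
        return float("nan")
    return len(truth & test) / len(truth)
```

With real files you would pass open file handles, e.g. `recovered_fraction(parse_vcf_lines(open(wes_path)), parse_vcf_lines(open(rna_path)))`, where both paths are placeholders. Exact matching will undercount indels that the two callers represent differently, so treat the number as a rough lower bound.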
You should check out the SNVMix papers here and here. They developed and used their method on RNA-seq tumor data and compared it to a "ground truth" of genotype arrays and WGS. They also showed their approach could identify RNA-editing events. They also have a follow-up method for matched tumor-normal samples called JointSNVMix, although I think the latter was developed more for exome-seq.
We've done quite a bit of variant calling from mRNA-Seq data for EMS mutant identification, but we haven't compared it with WGS/exome yet. We used BWA for mapping, and both samtools and the GATK pipeline for variant calling. Both yielded pretty consistent results. One thing that turned out to be very important for our purpose, i.e. detecting high-quality SNPs in the coding regions, is that you have to trim aggressively to remove bases of bad quality, even at the cost of losing coverage in some areas. With really stringent quality trimming, we've successfully identified several mutant alleles that can be verified by Sanger sequencing or restriction enzyme genotyping.
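To make "trim aggressively" concrete, here is a minimal Python sketch of stringent 3' quality trimming. It is only a stand-in, not the trimmer used above: it assumes Phred+33 quality strings, and the `min_q` and `min_len` cutoffs are illustrative values rather than recommendations.

```python
def trim_3prime(seq, qual, min_q=30, min_len=20, offset=33):
    """Trim low-quality bases from the 3' end of a read.

    Bases are removed from the end until one with Phred quality >= min_q
    is reached. Returns (seq, qual) for the trimmed read, or None when the
    remainder is shorter than min_len and the read should be discarded.
    """
    scores = [ord(c) - offset for c in qual]
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]
```

Dropping reads that end up too short is what costs coverage in some regions, which is the trade-off described above. Dedicated trimmers typically use sliding windows or running sums rather than this simple end scan.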
Thanks, we are following a similar approach as well, so it's good to know we are not alone :). What reference did you use for the bwa alignments? A custom transcriptome built from known transcripts (>150,000?), or some trick to get spliced alignments out of bwa?
We've been using predicted CDSs as references. Because our system is not highly annotated, we've ignored alternative transcription for the moment. I tried genome mapping too and got very similar results, since you only lose the 5% of reads that span junctions.
What would you call consistent results? We routinely see an over-representation of false positives near splice junctions in RNA-seq SNV calls, and that is when comparing against DNA-seq data and making sure each variant has enough read coverage for a confident SNV call.
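One way to look at this in your own calls is to flag SNVs that sit within a few bases of an annotated junction and inspect them separately. Below is a minimal Python sketch. It assumes junction coordinates are available as a sorted list of positions per chromosome, and the 5 bp default window is an arbitrary illustrative choice. The function and parameter names are hypothetical.

```python
from bisect import bisect_left


def near_junction(pos, junctions, window=5):
    """True if pos lies within `window` bp of any position in the sorted list."""
    i = bisect_left(junctions, pos - window)
    return i < len(junctions) and junctions[i] <= pos + window


def split_by_junction_proximity(variants, junctions_by_chrom, window=5):
    """Split (chrom, pos) variants into (kept, flagged) lists.

    junctions_by_chrom maps chromosome name -> sorted list of junction positions.
    """
    kept, flagged = [], []
    for chrom, pos in variants:
        junctions = junctions_by_chrom.get(chrom, [])
        if near_junction(pos, junctions, window):
            flagged.append((chrom, pos))
        else:
            kept.append((chrom, pos))
    return kept, flagged
```

Comparing the validation rate of the flagged calls against the kept calls would show whether junction-proximal calls really are enriched for false positives in a given dataset.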
Well, in our system, or more properly at our evolutionary scale, CNVs are very rare; SNPs and indels make up the vast majority of variants. I have no experience with CNVs.
I don't believe any of us should be encouraging variant calling from RNA-seq. Here is why:
A: Inferring genotype based on RNA sequences