Hello everyone,
For context, I have been evaluating different variant callers (LoFreq, HaplotypeCaller, Strelka2...), comparing their ability to identify SNPs and indels in RNA-seq libraries. The metric I am most interested in is recall. As the reference, I use a set of variants called from paired whole exome sequencing (WES) libraries. I have limited the comparison to regions that are well covered (100x) by both technologies and to variants with a frequency above 5%. For the most part, one would expect to find very similar variants in the RNA-seq and WES datasets, so that most of the variability comes from the pipeline used to call them. However, I have found that some variants, especially indels, are seen in WES but not in RNA-seq, regardless of the software used.
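To make the metric concrete, here is a minimal sketch of the kind of recall calculation described above, split by SNPs and indels. It assumes variants are matched exactly on chromosome, position, ref and alt. The `Variant` type, the helper names and the toy variants are made up for illustration and are not the actual pipeline or data.

```python
from typing import NamedTuple, Set


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def is_indel(v: Variant) -> bool:
    # Treat any ref/alt length difference as an indel (illustrative rule).
    return len(v.ref) != len(v.alt)


def recall(truth: Set[Variant], calls: Set[Variant]) -> float:
    # Fraction of reference (WES) variants that are also called in RNA-seq.
    if not truth:
        return float("nan")
    return len(truth & calls) / len(truth)


# Toy data: assumed to be already restricted to well-covered regions
# and to variants above the frequency threshold.
wes = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "CT", "C"),
    Variant("chr2", 40, "G", "GA"),
    Variant("chr2", 90, "T", "C"),
}
rna = {
    Variant("chr1", 100, "A", "G"),
    Variant("chr2", 90, "T", "C"),
    Variant("chr2", 40, "G", "GAA"),  # same site as WES, different allele
}

snp_recall = recall({v for v in wes if not is_indel(v)}, rna)
indel_recall = recall({v for v in wes if is_indel(v)}, rna)
print(f"SNP recall: {snp_recall:.2f}, indel recall: {indel_recall:.2f}")
```

With exact matching, a site that carries a different indel allele in RNA-seq counts as a miss, which is one way an indel can lower recall regardless of the caller.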
I even checked the BAM files in IGV. I found that in some cases, these mutations are absent in the RNA-seq libraries but present in a good number of the WES libraries (the sample size for WES libraries is 6). In other cases, there is a mutation in both kinds of libraries, but it is not the same mutation. Examples of both situations can be seen in the following screenshots, with the WES library in the bottom track and the paired RNA-seq libraries above.
Have you ever seen something similar? What could be an explanation for this? Any suggestions for what to look for would be extremely helpful. Thanks a lot.
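An indel can produce a non-functional transcript (for example through a frameshift and a premature stop codon), which is then degraded by nonsense-mediated decay. The variant would then show up in WES but be depleted or missing in RNA-seq.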
Expression can be allele-specific too, so the variant allele may be under-represented or absent among the RNA-seq reads.
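One way to check whether either explanation is plausible at a given site is to ask how likely the RNA-seq allele counts would be if both alleles were expressed in proportion to the WES allele fraction. Below is a minimal sketch using an exact binomial tail probability. The read counts are hypothetical, `prob_at_most` is an illustrative helper, and the test ignores mapping bias and overdispersion, so treat a small value as a hint rather than proof.

```python
from math import comb


def prob_at_most(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p): chance of seeing k or fewer alt reads."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Hypothetical counts at one indel site (not real data).
wes_alt, wes_depth = 48, 100   # alt-supporting reads / total reads in WES
rna_alt, rna_depth = 0, 120    # alt-supporting reads / total reads in RNA-seq

# Expected alt fraction in RNA-seq if both alleles were equally expressed
# and equally stable, taken from the WES allele fraction.
expected_fraction = wes_alt / wes_depth

p_value = prob_at_most(rna_alt, rna_depth, expected_fraction)
print(f"P(<= {rna_alt} alt reads out of {rna_depth}) = {p_value:.3g}")
```

A very small probability at a well-covered site suggests the missing variant is not just sampling noise, which is consistent with (but does not distinguish between) nonsense-mediated decay and allele-specific expression.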