Calling variants from RNA-Seq data using HaplotypeCaller
8.1 years ago
CY ▴ 750

My understanding is that HaplotypeCaller was specifically designed to call variants from DNA-Seq data. However, allele frequencies in RNA-Seq data are different from (and more complicated than) those in DNA-Seq data. So my questions are:

  1. How come HaplotypeCaller is still used in RNA-Seq variant calling pipelines?
  2. Considering the complications in RNA-Seq allele frequencies (genomic imprinting or allelic imbalance), how can a variant caller confidently call a variant as either homozygous or heterozygous?
  3. If a tumor sample is used, the allele frequencies will be even more complicated (purity, heterogeneity and structural variants). No currently available tool can make a confident call, right?

Any comments on these questions are really appreciated!

RNA-Seq variant call GATK • 5.5k views

Isn't the third point (regarding purity, heterogeneity and structural variants) equally applicable to DNA-seq and RNA-seq?


Yes, it is. My point is that too many complications are involved to accurately call variants from RNA-Seq of a tumor sample.


Even on DNA-seq data, GATK misses many genuine variant calls ('genuine' = confirmed by Sanger). On the other hand, samtools / bcftools mpileup can easily call these. The GATK 'engine' has never been quite right, but they stuck with it without benchmarking against gold-standard technologies in clinical genetics. Out of that came Google's DeepVariant, which I would argue is, at face value, worse than GATK.

See answer here: A: Inferring genotype based on RNA sequnces

8.1 years ago

Variant calling in RNA seq data still remains one of those "just because you can, doesn't mean you should" cases in my eyes. GATK's guide on performing variant calling on RNA seq was probably in large part due to a lot of people asking the question, and the developers taking a good stab at the problem. RNA seq data is not experimentally orientated towards genotypes, so it's far from optimal conditions for calling. I think it's useful for cases such as data gathering, in experiments that you've already carried out, or possibly hypothesis validation.


In that case, the RNA-Seq variants called by HaplotypeCaller are far from accurate. Then why is the RNA-Seq variant calling pipeline in the GATK best practices, and why are so many people using it?


Well, that depends on a lot of things. The best practices from the Broad / GATK are, from what I've seen, the best protocol for going from RNA seq to genotype calling, and they're the only organisation that has put up such detail about the process. The GATK developers even went so far as to re-engineer parts of GATK and invent new tools (such as split'n'trim) to tackle the problem. If you read the caveats section of this article, they state that they know a lot of false positives pass through, and that hard filters aren't perfect. GATK's current variant calling (for DNA seq) relies on large ground-truth sets (VQSR) to accurately call genotypes; those don't exist for RNA seq data, so hard filtering is the only alternative. The bottom line is that people use it because they trust the Broad / GATK (rightfully) and there aren't any better alternatives. This is not to mention that you're trying to use RNA seq data for a purpose it was never intended for.
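To make the hard-filtering idea above concrete, here is a minimal Python sketch. The thresholds (FS > 30.0, QD < 2.0, and clusters of 3 SNPs within a 35 bp window) are the values commonly quoted for RNA-seq hard filtering and are treated here as assumptions, not as something stated in this thread. The function name `hard_filter`, the dict layout, and the exact boundary handling of the cluster rule are hypothetical and only approximate what a real filtering tool does.

```python
def hard_filter(variants, max_fs=30.0, min_qd=2.0, window=35, cluster=3):
    """Flag variants that fail simple hard filters.

    `variants` is a list of dicts with keys "chrom", "pos", "FS" and "QD",
    sorted by (chrom, pos). Returns a list of (variant, reasons) pairs;
    an empty reasons list means the variant passes.
    Thresholds are assumed defaults, not values confirmed by this thread.
    """
    flagged = [(v, []) for v in variants]

    # Per-site filters: strand bias (FS) and quality normalised by depth (QD).
    for v, reasons in flagged:
        if v["FS"] > max_fs:
            reasons.append("FS")
        if v["QD"] < min_qd:
            reasons.append("QD")

    # Cluster filter: `cluster` consecutive variants on the same chromosome
    # that span less than `window` bp are all flagged (approximate rule).
    for i in range(len(variants) - cluster + 1):
        first, last = variants[i], variants[i + cluster - 1]
        if first["chrom"] == last["chrom"] and last["pos"] - first["pos"] < window:
            for _, reasons in flagged[i:i + cluster]:
                if "SnpCluster" not in reasons:
                    reasons.append("SnpCluster")
    return flagged


if __name__ == "__main__":
    example = [
        {"chrom": "chr1", "pos": 100, "FS": 1.0, "QD": 20.0},
        {"chrom": "chr1", "pos": 110, "FS": 45.0, "QD": 15.0},
        {"chrom": "chr1", "pos": 120, "FS": 2.0, "QD": 1.5},
        {"chrom": "chr1", "pos": 5000, "FS": 0.5, "QD": 25.0},
    ]
    for variant, reasons in hard_filter(example):
        print(variant["chrom"], variant["pos"], ";".join(reasons) or "PASS")
```

Fixed cutoffs like these are blunt: a true variant in a tight cluster is thrown out just as readily as an artefact, which is one reason hard filters let false positives through and also discard real calls.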

To tackle your questions one by one:

  1. HaplotypeCaller is still the most advanced method currently available for calling haplotypes.
  2. That's what genotype quality metrics are for: a genotype passes the filters only if it can be called confidently. Granted, this isn't always the case, but as I explained previously, VQSR is better than hard filters at removing false positives.
  3. You're adding a whole other layer of complexity when looking at cancer samples. You're correct that there's no optimal process for going from tumour / normal paired RNA seq to genotype calls.

Overall, the Broad / GATK's pipeline for processing RNA seq to genotype calls is a nice-to-have tool that's very useful in certain conditions, and props to the people who developed it. It's no substitute for doing experiments correctly, i.e. if you want to look at haplotypes, then do DNA sequencing; if you want to look at expression, then do RNA sequencing.


Thanks for the detailed explanation. It is very helpful :)


I am not taking any sides here, but I've heard the argument that RNA-seq is actually more accurate than DNA-seq for variant calling because the variant you are detecting is the one that is actually being expressed.


I don't agree that those variants are more 'accurate'. You could argue that they are more relevant, because they are the ones being expressed. But by that logic you would never identify a nonsense mutation leading to nonsense-mediated mRNA decay and pathogenic haploinsufficiency. Biologically, the most interesting approach could be to perform allele-specific expression analysis by integrating exome/genome sequencing with transcriptome sequencing.
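As a rough illustration of the allele-specific expression idea, the sketch below tests whether RNA-seq read counts at a site already known to be heterozygous (for example, from exome/genome genotypes) deviate from the expected 50/50 ratio. This is only a minimal exact binomial test written for illustration; the function name `allelic_imbalance_p` is hypothetical, and the null ratio of 0.5 is an assumption (reference mapping bias can shift it in real data).

```python
from math import comb


def allelic_imbalance_p(ref_count, alt_count, expected_alt_fraction=0.5):
    """Exact two-sided binomial p-value for ref/alt read counts.

    A small p-value suggests the two alleles are not expressed equally.
    Intended for modest depths; for very deep sites a library routine
    such as scipy.stats.binomtest is a better choice.
    """
    n = ref_count + alt_count
    if n == 0:
        return 1.0  # no reads, no evidence either way
    p = expected_alt_fraction
    # Probability of every possible ALT count from 0 to n under the null.
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[alt_count]
    # Sum all outcomes that are at least as unlikely as the observed one.
    tail = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, tail)


if __name__ == "__main__":
    print(allelic_imbalance_p(10, 10))  # balanced expression
    print(allelic_imbalance_p(20, 0))   # only one allele observed
```

A site with only reference reads in RNA but a heterozygous DNA genotype is exactly the nonsense-mediated decay scenario described above: the variant is real, but RNA-seq alone would never call it.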


Yes, your way could be a solution. Could you share more details on 'allele specific expression analysis', maybe a recommended pipeline or something? That would be really helpful :)


See the discussion here


Since we use GATK, a DNA-Seq variant caller with a probability model based on DNA allele frequencies, to call variants on RNA-Seq, I imagine there will be lots of false negatives (alleles that are not expressed will be missed, as will alleles whose frequency looks odd by DNA standards). At the same time, there will also be many false positives in the result. Correct me if I am wrong.
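To see why a DNA-style diploid model can be misled by expression, here is a deliberately simplified Python sketch of genotype likelihoods. It assumes each read shows the ALT allele with probability `error_rate` for 0/0, 0.5 for 0/1 and `1 - error_rate` for 1/1. This is a textbook toy model, not GATK's actual implementation (which works with base qualities and local reassembly), and the function names and the 1% error rate are assumptions made for illustration.

```python
from math import log10


def genotype_phred_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Phred-scaled likelihoods (best genotype = 0) under a toy diploid model."""
    # Probability that a single read carries the ALT allele, per genotype.
    alt_prob = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    # The binomial coefficient is identical for all genotypes, so it is omitted.
    log_lik = {
        gt: alt_count * log10(p) + ref_count * log10(1.0 - p)
        for gt, p in alt_prob.items()
    }
    best = max(log_lik.values())
    return {gt: round(-10.0 * (ll - best)) for gt, ll in log_lik.items()}


def call_genotype(ref_count, alt_count, error_rate=0.01):
    """Return (genotype, quality): quality is the gap to the runner-up, capped at 99."""
    pl = genotype_phred_likelihoods(ref_count, alt_count, error_rate)
    ranked = sorted(pl, key=pl.get)
    quality = min(99, pl[ranked[1]] - pl[ranked[0]])
    return ranked[0], quality


if __name__ == "__main__":
    print(call_genotype(15, 15))  # balanced reads at a heterozygous site
    print(call_genotype(1, 29))   # same site if one allele is barely expressed
```

With balanced reads the toy model picks 0/1, but when one allele of a truly heterozygous site is barely expressed it picks 1/1 with a quality that would still pass a typical cutoff. The model has no way to tell allelic imbalance from a homozygous genotype without DNA data.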


I think you are right only when the tool can call them accurately. It is hard to imagine that any currently available tool can overcome genomic imprinting and allelic imbalance and make accurate calls without additional information (well, I would like to open a discussion on ways to overcome these issues). Besides, expression is dynamic: a variant missed because it is not expressed now may still be expressed later.
