RNA variant calling, aligned with HiSat2
2
0
Entering edit mode
3.4 years ago

Trying to do my variant calling on RNA paired tumour-normal pairs with Mutect2.

Reads were aligned with Hisat2 and alignment is 97-98%.

I ran Mutect 2 and it gave me about 100 SNPs and 100 indels (across the whole exome). I validated these against known variants from previous targeted sequencing on the same subject.

I then went back and did some preprocessing- marking duplicates, splitNCigar and BQSR.

The SNPs are about the same- slightly fewer but probably just increased sensitivity- but I now have 5000 indels.

I have a feeling that SplitNCigar should not be used on Hisat2 aligned reads and this is the reason for these 'indels'.

Has anyone else found similar? Can I variant call using Hisat2 aligned reads, and just skip the SplitNCigar step?

hisat2 gatk splitncigar • 1.9k views
ADD COMMENT
2
Entering edit mode
3.4 years ago

The reason you have 5000 indels is probably because some spliced reads are now (for whatever reason) being considered indels. Since GATK is mainly used for DNA-seq, then it's not really a surprise.

The SplitNCigar step is needed for running BQSR but to what extent this is even helpful for RNA-seq is unclear.

You could be correct in thinking that HISAT2 alignments don't have the right flags/scores to be compatible with GATK's SplitNCigar, to verify this I would suggest re-aligning with STAR (2-pass mode) as suggested in the link from ATpoint. I do recall that bowtie had some issues with GATK so it could be that those also exist in Hisat2.

Either way I would be extremely skeptical of any indels from RNA-seq data... probably safest to focus on SNVs. If you are extremely interested in specific INDELs, I would make sure to go look at them in IGV and verify that they are real before moving forward.

ADD COMMENT
1
Entering edit mode

I far as I understand it, SplitNCigar was designed specifically to make data alignmed with a splice-aware aligner compatible with the GATK pipeline. Its clearly helpful for RNAseq because it takes a spliced read and turns it into two non-spliced reads.

ADD REPLY
0
Entering edit mode

Thanks- I think after some discussion that the best thing to do is just focus on SNVs.

ADD REPLY
1
Entering edit mode
3.4 years ago
ATpoint 85k

As the process of variant calling in RNA-seq is even less standardized as compared to DNA-seq, and given that RNA-seq is inferior than DNA-seq for it, I would at least (and simply) follow the guidelines Broad has put out:

https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-

Not saying it is perfect (in fact I do not know) but at least it is a path to follow and a somewhat citable reference.

ADD COMMENT

Login before adding your answer.

Traffic: 1636 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6