Periodicity in SNP calling quality - is this normal?
0
3
Entering edit mode
9.2 years ago
John 13k

Hello all!

After calling SNPs on some RNA data to produce a gVCF file, I ran a little python script to see the distribution of calling quality across the different genotypes:

  • x-axis is quality score rounded to the nearest integer
  • y-axis is the number of variants at that quality score

As you can see, its mostly heterozygous variants, which is what I expect since this data comes from highly inbred mice. What I didn't expect however is the periodicity. Is that normal?

Now I presumably I need to filter these variants on some number of quality score, and from this I really dont know where. 0? 50? 75?

Code to generate this data:

#!/usr/bin/env python2.7
import collections
with open('/home/john/overnight/outputs/ctrl_all_FVB.gvcf', 'rb') as f:
    data = {}
    for line in f:
        if line[0] == '#': continue
        line = line.split('\t')
        if line[5] == '.': continue
        gt = line[9][:3]
        try: data[gt][int(float(line[5]))] += 1
        except KeyError: data[gt] = collections.defaultdict(int)
for gt,qualities in data.items():
    print '\n',gt
    for qual,count in sorted(qualities.items()):
        print qual,count

UPDATE

The Dev team at GATK say this is normal and not something to worry about :)

Apparently one should not filter a gvcf directly, but rather call variants first using GenotypeGVCFs, then filter in the more standard way on that output. GATK forum post: http://gatkforums.broadinstitute.org/discussion/6216/periodicity-in-variant-calling-quality-is-this-normal

RNA-Seq VCF GATK SNP • 1.9k views
ADD COMMENT
2
Entering edit mode

Interesting observation, I've actually seen this exact thing before...also with GATK's haplotype caller, though I don't recall if I was looking a variants from RNAseq or DNAseq reads. I've sort of assumed that this is some oddity of the algorithm.

ADD REPLY
1
Entering edit mode

Yeah I guess so - I'll go ask over at the GATK forums since this is probably something to do with their caller and not, hopefully, my data :P ;)

I will update this thread if I hear anything from them about this, and update the other thread (about gvcf calling for RNA data) once I've got a working pipeline!

ADD REPLY

Login before adding your answer.

Traffic: 2357 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6