Question

Periodicity in SNP calling quality - is this normal?

3

Entering edit mode

9.9 years ago

John 13k

Hello all!

After calling SNPs on some RNA data to produce a gVCF file, I ran a little python script to see the distribution of calling quality across the different genotypes:

x-axis is quality score rounded to the nearest integer
y-axis is the number of variants at that quality score

As you can see, its mostly heterozygous variants, which is what I expect since this data comes from highly inbred mice. What I didn't expect however is the periodicity. Is that normal?

Now I presumably I need to filter these variants on some number of quality score, and from this I really dont know where. 0? 50? 75?

Code to generate this data:

#!/usr/bin/env python2.7
import collections
with open('/home/john/overnight/outputs/ctrl_all_FVB.gvcf', 'rb') as f:
    data = {}
    for line in f:
        if line[0] == '#': continue
        line = line.split('\t')
        if line[5] == '.': continue
        gt = line[9][:3]
        try: data[gt][int(float(line[5]))] += 1
        except KeyError: data[gt] = collections.defaultdict(int)
for gt,qualities in data.items():
    print '\n',gt
    for qual,count in sorted(qualities.items()):
        print qual,count

UPDATE

The Dev team at GATK say this is normal and not something to worry about :)

Apparently one should not filter a gvcf directly, but rather call variants first using GenotypeGVCFs, then filter in the more standard way on that output. GATK forum post: http://gatkforums.broadinstitute.org/discussion/6216/periodicity-in-variant-calling-quality-is-this-normal

RNA-Seq VCF GATK SNP • 2.1k views

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 9.9 years ago by John 13k

2

Entering edit mode

Interesting observation, I've actually seen this exact thing before...also with GATK's haplotype caller, though I don't recall if I was looking a variants from RNAseq or DNAseq reads. I've sort of assumed that this is some oddity of the algorithm.

ADD REPLY • link 9.9 years ago by Devon Ryan 105k

1

Entering edit mode

Yeah I guess so - I'll go ask over at the GATK forums since this is probably something to do with their caller and not, hopefully, my data :P ;)

I will update this thread if I hear anything from them about this, and update the other thread (about gvcf calling for RNA data) once I've got a working pipeline!

ADD REPLY • link updated 2.9 years ago by Ram 45k • written 9.9 years ago by John 13k