I'm looking at germline variant profiles of cancer patients that have been processed using the GATK pipeline. My lab is looking for mutations that might be useful as biomarkers.
At the moment, I am trying to decide what a good minimum coverage/quality filter would be and would appreciate any suggestions, links, etc. on how to make this call. I understand that more coverage and higher quality basically mean the reads are more trustworthy. I've heard people say at journal clubs and paper presentations that a Q-score of 30 or more is a good cut-off, but I don't really know why, or how to decide whether it's suitable for my purposes.
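For context on where cut-offs like Q30 come from: a Phred quality score Q encodes the estimated probability p that a call is wrong via Q = -10 * log10(p), so Q30 means roughly a 1-in-1000 error chance and Q20 a 1-in-100 chance. A minimal sketch of that conversion:

```python
# Phred quality score: Q = -10 * log10(p), where p is the
# estimated probability that the call is wrong.

def phred_to_error_prob(q):
    """Convert a Phred quality score to an estimated error probability."""
    return 10 ** (-q / 10)

print(phred_to_error_prob(30))  # 0.001  (1 in 1,000)
print(phred_to_error_prob(20))  # 0.01   (1 in 100)
```

So Q30 is not magic; it is just the point where the per-call error estimate drops to about one in a thousand, which many groups treat as "good enough" for most purposes.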
Did you do whole genome sequencing or whole exome sequencing? What is your expected coverage across the genome or across the capture targets?
This was whole exome sequencing. I don't know what "expected coverage" means...
In our lab, whole exome sequencing typically yields ~80 million 100 bp reads against an exome capture target region of ~80 million bp. Approximately 80% of our sequence reads align to the capture targets, so we end up with an average coverage (or expected coverage) of ~80x across the target regions.
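The arithmetic behind that 80x figure, using the example numbers above (your run's numbers will differ):

```python
# Back-of-the-envelope expected coverage from the figures quoted above.
# These are example values from one lab's setup, not universal constants.
total_reads = 80_000_000       # ~80 million reads
read_length = 100              # 100 bp per read
on_target_fraction = 0.80      # ~80% of reads align to capture targets
target_size = 80_000_000       # ~80 million bp of exome capture targets

expected_coverage = total_reads * read_length * on_target_fraction / target_size
print(expected_coverage)  # 80.0 (i.e. ~80x average coverage)
```

The same formula applied to your own read counts and capture kit size will tell you what average depth to expect, which is the baseline against which "too low coverage" should be judged.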
I see. I'm coming into a project after the sequencing has been done, pumped through a GATK pipeline and dumped into a relational database. My aim at this point is to exclude variants whose coverage/quality is too low for them to be worth considering as candidates.
Your comment suggests to me that I might want to look at the distribution of coverage across all the variants and make a call based on that. I had, perhaps naively, assumed that coverage was a standard value, and that once you passed a certain threshold you could be reasonably confident in the sequence as long as the quality score was also high enough.
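Looking at the empirical distribution is a reasonable way to pick a threshold. A rough sketch, assuming you can pull each variant's read depth into a list (e.g. from the INFO/DP field of a VCF, or a depth column in your database; the function name and toy data below are mine, for illustration):

```python
# Hypothetical sketch: summarise the per-variant read-depth distribution
# so a coverage filter can be chosen empirically rather than by rule of thumb.
import statistics

def coverage_summary(depths):
    """Return simple distribution statistics for per-variant read depths."""
    depths = sorted(depths)
    return {
        "n": len(depths),
        "median": statistics.median(depths),
        "deciles": statistics.quantiles(depths, n=10),  # 10%...90% cut points
    }

# Toy data standing in for real per-variant DP values:
example_depths = [3, 8, 15, 22, 30, 41, 55, 70, 90, 120]
print(coverage_summary(example_depths))
```

Variants far below the bulk of the distribution (say, below the 10th percentile, or below a fixed floor like DP < 10) are the ones most likely to carry unreliable genotype calls, and that is where a cut-off grounded in your own data beats a borrowed one.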