I'm looking at germline variant profiles of cancer patients that have been processed using the GATK pipeline. My lab is looking for mutations that might be useful as biomarkers.
At the moment, I am trying to decide what a good minimum coverage/quality filter would be and would appreciate any suggestions, links, etc. on how to make this call. I understand that more coverage and higher quality basically mean the reads are more trustworthy. I've heard people say at journal clubs and paper presentations that a Q-score of 30 or more is a good cut-off, but I don't really know why, or how to decide whether it's suitable for my purposes.
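For context on where cut-offs like Q30 come from: a Phred quality score Q encodes the estimated probability p that a call is wrong via Q = -10 * log10(p), so Q30 means roughly a 1-in-1000 error chance and Q20 a 1-in-100 chance. A minimal sketch of that conversion:

```python
# Phred quality score: Q = -10 * log10(p), where p is the
# estimated probability that the call is wrong.

def phred_to_error_prob(q):
    """Convert a Phred quality score to an estimated error probability."""
    return 10 ** (-q / 10)

print(phred_to_error_prob(30))  # 0.001  (1 in 1,000)
print(phred_to_error_prob(20))  # 0.01   (1 in 100)
```

So Q30 is not magic; it is just the point where the per-call error estimate drops to about one in a thousand, which many groups treat as "good enough" for most purposes.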
Did you do whole genome sequencing or whole exome sequencing? What is your expected coverage across the genome or across the capture targets?
This was whole exome sequencing. I don't know what "expected coverage" means...
In our lab, whole exome sequencing typically yields ~80 million 100 bp reads against an exome capture target region of ~80 million bp. Approximately 80% of our sequence reads align to the capture targets, so we end up with an average coverage (or expected coverage) of ~80x across the target regions.
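The arithmetic behind that 80x figure, using the example numbers above (your run's numbers will differ):

```python
# Back-of-the-envelope expected coverage from the figures quoted above.
# These are example values from one lab's setup, not universal constants.
total_reads = 80_000_000       # ~80 million reads
read_length = 100              # 100 bp per read
on_target_fraction = 0.80      # ~80% of reads align to capture targets
target_size = 80_000_000       # ~80 million bp of exome capture targets

expected_coverage = total_reads * read_length * on_target_fraction / target_size
print(expected_coverage)  # 80.0 (i.e. ~80x average coverage)
```

The same formula applied to your own read counts and capture kit size will tell you what average depth to expect, which is the baseline against which "too low coverage" should be judged.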
I see. I'm coming into a project after the sequencing has been done, pumped through a GATK pipeline and dumped into a relational database. My aim at this point is to exclude variants whose coverage/quality is too low for them to be worth considering as candidates.
Your comment suggests to me that I might want to look at the distribution of coverage across all the variants and make a call based on that. I had, perhaps naively, assumed that coverage was a standard value, and that once you passed a certain threshold you could be reasonably confident in the sequence as long as the quality score was also high enough.
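Looking at the empirical distribution is a reasonable way to pick a threshold. A rough sketch, assuming you can pull each variant's read depth into a list (e.g. from the INFO/DP field of a VCF, or a depth column in your database; the function name and toy data below are mine, for illustration):

```python
# Hypothetical sketch: summarise the per-variant read-depth distribution
# so a coverage filter can be chosen empirically rather than by rule of thumb.
import statistics

def coverage_summary(depths):
    """Return simple distribution statistics for per-variant read depths."""
    depths = sorted(depths)
    return {
        "n": len(depths),
        "median": statistics.median(depths),
        "deciles": statistics.quantiles(depths, n=10),  # 10%...90% cut points
    }

# Toy data standing in for real per-variant DP values:
example_depths = [3, 8, 15, 22, 30, 41, 55, 70, 90, 120]
print(coverage_summary(example_depths))
```

Variants far below the bulk of the distribution (say, below the 10th percentile, or below a fixed floor like DP < 10) are the ones most likely to carry unreliable genotype calls, and that is where a cut-off grounded in your own data beats a borrowed one.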