I'm analysing the output of a variant calling pipeline (VCP) run on human exomes, with the aim of easier data inspection and better data representation for, e.g., personalized medicine. At the moment, the VCP typically reports more than 100k SNPs. I would like to filter the results against known variation in the human population.
This kind of filtering is performed in most studies that, e.g., try to predict drug sensitivity from NGS results. However, the tools used (if any) are usually not named. In their excellent article "The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity", Barretina et al. give the following details in the supplementary material.
Variant filtration by exclusion of common germline variants: Variants for which the global allele frequency (GAF) in dbSNP134 or allele frequency in the NHLBI Exome Sequencing Project (http://evs.gs.washington.edu/EVS, data release ESP2500) was higher than 0.1% were excluded from further analysis.
Is the NHLBI Exome Sequencing Project batch query tool the best tool for this kind of filtering? Could you also comment on the threshold frequency (0.1%) used in the study?
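To make clear what I mean by this first filter, here is a minimal Python sketch of how I read the rule quoted above. It assumes the variants have already been annotated with population frequencies; the field names `dbsnp_gaf` and `esp_af`, the helper names, and the example records are my own placeholders, not the output of any real tool. A variant with no known frequency is treated as novel and kept, which is my assumption rather than something the supplement states.

```python
from typing import List, Optional, TypedDict


class Variant(TypedDict):
    chrom: str
    pos: int
    ref: str
    alt: str
    dbsnp_gaf: Optional[float]  # hypothetical field: dbSNP global allele frequency
    esp_af: Optional[float]     # hypothetical field: ESP allele frequency


MAX_AF = 0.001  # 0.1 %, the threshold quoted from the supplement


def is_common(variant: Variant, max_af: float = MAX_AF) -> bool:
    """True if either annotated frequency is higher than the threshold."""
    freqs = (variant["dbsnp_gaf"], variant["esp_af"])
    # A missing frequency (None) never counts as common (assumption).
    return any(f is not None and f > max_af for f in freqs)


def filter_common(variants: List[Variant], max_af: float = MAX_AF) -> List[Variant]:
    """Keep only the variants that are not common in either resource."""
    return [v for v in variants if not is_common(v, max_af)]


# Made-up example records, for illustration only.
variants: List[Variant] = [
    {"chrom": "1", "pos": 1000, "ref": "A", "alt": "G", "dbsnp_gaf": 0.25, "esp_af": None},
    {"chrom": "2", "pos": 2000, "ref": "C", "alt": "T", "dbsnp_gaf": 0.0005, "esp_af": 0.0002},
    {"chrom": "3", "pos": 3000, "ref": "G", "alt": "A", "dbsnp_gaf": None, "esp_af": None},
    {"chrom": "4", "pos": 4000, "ref": "T", "alt": "C", "dbsnp_gaf": None, "esp_af": 0.002},
]

kept = filter_common(variants)
print([v["pos"] for v in kept])
```

The comparison is strict (`>`), following the wording "higher than 0.1%", so a variant at exactly 0.1% would be kept in this sketch.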
Variant filtration by exclusion of variants observed in a panel of normals: Variants detected in a panel of 278 whole exomes sequenced at the Broad as part of the 1000 Genomes Project were excluded from further analysis. Beyond removal of additional germline variation, this step also allowed elimination of common false positives that originate predominantly from alignment artifacts.
Here, no particular tool is named. Are there any freely available tools for this kind of filtering?
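For reference, this is how I understand the panel-of-normals step, again as a rough Python sketch rather than the authors' actual method. It assumes each call set has been reduced to `(chrom, pos, ref, alt)` tuples with identical normalisation; the function names and the example calls are hypothetical. Any variant seen in at least one normal sample is dropped, which is my reading of "detected in a panel".

```python
from typing import Iterable, List, Set, Tuple

# A variant is identified by chromosome, position, reference and alternate allele.
VariantKey = Tuple[str, int, str, str]


def build_panel(normal_callsets: Iterable[Iterable[VariantKey]]) -> Set[VariantKey]:
    """Union of all variants observed in any normal sample."""
    panel: Set[VariantKey] = set()
    for callset in normal_callsets:
        panel.update(callset)
    return panel


def filter_against_panel(calls: Iterable[VariantKey], panel: Set[VariantKey]) -> List[VariantKey]:
    """Keep only the calls that are absent from the panel of normals."""
    return [c for c in calls if c not in panel]


# Made-up example data, for illustration only.
normal_callsets = [
    [("1", 1000, "A", "G")],
    [("1", 1000, "A", "G"), ("7", 5000, "G", "GA")],
]
sample_calls = [
    ("1", 1000, "A", "G"),
    ("2", 2000, "C", "T"),
    ("7", 5000, "G", "GA"),
]

panel = build_panel(normal_callsets)
remaining = filter_against_panel(sample_calls, panel)
print(remaining)
```

Exact tuple matching only works if indels are left-aligned and multi-allelic sites are split the same way in every call set, so in practice a normalisation step would be needed before this comparison.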
A more advanced topic: implementing this kind of filtering at earlier stages of the VCP is an interesting idea that has already been discussed among computer scientists (see, e.g., http://arxiv.org/abs/1010.2656). Are there any existing VCPs that can call variants against several reference genomes? What do you think of this idea in general?