Dear all,
This question is a bit descriptive, and I need some clarification.
I read the 1000 Genomes Project paper entitled "A global reference for human genetic variation": http://www.nature.com/nature/journal/v526/n7571/full/nature15393.html#supplementary-information
In the supplementary information, section "9.2 Callable genome mask", the authors provide two types of accessible genome mask regions: "Pilot" and "Strict". The stated reason for generating these regions was (quoted):
"Due to the nature of short-read sequencing, the sequencing depth varies along the length of the genome. As such, not all regions of the genome will have equal power for variant discovery. To provide an assessment of the regions of the genome that are accessible to the next-generation sequencing methods used in Phase 3, we created two genome masks".
The most recent versions of the BED files are provided here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/accessible_genome_masks/
My purpose: I want to use these genomic masks as hard filters in my variant filtering step, because population genetic analyses (such as estimates of mutation rate) must focus on genomic regions with very low false positive and false negative rates. In short, I want to obtain high-confidence variant sites.
Problem: I am a bit confused about whether I should keep the variants present in these regions (i.e. the accessible mask regions) or ignore them. I had 9556898 bi-allelic SNPs in total. I used the "20141020.strict_mask.whole_genome.bed" file and filtered out the 6611479 variants present in these regions, which left 2945419 variants. In this way more than 50% of the variants (about 69%) were lost.
I am unsure whether the SNPs present in these regions should be kept or filtered out; maybe I am misreading the word "mask" in the file names.
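To make the two options concrete, here is a minimal Python sketch of the choice I am asking about. It is only an illustration, not necessarily how the filtering should be done: the function names (load_bed, in_regions, split_by_mask) and the toy coordinates are made up, and it assumes that BED intervals are 0-based and half-open, that SNP positions are 1-based as in a VCF, and that chromosome names match between the two files.

```python
from bisect import bisect_right
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: (starts, ends)} with sorted, merged intervals."""
    raw = defaultdict(list)
    for line in lines:
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        starts, ends = [], []
        for start, end in intervals:
            if ends and start <= ends[-1]:
                # Overlapping or abutting interval: extend the previous one.
                ends[-1] = max(ends[-1], end)
            else:
                starts.append(start)
                ends.append(end)
        merged[chrom] = (starts, ends)
    return merged


def in_regions(regions, chrom, pos):
    """Return True if the 1-based position lies inside a BED interval."""
    if chrom not in regions:
        return False
    starts, ends = regions[chrom]
    zero_based = pos - 1  # convert VCF-style position to BED coordinates
    i = bisect_right(starts, zero_based) - 1
    return i >= 0 and zero_based < ends[i]


def split_by_mask(regions, variants):
    """Split (chrom, pos) variants into those inside and outside the regions."""
    inside, outside = [], []
    for chrom, pos in variants:
        target = inside if in_regions(regions, chrom, pos) else outside
        target.append((chrom, pos))
    return inside, outside


if __name__ == "__main__":
    # Toy data only; these are not real mask intervals or real SNPs.
    toy_bed = ["chr1\t100\t200\tstrict", "chr1\t300\t400\tstrict"]
    toy_variants = [("chr1", 100), ("chr1", 101), ("chr1", 200),
                    ("chr1", 250), ("chr2", 150)]
    regions = load_bed(toy_bed)
    inside, outside = split_by_mask(regions, toy_variants)
    print(len(inside), "SNPs inside the BED regions")
    print(len(outside), "SNPs outside the BED regions")
```

In these terms, what I did was discard the "inside" list and keep the "outside" list; my question is whether it should be the other way round.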
Can anyone explain how these files are meant to be used?
Thank you very much in advance for your patience and help!