I am looking at the 1000 Genomes data found here: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/exon/snps/
I am trying to count the number of SNPs for the CEU individuals in this list: http://pastebin.com/JUwNLh9E
Each time I try, my count is almost double what is expected. I used vcftools:
vcftools --vcf CEU.exon.2010_03.genotypes.vcf --keep keep.txt --out vcfoutput/CEU_targets --freq --recode
Here keep.txt contains the list from pastebin.
I then counted the lines in the recoded file, because each line should represent one SNP. It has 3489 lines without the header, but according to a paper that I am referencing (table 2) there should be only 826 between this data and the HapMap data. Why are my numbers so much higher?
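For reference, this is roughly how the line count can be reproduced: a minimal Python sketch, assuming the recoded file is a standard VCF in which header lines start with `#` and every other non-empty line is one variant site. The function name `count_vcf_records` is just illustrative and not part of vcftools.

```python
import gzip


def count_vcf_records(path):
    """Count data lines (one per variant site) in a VCF file.

    Header lines (starting with '#') and blank lines are skipped.
    Plain-text and gzip-compressed files are both accepted.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )
```

Calling `count_vcf_records(...)` on the recoded VCF should give the same number as counting non-header lines by hand. Note that this counts sites, so a multi-allelic site still counts once.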
Thanks in advance! -Matt
EDIT: I am counting the SNPs correctly, but I don't know how to restrict my region of interest (ROI). The paper states "we restricted the analysis to the 470 kb of sequence that overlapped with the exon capture boundaries of the 1000 Genomes pilot project". I'm not sure how to do this, so if anyone has some insight, it would be greatly appreciated!
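One way this restriction might be done, assuming the capture boundaries are available as a BED file of target intervals (I have not confirmed which file the paper used): keep only the VCF records whose position falls inside one of the intervals. The sketch below is a rough Python version of that idea; the function names are made up for illustration. vcftools also has a `--bed` option that filters sites by a BED file, which may be the simpler route.

```python
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]}.

    BED coordinates are 0-based with an exclusive end.
    """
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    return regions


def in_regions(chrom, pos, regions):
    """Return True if a 1-based VCF position lies inside any BED interval."""
    # VCF POS is 1-based, so 0-based [start, end) covers start < pos <= end.
    return any(start < pos <= end for start, end in regions.get(chrom, ()))


def filter_vcf(vcf_lines, regions):
    """Yield header lines unchanged and only those records inside the regions."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if in_regions(chrom, int(pos), regions):
            yield line
```

Two things to watch for with this approach: the chromosome names must match between the two files (`1` versus `chr1`), and the BED and VCF must be on the same genome build, otherwise the overlap count will be wrong.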
That cut it down to 2274, but this is still significantly higher than the expected 826. Thanks for your help. Any other thoughts?