Counting SNPs from 1000 Genomes Data
12.4 years ago
Matt W ▴ 250

I am looking at the 1000 Genomes data found here: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/exon/snps/

I am trying to count the number of SNPs among the CEU individuals in this list: http://pastebin.com/JUwNLh9E

Each time I try, my count is almost double what is expected. I used vcftools:

vcftools --vcf CEU.exon.2010_03.genotypes.vcf --keep keep.txt --out vcfoutput/CEU_targets --freq --recode

Here, keep.txt contains the list from Pastebin.

I then counted the lines in the recoded file, since each non-header line should represent one SNP. It has 3489 lines excluding the header, but according to Table 2 of the paper I am referencing, there should be only 826 between this data and HapMap data. Why is my count so much higher?

Thanks in advance! -Matt

EDIT: I am counting the SNPs correctly, but I don't know how to restrict my region of interest (ROI). The paper states "we restricted the analysis to the 470 kb of sequence that overlapped with the exon capture boundaries of the 1000 Genomes pilot project". I'm not sure how to do this, so if anyone has some insight, it would be greatly appreciated!
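
One way to picture the restriction is as an interval filter: keep only the VCF records whose position falls inside a set of target regions. The sketch below is a minimal Python illustration of that idea, not the paper's method. It assumes the capture boundaries are available as a BED file (0-based, half-open intervals) and that chromosome names match between the BED file and the VCF (for example "1" in both, not "1" and "chr1"). The function names and the file names in the usage comment are hypothetical. vcftools also has a --bed option that restricts sites to the intervals in a BED file, which may be the simpler route if such a file exists.

```python
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    regions = defaultdict(list)
    for line in lines:
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    return regions


def count_sites_in_regions(vcf_lines, regions):
    """Count VCF records whose 1-based POS lies inside any BED interval."""
    count = 0
    for line in vcf_lines:
        # Skip VCF meta-information, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t", 2)[:2]
        pos0 = int(pos) - 1  # convert the 1-based VCF position to 0-based
        if any(start <= pos0 < end for start, end in regions.get(chrom, ())):
            count += 1
    return count


# Hypothetical usage (file names are assumptions, not from the thread):
# with open("targets.bed") as bed, open("CEU_targets.recode.vcf") as vcf:
#     print(count_sites_in_regions(vcf, load_bed(bed)))
```

The linear scan over intervals is fine for a few thousand exon targets; a sorted list with binary search would be the next step for larger region sets.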

1000genomes snp vcf vcftools • 3.1k views
12.3 years ago
Adam ★ 1.0k

You might want to add the --maf 0.000001 option to your command in order to remove SNPs that are not polymorphic in your sample of individuals.
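
To see what this filter is doing, here is a rough Python sketch that counts only the sites where the kept samples carry more than one distinct allele. It is an illustration of the idea rather than a reimplementation of vcftools: it assumes a tab-delimited VCF that has already been subset to the samples of interest (for example the recoded file), with a FORMAT column that contains a GT entry, and it treats a site as polymorphic when at least two different alleles are observed among the called genotypes. The function names are hypothetical.

```python
import re


def observed_alleles(vcf_line):
    """Return the set of allele codes seen in the called genotypes of one record."""
    fields = vcf_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # position of GT within FORMAT
    alleles = set()
    for sample in fields[9:]:
        parts = sample.split(":")
        if gt_index >= len(parts):
            continue
        # Genotypes look like 0/1 (unphased) or 0|1 (phased); "." means missing.
        for allele in re.split(r"[/|]", parts[gt_index]):
            if allele != ".":
                alleles.add(allele)
    return alleles


def count_polymorphic(vcf_lines):
    """Count records in which more than one distinct allele is observed."""
    return sum(
        1
        for line in vcf_lines
        if line.strip()
        and not line.startswith("#")
        and len(observed_alleles(line)) > 1
    )
```

Sites where every kept individual is homozygous for the same allele drop out of this count, which is why filtering after subsetting the samples lowers the total.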

Regards,

Adam

That cut the count down to 2274, but this is still far higher than the expected 826. Thanks for your help. Any other thoughts?
