Question

Counting SNPs From 1000 Genomes Data

12.4 years ago

Matt W ▴ 250

I am looking at the 1000 Genomes data found here: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/exon/snps/

I am trying to count the number of SNPs for the CEU individuals in this list: http://pastebin.com/JUwNLh9E

Each time I try, my count is several times higher than expected. I used vcftools:

vcftools --vcf CEU.exon.2010_03.genotypes.vcf --keep keep.txt --out vcfoutput/CEU_targets --freq --recode

Where keep.txt has the list in pastebin.

I then counted the lines in the recoded file, since each non-header line should represent one SNP. It has 3489 lines excluding the header, but according to the paper I am referencing (table 2) there should be only 826 SNPs shared between this data and the HapMap data. Why is my count so much higher?

Thanks in advance! -Matt

EDIT: I am counting the SNPs correctly, but I don't know how to restrict the count to my region of interest (ROI). The paper states "we restricted the analysis to the 470 kb of sequence that overlapped with the exon capture boundaries of the 1000 Genomes pilot project". I'm not sure how to do this, so if anyone has some insight, it would be greatly appreciated!
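One possible way to apply such a restriction, assuming the capture target coordinates can be obtained as a BED file (no such file is linked in this thread, so its name and contents are hypothetical): keep only the VCF records whose position falls inside a target interval. vcftools has a --bed option intended for this kind of site filtering. The Python sketch below shows the same logic so the coordinate handling is explicit; the function names are illustrative, and it assumes chromosome names match between the two files (for example "1" in both, not "1" and "chr1").

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines (chrom, 0-based start, exclusive end) into
    sorted, merged intervals per chromosome."""
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or adjacent: extend the previous interval.
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged


def in_targets(targets, chrom, pos):
    """True if the 1-based VCF position lies inside a target interval."""
    if chrom not in targets:
        return False
    starts, ends = targets[chrom]
    zero_based = pos - 1  # BED is 0-based, VCF is 1-based
    i = bisect.bisect_right(starts, zero_based) - 1
    return i >= 0 and zero_based < ends[i]


def count_sites_in_targets(vcf_lines, targets):
    """Count non-header VCF records that fall inside the targets."""
    count = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if in_targets(targets, chrom, int(pos)):
            count += 1
    return count
```

With real files this could be called as `count_sites_in_targets(open("CEU_targets.recode.vcf"), load_bed(open("targets.bed")))`, where both file names are placeholders.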

1000genomes snp vcf vcftools • 3.1k views

score 1 · Answer 1 · 2012-07-21

12.3 years ago

Adam ★ 1.0k

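You might want to add the --maf 0.000001 option to your command to remove SNPs that are not polymorphic in your sample of individuals. The --keep option only drops individuals; a site stays in the output even if none of the kept individuals carries the variant allele.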
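To illustrate what that filter is meant to achieve, here is a minimal Python sketch that counts only the sites where the kept individuals carry more than one distinct allele. It is an approximation of the idea, not a reimplementation of vcftools: the function name is made up for this example, and it assumes tab-separated VCF lines with a GT entry in the FORMAT column.

```python
import re


def count_polymorphic(vcf_lines, keep_ids):
    """Count sites with at least two distinct called alleles among
    the samples named in keep_ids."""
    keep_ids = set(keep_ids)
    sample_cols = []
    count = 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample columns start at index 9 in a VCF with genotypes.
            sample_cols = [i for i, name in enumerate(fields)
                           if i >= 9 and name in keep_ids]
            continue
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_idx = fmt.index("GT")
        alleles = set()
        for i in sample_cols:
            parts = fields[i].split(":")
            gt = parts[gt_idx] if gt_idx < len(parts) else "."
            # Split phased (|) or unphased (/) genotypes; skip missing calls.
            alleles.update(a for a in re.split(r"[/|]", gt) if a != ".")
        if len(alleles) > 1:
            count += 1
    return count
```

A site where every kept individual is homozygous for the same allele (reference or alternate) is not counted, which matches the intent of a very small --maf threshold for biallelic SNPs.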

Regards,

Adam

That cut it down to 2274, but this is still significantly higher than the expected 826. Thanks for your help. Any other thoughts?

Reply • 12.3 years ago by Matt W ▴ 250