I'm trying to keep only single nucleotide variants (which some people call SNPs) from the 1000 Genomes Project data (for example, the newest phase 3 release, from 2013).
The bcftools manual says, under "view":
-v, --types snps|indels|mnps|other comma-separated list of variant types to select. Site is selected if any of the ALT alleles is of the type requested. Types are determined by comparing the REF and ALT alleles in the VCF record not INFO tags like INFO/INDEL or INFO/VT. Use --include to select based on INFO tags.
I haven't checked the subset file yet, but according to this description, "-v" compares the REF and ALT alleles to decide whether a variant is a SNP. I am worried that multi-allelic single nucleotide sites will also be excluded, because the ALT column will contain a string longer than one character (e.g. "A,T" in the ALT column).
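To make the worry about "A,T" concrete, here is a rough sketch of a per-allele check, based only on the manual's statement that a site is selected if any of the ALT alleles is of the requested type. It is not the actual bcftools implementation: the four records are made up, and the "single-base REF and single-base ALT allele" test is a simplifying assumption about what counts as a SNP.

```shell
# Made-up, tab-separated records: CHROM POS ID REF ALT
kept=$(printf '1\t100\t.\tA\tG\n1\t200\t.\tC\tA,T\n1\t300\t.\tG\tGA\n1\t400\t.\tT\tTA,C\n' |
awk -F'\t' '{
    n = split($5, alt, ",")   # ALT is a comma-separated list of alleles
    keep = 0
    for (i = 1; i <= n; i++)
        # crude SNP test (assumption): single-base REF and single-base ALT allele
        if (length($4) == 1 && alt[i] ~ /^[ACGT]$/) keep = 1
    if (keep) print $2, $5    # POS and ALT of sites with at least one SNP allele
}')
echo "$kept"
```

Under these assumptions the site with ALT "A,T" is kept, because each allele is examined separately rather than the whole ALT string; only the pure insertion at position 300 is dropped.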
I tried to filter on the INFO column instead, with:
bcftools view --include 'VT=SNP'
but an error message appeared, saying:
the tag "INFO/SNP" is not defined in the VCF header
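One possible cause, offered as a guess rather than a confirmed diagnosis: the error mentions a tag called INFO/SNP, which suggests the unquoted SNP was read as a tag name and not as a string value. The sketch below wraps the value in double quotes inside the single-quoted expression. The file names input.vcf.gz and snps_only.vcf.gz are hypothetical placeholders, and the command is only run if bcftools and the input file are actually present.

```shell
# Hypothetical input path; replace with the real 1000 Genomes VCF.
vcf="input.vcf.gz"
# Double quotes inside the expression mark SNP as a string value, not a tag name.
expr='INFO/VT="SNP"'
if command -v bcftools >/dev/null 2>&1 && [ -f "$vcf" ]; then
    # -O z writes bgzip-compressed VCF; -o names the (hypothetical) output file.
    bcftools view --include "$expr" -O z -o snps_only.vcf.gz "$vcf"
else
    echo "would run: bcftools view --include '$expr' -O z -o snps_only.vcf.gz $vcf"
fi
```

Note that an exact match on "SNP" would presumably skip records whose VT value lists more than one type, so the result may differ from what -v snps selects.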
My questions are:
(1) How can I obtain only variants that have "VT=SNP" in the INFO column?
(2) Does "-v snps" retain variants with low allele frequency? I ask because "SNP" is sometimes taken to mean a common single nucleotide variant (allele frequency > 0.1% or 0.2%).
Thanks!
Thanks for the answer! I have been using vcftools and it works great, but bcftools seems to run faster than vcftools. I'm still confused about how to set up the bcftools command, so I will wait for other answers. Thanks!