Best way to filter a VCF file using a list of SNP IDs and ref/alt alleles
8.1 years ago
auraf85 ▴ 20

Hi,

I need to filter a VCF file, keeping only those SNPs that match a separate list with 3 columns: SNP ID, reference allele and alternate allele.

I am very new to this kind of procedure, so I am trying to work out the most effective strategy.

It has been suggested that I use VCFtools or BCFtools, but I am not sure whether they can also select variants on the basis of their ref/alt alleles. Is it possible to do this using only the command line?

Thank you

vcf bcftools • 5.8k views
8.1 years ago

If the IDs in your separate list are written exactly as they appear in the VCF, then a simple grep should work:

grep -f ID.list full.vcf > filtered.vcf

Edit: I just realized that this command removes the header lines. The quickest fix is to add a line containing only the '#' character at the top of ID.list, so that the header lines match as well.
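
A variant of the same idea, as a rough sketch: print the header lines separately, and match the IDs as fixed whole-word strings (`-F`, `-w`) instead of regular expressions, which is usually much faster with a long pattern file. The function name `grep_vcf_by_ids` is made up for illustration, and ID.list is assumed to hold one ID per line with no blank lines. Note that this matches an ID anywhere on the line, not only in the ID column, and it does not check the alleles.

```shell
# Hypothetical helper: keep VCF header lines plus records containing a listed ID.
# Assumes the ID list has one SNP ID per line and no blank lines.
grep_vcf_by_ids() {
    ids=$1
    vcf=$2
    # Header lines (starting with '#') pass through unchanged.
    grep '^#' "$vcf"
    # -F: fixed strings, -w: whole words only, -f: read patterns from a file.
    grep -v '^#' "$vcf" | grep -wFf "$ids"
}

# Usage: grep_vcf_by_ids ID.list full.vcf > filtered.vcf
```

With `-w`, an ID such as rs2 should not pick up rs22, which a plain substring match would.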


Hey, this works, but it takes a very long time. What I did instead was append the reference and alternate allele letters to the SNP ID column and then use VCFtools to make the selection.
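
For comparison, the ID/REF/ALT match can also be done without modifying the VCF. The sketch below uses awk and assumes the list is whitespace-separated with columns ID, REF, ALT in that order, and that it is not empty. The function name `filter_vcf_by_id_alleles` is hypothetical. ALT is compared as a whole string, so a multi-allelic record such as `G,T` only matches a list row that also says `G,T`.

```shell
# Hypothetical helper: keep header lines plus records whose ID, REF and ALT
# all match one row of the list (columns: ID, REF, ALT; whitespace-separated).
# Assumes the list file is not empty.
filter_vcf_by_id_alleles() {
    list=$1
    vcf=$2
    awk '
        # First file (the list): remember each ID/REF/ALT triple.
        NR == FNR { keep[$1, $2, $3]; next }
        # Second file (the VCF): always print header lines.
        /^#/ { print; next }
        # VCF columns 3, 4 and 5 are ID, REF and ALT; print on an exact match.
        ($3, $4, $5) in keep
    ' "$list" "$vcf"
}

# Usage: filter_vcf_by_id_alleles snps.txt full.vcf > filtered.vcf
```

The list is read once into memory and each VCF line is checked with a single lookup, so the run time should grow roughly with the size of the two files rather than with their product.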

8.1 years ago

GATK SelectVariants: https://software.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

Select IDs in fileKeep and exclude IDs in fileExclude:

 java -jar GenomeAnalysisTK.jar \
   -R ref.fasta \
   -T SelectVariants \
   --variant input.vcf \
   -o output.vcf \
   -IDs fileKeep \
   -excludeIDs fileExclude
