I need to subset/filter a SNP vcf file by a long list of non-sequential contig IDs, which appear in the CHR column. My VCF file contains 13,971 contigs currently, and I want to retain a specific set of 7,748 contigs and everything associated with those contigs (headers, all variants and genotype information etc.).
My contig list looks like:
dDocent_Contig_1
dDocent_Contig_100
dDocent_Contig_10000 etc.
I am considering the following script:
vcftools --vcf TotalRawSNPs.vcf --chr dDocent_Contig_1 --chr dDocent_Contig_100 (etc...) --recode --recode-INFO-all --out FinalRawSNPs
where I list every contig ID individually with a --chr flag before. For this --chr flag, I cannot feed it a text file of contig IDs to keep, which would be ideal. If I list all contigs individually, it'll create a massive script in the command line.
I’ve seen options for filtering by a list of individuals, but not any clear option for filtering by CHR/contig IDs only. Is there a more efficient way to filter my vcf file by CHR/contig?
The solution does not have to be strictly vcftools related. I'm open to exploring suitable awk/mawk/grep (etc.) options as well.
I have a question regarding this solution:
after concatenating all *.out into one file (FinalRawSNPs.vcf), is it really a vcf? Especially, the headers from *.out would be all over the lines. Maybe, one has to use bcftools merge like in this answer
Oh, that's a good point I didn't really think of! I think rather than catting you'd need to join them back together with some tool that understands the vcf format, like bcftools.
Thanks!
Hi, thank you both for the very helpful comments! This bash script worked out well. Since I was using vcf files, after creating the CHR output files, I bgzipped then indexed them all before using bcftools concat to combine all of the CHR files back together. Created some messy long text list of Contig IDs towards the top of the final concatenated VCF file, but all the information, headers and INFO flags were retained just fine. Thanks!