I have a VCF file with 23 chromosomes and other unwanted contigs. I want to extract chromosome 1 to chromosome 5 into a single VCF file, and I want to include the header lines as well. How can I do this in the most efficient way? Thanks
Note that this method is better than grep because it keeps the VCF header. However, it does not rewrite the header, so the unselected chromosomes will still have their ID lines, e.g. ##contig=<ID=chr1>. So don't rely on bcftools view -h subset.vcf to verify which chromosomes are left in your VCF file.
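Since the header can still list contigs that no longer have any records, it is safer to check the records themselves. A minimal sketch, assuming an uncompressed file named subset.vcf (the toy file below is made up and is created only so the snippet runs on its own):

```shell
# Toy subset.vcf: the header still lists chr7, but only chr1 and chr2 have records.
printf '##fileformat=VCFv4.2\n##contig=<ID=chr1>\n##contig=<ID=chr7>\n' > subset.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> subset.vcf
printf 'chr1\t100\t.\tA\tG\t.\t.\t.\n' >> subset.vcf
printf 'chr2\t200\t.\tC\tT\t.\t.\t.\n' >> subset.vcf

# List the chromosomes that have at least one record
# (column 1 of every non-header line, de-duplicated).
remaining=$(grep -v '^#' subset.vcf | cut -f1 | sort -u)
echo "$remaining"
```

If bcftools is at hand, bcftools query -f '%CHROM\n' subset.vcf | sort -u should give the same list.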
Keep in mind that the posted solution only works for single-digit chromosomes (chr1, chr2, chr3, ...), not for chr10 to chr22 or X. Using chr[1-22] will not work either, because a bracket expression matches a single character, so the double-digit names have to be spelled out separately. If you want all regular chromosomes (1-22 and X) but want to discard Un, random contigs and the like from a VCF, use:
grep -w '^#\|chr[1-9]\|chr[1-2][0-9]\|chr[X]' in.vcf
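To see why chr[1-22] falls short, here is a small sketch on a made-up list of chromosome names (names.txt is a hypothetical file, created only for this demonstration):

```shell
# Toy list of chromosome names, one per line.
printf 'chr1\nchr2\nchr3\nchr10\nchr22\nchrX\n' > names.txt

# [1-22] is a bracket expression: the range 1-2 plus the character 2,
# so it matches exactly one character that is either 1 or 2.
narrow=$(grep -w 'chr[1-22]' names.txt)

# Spelling out the single-digit, double-digit and X cases matches every name in the list.
wide=$(grep -w 'chr[1-9]\|chr[1-2][0-9]\|chr[X]' names.txt)

echo "chr[1-22] matches:"
echo "$narrow"
echo "full pattern matches:"
echo "$wide"
```

With -w, the first pattern only keeps chr1 and chr2, because in chr10 and chr22 the matched part is followed by another digit and so is not a whole word.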
In addition to the solutions already posted, you might try VCFtools:
http://vcftools.sourceforge.net/man_latest.html
The manual at this URL describes the following options:
SITE FILTERING OPTIONS
These options are used to include or exclude certain sites from any analysis being performed by the program.
POSITION FILTERING
--chr <chromosome>
--not-chr <chromosome>
Includes or excludes sites with identifiers matching <chromosome>. **These options may be used multiple times to include or exclude more than one chromosome.**
This will preserve the header, of course. In addition, the code posted above in the comments will also keep the header, because it selects lines containing # as well as chr[1-5] (the pattern includes an "or" that grabs lines starting with # or with chr1, chr2, chr3, etc.).
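As a rough sketch of what repeating --chr could look like for chromosomes 1 to 5: the file names in.vcf and subset are placeholders, and the loop assumes the chromosomes are named 1 to 5 (use chr$c instead of $c if your file has a chr prefix). The snippet only prints the command so it can be checked before running it.

```shell
# Build one --chr flag per wanted chromosome.
chr_flags=""
for c in 1 2 3 4 5; do
    chr_flags="$chr_flags --chr $c"
done

# --recode writes a new VCF (here subset.recode.vcf);
# --recode-INFO-all keeps the INFO fields in the output.
cmd="vcftools --vcf in.vcf${chr_flags} --recode --recode-INFO-all --out subset"
echo "$cmd"

# Once the printed command looks right, run it with:
# eval "$cmd"
```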
or if your chromosomes have a chr prefix:
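Better to extend the pattern with #CHROM so that the column-name line is retained. With grep -w, a bare # only matches when it is not followed by a letter, so the #CHROM line is dropped otherwise. If that line is missing, tools like VCFtools will complain.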
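A sketch of such an extended pattern, restricted to chr1 to chr5 as in the original question. The toy in.vcf is made up, and the ^ in front of chr is an extra assumption on my side so that text elsewhere on a line cannot trigger a match:

```shell
# Toy input: two meta lines, the column-name line, and records on
# chr1, chr5, chr12 and a random contig.
printf '##fileformat=VCFv4.2\n##contig=<ID=chr1>\n' > in.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> in.vcf
printf 'chr1\t100\t.\tA\tG\t.\t.\t.\n' >> in.vcf
printf 'chr5\t200\t.\tC\tT\t.\t.\t.\n' >> in.vcf
printf 'chr12\t300\t.\tG\tA\t.\t.\t.\n' >> in.vcf
printf 'chrUn_random\t400\t.\tT\tC\t.\t.\t.\n' >> in.vcf

# '^#' keeps the '##' meta lines, '^#CHROM' keeps the column names,
# and '^chr[1-5]' together with -w keeps chr1 to chr5 but not chr12.
grep -w '^#\|^#CHROM\|^chr[1-5]' in.vcf > subset.vcf
cat subset.vcf
```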
Hi, I am using this command line to extract chr1 to chr22, X and Y, but it only gets me chr1, chr2, X and Y. What is wrong?
EDIT: I just saw your solution below. So to grep all chromosomes from 1 to 22 plus X and Y, I should do it like this, right?
thanks
Thanks, how can I update the VCF header?
How to split vcf file by chromosome?
Thanks, but this only extracts one chromosome per file, right? I want chr1 to chr5 in one file.