How to extract specific chromosome from vcf file
3
15
Entering edit mode
8.4 years ago
MAPK ★ 2.1k

I have a vcf file with 23 chromsomes and other unwanted contigs. I want to extract a VCF file with chromsome 1 to chromsome 5 in one file. I want to include the header line as well. How can I do this in the most efficient way? Thanks

vcf • 60k views
ADD COMMENT
6
Entering edit mode
grep -w '^#\|^[1-5]' my.vcf > my_new.vcf

or if your chromosomes have a chr prefix:

grep -w '^#\|^chr[1-5]' my.vcf > my_new.vcf
ADD REPLY
4
Entering edit mode

Better extend the pattern string by #CHROM to retain the column names. If this is missing, tools like VCFtools will complain.

grep -w '^#\|^#CHROM\|^chr[1-5]' my.vcf > my_new.vcf
ADD REPLY
0
Entering edit mode

Hi, i am using this command line to extract chr 1 to 22, X and Y but it gets me only chr1, 2 X and Y. What is wrong?

grep -w '^#\|^#CHROM\|^chr[1-22,X,Y]' in.vcf > out.vcf

EDIT: I just saw your solution under. So to grep all chr from 1-22 X and Y, I should do like this right?

grep -w '^#\|^#CHROM\|chr[1-9]\|chr[1-2][0-9]\|chr[X]\|chr[Y]'

thanks

ADD REPLY
0
Entering edit mode

Thanks, how can I update the vcf header?

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

Thanks, but this only extracts per chromsome, right? I want chr1 to chr5 in one file.

ADD REPLY
15
Entering edit mode
6.7 years ago
mg ▴ 290

bcftools can be used, and this will preserve the header as well.

bcftools view input.vcf.gz --regions chr1

To extract mutiple chromosomes pass them as comma separated. eg. --regions chr1,chr5

ADD COMMENT
1
Entering edit mode

Note that this method is better than grep as it includes the VCF header. However, it won't change the header of the VCF file so the unselected chromosomes will still have their ID line, e.g ##contig=<id=chr1>. So don't rely on bcftools view -h subset.vcf to verify what chromosomes are left in your VCF file.

ADD REPLY
1
Entering edit mode

This worked well for me, too! For many chromosomes, do: -R, --regions-file <file> restrict to regions listed in a file

ADD REPLY
5
Entering edit mode
7.3 years ago
ATpoint 86k

Keep in mind that the posted solution only works for single-digit chromosomes, so chr1, chr2, chr3 (...), but not chr10-22 and X. Using chr[1-22] will also not work, as you have to specify to search for double digits. If you want all regular chromosomes, so 1-22 and X, but discard U, random contigs and stuff from a VCF, use:

grep -w '^#\|chr[1-9]\|chr[1-2][0-9]\|chr[X]' in.vcf
ADD COMMENT
0
Entering edit mode

Hi I want to separate only chr21 from vcf how to do it. I tried above commands its not generating. can any one help me. Thanks for your suggestions.

ADD REPLY
0
Entering edit mode

You could do it from compressed VCFs also: zgrep -w '^CHRNAME' compressed.vcf.gz > out.vcf and this can be done in parallel zcat compressed.vcf.gz | parallel --pipe --block 2M grep {1} ::: -w '^CHRNAME' > subsampled.vcf

ADD REPLY
4
Entering edit mode
8.4 years ago
LauferVA 4.5k

In addition to the solutions already posted, you might try VCF Tools:

http://vcftools.sourceforge.net/man_latest.html

At this URL note the following ability:

    SITE FILTERING OPTIONS
    These options are used to include or exclude certain sites from any analysis being performed by the program.

    POSITION FILTERING

    --chr <chromosome> 
    --not-chr <chromosome>

Includes or excludes sites with indentifiers matching <chromosome>. **These options may be used multiple times to include or exclude more than one chromosome.**

This will preserve the header of course. In addition, the code posted above in the comments will also get the header as it is getting lines with # as well as chr[1-5] (the statement includes an or that will grab lines starting with # or with chr1, chr2, chr3, etc.

ADD COMMENT
0
Entering edit mode

Can you do something like --chr 1-23 ?

ADD REPLY
0
Entering edit mode

This won't work, separating chromosome names by commas neither (though it works for bcftools view --regions 1,2,3). Do rather vcftools --gzvcf <file.vcf.gz> --chr 1 --chr 2 [ etc. until 23] --recode --out subset_chr1-23

ADD REPLY

Login before adding your answer.

Traffic: 1631 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6