Extract Sub-Set Of Regions From Vcf File
8
15
Entering edit mode
12.5 years ago
Rubal7 ▴ 850

Hello all,

Sorry if this has been answered before (though I have not found an appropriate answer already) What is an effective way to extract mutliple chromosomal regions from a vcf.gz file? I see that it can be done for single regions using tabix, but I have a file with several regions in a text file in the format:

chromosome:startposition-endposition eg: 19:47366525-47380539

Perhaps there is a way to refer to a file of regions using tabix, or vcf-tools? If anyone could provide an example command line that would be really helpful.

Thanks in advance!

vcf tabix genome filter • 94k views
ADD COMMENT
0
Entering edit mode
$ echo -e "chrN\t123\t456" | bedops -e 1 <(vcf2bed variants.bed) - > answer.bed
ADD REPLY
0
Entering edit mode

Use

tabix my.vcf.gz chromosome_name:start_position-end_position -h > out.vcf

ADD REPLY
26
Entering edit mode
12.5 years ago

I find tabix to be much easier / much sorter command line arguments.

You don't need a fasta for this method. Tabix is screaming fast.

bgzip your.vcf
tabix -p vcf your.vcf
tabix your.vcf.gz chr1:10,000,000-20,000,000
ADD COMMENT
1
Entering edit mode

None of these work anymore.

ADD REPLY
7
Entering edit mode
tabix -R region.txt my.vcf.gz

worked for me

ADD REPLY
1
Entering edit mode

Would anyone be able to let me know what the input is for the region.txt file? I have put 57646425-57803821 (all in one line). Linux is telling me could not parse bed line. Thank you for your help.

ADD REPLY
0
Entering edit mode

Have you tried to add the chromosome position in the beginning? eg: chr1:57646425-57803821

Another way is separate by tab. eg: chr1 57646425 57803821

ADD REPLY
0
Entering edit mode

Hello geocarvalho, thank you for the suggestion. I'll give that a try. Really appreciate your help. Julia

ADD REPLY
0
Entering edit mode

Thanks so much this worked :)

ADD REPLY
0
Entering edit mode

Thanks, I agree it works fast. Shame it doesn't take a file with a whole list of regions. But I can put it into a loop.

ADD REPLY
5
Entering edit mode

Two ways: 1) if there are many regions: tabix -fB my.vcf.gz reg.bed 2) if you only have a handful of regions: awk '{print $1":"($2+1)"-"$3}' reg.bed | xargs tabix my.vcf.gz.

ADD REPLY
8
Entering edit mode

For anyone arriving years later, the command appears to have changed. The command you probably want is: tabix -R reg.bed my.vcf.gz.

ADD REPLY
23
Entering edit mode
11.0 years ago
gourneau ▴ 260

For others if you like to use BED files, one way to do this would be like so:

vcftools --vcf input.vcf --bed bed_file_describing_the_range.bed --out output_prefix --recode --keep-INFO-all

Where bed_file_describing_the_range.bed has the positions of interest.

ADD COMMENT
0
Entering edit mode

I am trying to extract specific regions from a VCF file that has SNP information for 16 individuals. My bed file has three columns (scaffold name, start of region, end of region).

When I run vcftools on these files, it runs through the entire vcf file and determines that there are no regions from the bed file contained in the vcf file. A recode vcf file is output, but it only contains the header information from the vcf file.

ADD REPLY
0
Entering edit mode

Check to ensure your scaffold names match in the BED and VCF files. For example, if your BED file has "chr1" but your VCF file just has "1" for chromosome 1 it will return an empty file.

ADD REPLY
0
Entering edit mode

i met the same problem with u , so I am wondering how did you solved this problem?

ADD REPLY
1
Entering edit mode

Answering to help others you are going to need to use sed on the bed file ie sed 's/chr//g' {in_path} > {out_path}

ADD REPLY
9
Entering edit mode
7.2 years ago
geocarvalho ▴ 390

bedtools intersect worked better to me

intersectBed -a input.vcf -b /path/to/my.interval.bed -header > output.vcf
ADD COMMENT
0
Entering edit mode

For some reason by intersected out VCF have more line compare to input vcf. :( :( No idea what could be the possible reason for this !! Any guess ?

ADD REPLY
5
Entering edit mode
2.8 years ago
ashotmarg2004 ▴ 130

Another straightforward option would be with bcftools: https://samtools.github.io/bcftools/bcftools.html#view

As a general example like this: bcftools view -R yourRegionFile -o outFile yourVCFfile

Interestingly the region file "yourRegionFile" can be specified either on command line or in a VCF, BED, or tab-delimited file (the default). See the bcftools manual for regions file here: https://samtools.github.io/bcftools/bcftools.html#common_options

ADD COMMENT
0
Entering edit mode

Nowadays I use this option docker run -v $PWD:$PWD -w $PWD biocontainers/bcftools:v1.9-1-deb_cv1 bcftools view-R yourRegionFile -o outFile yourVCFfile

ADD REPLY
2
Entering edit mode
20 months ago
bcftools view -R regions.bed your.vcf > output.regions.vcf
ADD COMMENT
1
Entering edit mode
12.5 years ago
Johan ▴ 890

You could use the GATK SelectVariants Walker, specifying the -L option to set the intervals your are interested in.

Here is an example of how to do that, taken from the link above:

#Select a sample and restrict the output vcf to a set of intervals:     
java -Xmx2g -jar GenomeAnalysisTK.jar \
       -R ref.fasta \
       -T SelectVariants \
       --variant input.vcf \
       -o output.vcf \
       -L /path/to/my.interval_list \
       -sn SAMPLE_1_ACTG

Note that you have to have a fasta reference file with the same names and coordinates as you use in our region file and vcf, and that you can remove the -sn option if you want to extract the variations for all samples.

Hope this helps.

ADD COMMENT
0
Entering edit mode

thanks, I can potentially use this, but having some issues with the reference fasta, so a method that does not require a fasta reference would be preferred.

ADD REPLY
0
Entering edit mode

I'm not sure if it requires uncompressing the vcf files. I'm not that familiar with vcftools, so I couldn't say if there is an easier way to do it using that suite.

ADD REPLY
1
Entering edit mode
11.0 years ago

I did something like what you describe, but to create multiple VCF files. Uses bedtools:

ADD COMMENT

Login before adding your answer.

Traffic: 1360 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6