intersect VCF files
4
7
Entering edit mode
8.8 years ago
Bogdan ★ 1.4k

Dear all,

please would appreciate a quick recommendation about the best way to intersect 2 VCF files. many thanks,

bogdan

VCF • 26k views
ADD COMMENT
0
Entering edit mode

Do you intent to compare the VCFs for similarities? i.e. SNPs common to both?

ADD REPLY
0
Entering edit mode

thanks gentlemen !

ADD REPLY
16
Entering edit mode
8.8 years ago
Mitch Bekritsky ★ 1.3k

bcftools isec is very easy to use. With a sample command line like

bcftools isec -p isec_output -Oz 1.vcf.gz 2.vcf.gz

You can get four output files in gzipped VCFs

  1. isec_output/0000.vcf.gz would be variants unique to 1.vcf.gz
  2. isec_output/0001.vcf.gz would be variants unique to 2.vcf.gz
  3. isec_output/0002.vcf.gz would be variants shared by 1.vcf.gz and 2.vcf.gz as represented in 1.vcf.gz
  4. isec_output/0002.vcf.gz isec_output/0003.vcf.gz would be variants shared by 1.vcf.gz and 2.vcf.gz as represented in 2.vcf.gz

It's also faster than vcftools :)

ADD COMMENT
3
Entering edit mode

I also would like to add that in order to obtain a vcf file containing information for common SNPs for samples from file 1.vcf.gz and 2.vcf.gz the following command should be used:

bcftools merge --merge all 0002.vcf.gz 0003.vcf.gz > merged.vcf
ADD REPLY
0
Entering edit mode

For this final step (creating a VCF with the variants in both VCFs with the samples from both VCFs), is there any difference between using bcftools merge and GATK CombineVariants?

ADD REPLY
0
Entering edit mode

Unfortunately, I have not used GATK for this purpuses.

ADD REPLY
1
Entering edit mode

thanks Mitch !

ADD REPLY
0
Entering edit mode

This might be bit naive. Sorry. Why there is difference in intersection size in last output files (3 and 4). According to the commutative property, intersection size will be the same. Thanks

ADD REPLY
0
Entering edit mode

Like the description says,

3. isec_output/0002.vcf.gz would be variants shared by 1.vcf.gz and 2.vcf.gz as represented in 1.vcf.gz
4. isec_output/0002.vcf.gz would be variants shared by 1.vcf.gz and 2.vcf.gz as represented in 2.vcf.gz

The matching variant entries in each file may (and in all probability do) have different INFO, QUAL, ID fields etc. and thus the information content of the output would be different. The criteria for a "match" are matching CHROM, POS and optionally REF and ALT or ID (based on the -c). Commutative property does not apply to sets of complex data-types with partial attribute overlap.

EDIT: The fourth file should be isec_output/0003.vcf.gz. I've made this change in the answer as well.

ADD REPLY
0
Entering edit mode

If the matches are based on CHROM and POS columns, then it would ignore other fields which are not same. Right For example:

File 1
ID QUAL CHROM POS
2 52        7              14231

File 2
ID QUAL CHROM POS
3 56         7            14231

In the above example, though ID and QUAL values are different, it will be counted as one variant. Right?. So why there is difference in the number of variants. Sorry, can you help me to understand with an example here.

Thank you

ADD REPLY
0
Entering edit mode

Which ID and QUAL values do you pick for the output VCF record? That choice makes the difference.

ADD REPLY
0
Entering edit mode

Thank you. I could confirm the output by generating a statistics file with number of records, number of SNPs and indels in each file and it is the same in both vcf files

ADD REPLY
0
Entering edit mode

Of course, the number of records will match, that's what an intersect operation does.

I read the answer and comments more thoroughly and I see a possible source of confusion: both the third and fourth files are named isec_output/0002.vcf.gz - the fourth should be isec_output/0003.vcf.gz (3 not 2).

With that in mind, applying the operation to your example, 0002 would be:

ID QUAL CHROM POS
2 52        7              14231

with ID and QUAL from File 1 and 0003 would be:

ID QUAL CHROM POS
3 56         7            14231

with ID and QUAL from File 2.

ADD REPLY
10
Entering edit mode
8.8 years ago
Len Trigg ★ 1.6k

Unfortunately, vcftools, bcftools, and bedops all completely ignore the fact that in many cases there are multiple equivalent ways to express the same variant, so your input vcfs may look different but actually be saying the same thing about the underlying haplotypes. You can alleviate some of this by applying normalization, but you should cut to the chase by using a haplotype-aware vcf comparison tools such as rtg vcfeval, hap.py, or vgraph. I would recommend vcfeval as it is the most straight forward for using in an intersection type use-case (much like bcftools isec, you get separate output VCFs containing the unique and shared variants, preserving the input representations and annotations).

ADD COMMENT
1
Entering edit mode

You should mention that you work for RTG when recommending the software - however its open source and really awesome, so thank you! :)

ADD REPLY
0
Entering edit mode

I tried using rtg vcfeval and dear god is it a pain to install the first time! curl -O on a HPC gives me no way of unzipping the file, no matter what tool I use, and once I download to local Mac+unzip+scp, the SDF file for the actual intersect operation is either to be created from scratch or the pre-formatted databases download at the rate of 1 gig/4 hours. If this is the initial investment that goes into using this tool, why would I use it?

ADD REPLY
1
Entering edit mode

Sorry to hear you had difficulty. I had no idea that curl has problems with bog standard github release download links (wget handles these fine). Apparently you need to give curl the -L option, e.g:

$ curl -L -O https://github.com/RealTimeGenomics/rtg-tools/releases/download/3.7.1/rtg-tools-3.7.1-nojre.zip
$ unzip -q rtg-tools-3.7.1-nojre.zip
$ ./rtg-tools-3.7.1/rtg  # answer y or n when prompted

Regarding the SDF creation, it's trivial difficult to make your own, and is no harder than creating a BWA reference index or faidx index, e.g:

$ ./rtg-tools-3.7.1/rtg format -o hs37d5.sdf /path/to/hs37d5.fa.gz
ADD REPLY
0
Entering edit mode

Thank you - I will try rtg the next time I have to intersect VCF files. In the meantime, I'll try and prep the SDF file.

ADD REPLY
0
Entering edit mode

I installed rtg-tools 3.8.4 using bioconda. When I run vcfeval the output contains 4 VCFs:

fn.vcf.gz
fp.vcf.gz
tp.vcf.gz
tp-baseline.gz

Judging from the slides and the preprint, the fn (false negative) file contains calls that are in the baseline but not in the calls file, the fp (false positive) file contains calls that are in the calls but not in the baseline files and the tp (true positive) contains variants which are represented in both. Is this correct?

What then is the tp-baseline file?

And I see from the log that variants with alleles over 1000 bases are skipped. Is this cutoff limit configurable somehow?

BTW I'm using this to compare variants in a bacterial genome.

ADD REPLY
0
Entering edit mode

tp.vcf contains the called variants which are present in both.
tp-baseline.vcf contains the baseline variants which are present in both.

So the tp.vcf is equivalent to the tp-baseline.vcf, but the former is using the calls representation, and the latter is using the baseline representation. If you haven't already looked, there is more information in the user manual section for vcfeval (which unfortunately you don't get with bioconda).

You can adjust the maximum allele length with --Xmax-length=N. --X flags usually require caution when using. In the case of allele length, synchronization points (since you've read the preprint) only occur where there is no variant in play, so if you have very long alleles as well as many short alleles within the same span (in either vcf), computational complexity can blow out. It's probably less of a concern with bacterial variants, since you'll presumably either have haploid variants or be using --squash-ploidy.

ADD REPLY
5
Entering edit mode
8.8 years ago

You could use BEDOPS bedops for this. There are a couple approaches, depending on what you mean by intersect.

There is the set operation on elements, which is probably what most mean by "intersect":

$ bedops --element-of 1 <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > answer.bed

An element-of operation is not a symmetric result — in this case, the result contains elements of one.vcf that overlap elements in two.vcf. You would reverse the order of inputs to get the other result.

Alternatively, and perhaps less commonly, you could do an operation on genomic regions:

$ bedops --intersect <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > answer.bed

The by-region operation is symmetric; inputs can be entered in any order to get the same result.

If you want VCF elements "common" to or overlapping between both input files, you can do the following series of operations:

$ bedops --intersect <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > common-regions.bed
$ bedops --everything <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > all-elements.bed
$ bedops --element-of 1 all-elements.bed common-regions.bed > common-elements.bed
ADD COMMENT
4
Entering edit mode
8.8 years ago
venu 7.1k

Quickly recommended vcftools

ADD COMMENT
1
Entering edit mode

Just important to reiterate what Len mentioned below - vcftools, bcftools, and bedops all will NOT get the right answer in many cases! Even with careful normalization (for instance, using vt or vcfallelicprimitives), there will still be equivalent / matching variants that are not appropriately handled. As of today, the only tools that do intersections correctly are vgraph, vcfeval, and hap.py.

ADD REPLY

Login before adding your answer.

Traffic: 1649 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6