Best way to compare two samples in a VCF file
4
14
Entering edit mode
8.1 years ago

It might be a very stupid question for many of you but, since it's my first variant calling, I didn't figure it out yet.

I have mpileup'ped two bam files from two samples, then I filtered the results with vcfutils.pl and called the genotypes with bcftools call. Now I have a VCF file containing what I want, but I managed to analyze differences only between the two samples and the reference (which is an assembly coming from a line that is different from both samples).

I have the variants between my two samples and the assembly, but what if I want to detect the differences between the two samples? I did it with awk / sed / cut and other command line tools but is there maybe a better and more straight-forward way to do that? Perhaps using bcftools? Up to know, I didn't find it.

Any suggestion appreaciated!

EDIT: I did it with bcftools gtcheck as well, that works fine. I am asking if there are other reliable tools to test!

EDIT 2: As this post has many views now, I guess it will be useful for everyone to know that I used bcftools isec and it worked brilliantly.

VCF Variant Genotype Comparison SNP • 24k views
ADD COMMENT
10
Entering edit mode
8.1 years ago
Len Trigg ★ 1.6k

If you are looking for a good way of separating the variants into those that are shared between both samples and those that are unique to each sample, I would recommend using vcfeval from RTG Tools to do the comparison, as that will correctly deal with representation differences between the two samples (this may not be such an issue if both samples were called jointly or you are only concerned with simple SNPs). For example:

rtg vcfeval -t your_reference -b sample1.vcf.gz -c sample2.vcf.gz --squash-ploidy --sample sample1,sample2 -o sample-differences

If both of your samples are in a single VCF, just use the same VCF for both -b and -c. The --squash-ploidy option will ignore differences in zygosity in the genotypes and match based on shared alleles. In the output directory you will have variants unique to sample1 in sample-differences/fn.vcf.gz and those unique to sample2 in sample-differences/fp.vcf.gz, and you can get a summary of the composition of those variants using rtg vcfstats.

(disclaimer: I work for RTG, but vcfeval is free and awesome)

ADD COMMENT
10
Entering edit mode
8.1 years ago
novice ★ 1.1k

I was recently in a very similar situation. I wrote a bash script that finds the "similarity" between two VCF files. Someone might find it useful:

#!/usr/bin/env bash

sort -u <(grep -v '^#' $1 | cut -f1,2,4,5) > a
sort -u <(grep -v '^#' $2 | cut -f1,2,4,5) > b
comm -23 a b > a_only
comm -13 a b > b_only
comm -12 a b > ab

numer=`cat a_only b_only | wc -l`
denom=`cat ab a_only b_only | wc -l`
dist=`echo "$numer/$denom" | bc -l`
sim=`echo "(1-$dist)*100" | bc -l`

echo $sim

# CAREFUL
rm -f a_only b_only a b ab

If you put it in sim.sh then you can use it like so:

$ ./sim.sh sample_1.vcf sample_2.vcf
ADD COMMENT
7
Entering edit mode
8.1 years ago

depending on what you're looking for, like simple counts for instance, awk/perl solutions could be enough. if you need a deeper description, bcftools stats would give you some interesting details.

if your 2 samples are in a single vcf file I would suggest to split them first

bcftools view -Oz -c1 -s sample1 joined.vcf.gz > sample1.vcf.gz
bcftools view -Oz -c1 -s sample2 joined.vcf.gz > sample2.vcf.gz

and then compare them

bcftools stats sample1.vcf.gz sample2.vcf.gz > joined.stats.txt
ADD COMMENT
4
Entering edit mode
8.1 years ago

I'm a big fan of vt for this.

http://genome.sph.umich.edu/wiki/Vt#Peek

#summarizes the variants found in sample.vcf

 vt peek sample.vcf

Also bedtools is very flexible and can work with VCF files so I would use bedtools intersect for this task.

Others have mentioned database joins to be useful too: http://www.gettinggeneticsdone.com/2013/11/using-database-joins-to-compare-results.html

ADD COMMENT

Login before adding your answer.

Traffic: 2528 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6