Hi,
Ok, for example, you have two files:
1) isec.a.vcf
:
##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FILTER=<ID=q10,Description="Quality below 10">
##test=<xx=A,yy=B,zz=C>
##reference=file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##readme=AAAAAA
##readme=BBBBBB
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A
1 3062915 . GTTT G 1806 q10 DP=35;DP4=1,2,3,4;AN=2;AC=1 GT:GQ:DP:GL 0/1:409:35:-20,-5,-20
1 3062915 . G T 1806 q10 DP=35;DP4=1,2,3,4;AN=2;AC=1 GT:GQ:DP:GL 0/1:409:35:-20,-5,-20
1 3106154 . CAAA C 1792 PASS DP=32;AN=2;AC=1 GT:GQ:DP 0/1:245:32
1 3106154 . C T,CT 1792 PASS DP=32;AN=2;AC=1 GT:GQ:DP 0/1:245:32
1 3157410 . GA G 628 q10 DP=21;AN=2;AC=2 GT:GQ:DP 1/1:21:21
1 3162006 . GAA G 1016 PASS DP=22;AN=2;AC=1 GT:GQ:DP 0/1:212:22
1 3177144 . GT G 727 PASS DP=30;AN=2;AC=1 GT:GQ:DP 0/1:150:30
1 3184885 . TAAAA TA,T 246 PASS DP=10;AN=2;AC=1,1 GT:GQ:DP 1/2:12:10
2 3199812 . G GTT,GT 481 PASS DP=26;AN=2;AC=1,1 GT:GQ:DP 1/2:322:26
3 3212016 . CTT C,CT 565 PASS DP=26;AN=2;AC=1,1 GT:GQ:DP 1/2:91:26
4 3212016 . TACACACAC T 325 PASS DP=31;AN=2;AC=1 GT:GQ:DP 0/1:325:31
4 3258448 . TACACACAC T 325 PASS DP=31;AN=2;AC=1 GT:GQ:DP 0/1:325:31
2) isec.b.vcf
:
##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FILTER=<ID=q20,Description="Mapping quality below 20">
##contig=<ID=1,assembly=b37,length=249250621>
##contig=<ID=2,assembly=b37,length=243199373>
##contig=<ID=3,assembly=b37,length=198022430>
##contig=<ID=4,assembly=b37,length=191154276>
##reference=file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT B
1 3062915 . G A 376 q20 DP=14;DP4=1,2,3,4;AN=2;AC=1 GT:GQ:DP:GL 0/1:376:14:-10,0,-10
1 3062915 . GTTT GT 376 q20 DP=14;DP4=1,2,3,4;AN=2;AC=1 GT:GQ:DP:GL 0/1:376:14:-10,0,-10
1 3106154 . C T 677 PASS DP=15;AN=2;AC=1 GT:GQ:DP:GL 0/1:277:15:-10,0,-10
1 3106154 . CAAAA C 677 PASS DP=15;AN=2;AC=1 GT:GQ:DP:GL 0/1:277:15:-10,0,-10
1 3157410 . GA G 249 PASS DP=11;AN=2;AC=1 GT:GQ:DP 0/1:49:11
1 3162006 . GAA G 663 PASS DP=19;AN=2;AC=1 GT:GQ:DP 0/1:589:19
1 3177144 . GT G 460 PASS DP=24;AN=2;AC=1 GT:GQ:DP 0/1:236:24
1 3184885 . TAAA T 598 PASS DP=16;AN=2;AC=1 GT:GQ:DP 0/1:435:16
2 3188209 . GA G 162 . DP=15;AN=2;AC=1 GT:GQ:DP 0/1:162:15
3 3199812 . G GTT,GT 353 PASS DP=19;AN=2;AC=1,1 GT:GQ:DP 1/2:188:19
4 3212016 . CTT C 677 q20 DP=15;AN=2;AC=1 GT:GQ:DP 0/1:158:15
1. You know, that, first of all you need to bgzip them:
bgzip isec.a.vcf > isec.a.vcf.gz
bgzip isec.b.vcf > isec.a.vcf.gz
2. You should to build indexes for them:
bcftools index isec.a.vcf.gz > isec.a.vcf.gz.csi
bcftools index isec.b.vcf.gz > isec.b.vcf.gz.csi
3. Try to run isec:
bcftools isec isec.a.vcf.gz isec.b.vcf.gz -p dir
In your dir, you should see README and three output files. README tells you a little about these three files. Let's look at outputs for more details:
So, I got 0000.vcf
, 0001.vcf
and 0002.vcf
files:
So, if you look at output files, you'd see that the first file contains variants that unique for the first input vcf file (isec.a.vcf
). The second output contains variants unique for the second input file (isec.b.vcf
). The third file contains variants intersected for both first and second input files (isec.a.vcf
and isec.b.vcf
).
Hope, it helps. Sorry for the long text ;)
Good! It's now really clear! Thanks you very much!
Hi all, what if we have more than two vcf files to compare, say 10 vcf files ?
I appreciate your help, thank you.
For intersecting more than two vcf files, you can use -n parameter (https://samtools.github.io/bcftools/bcftools.html#isec):