How bcftools isec works ?
2
3
Entering edit mode
9.7 years ago
sacha ★ 2.4k

I don't understand pretty well how bcftools isec works. When I try with this command , to make the intersection between two files

bcftools isec -d output/ A.vcf.gz B.vcf.gz 

I get 3 files... instead of one.

I think I have to use -n= parameter... but it's not really clear.. Could you help me to understand with a simple example like :

A.vcf 
A
B
C
B.vcf 
B
D
E

Intersect should have: B

vcf variant intersection bcftools • 32k views
ADD COMMENT
0
Entering edit mode

Good! It's now really clear! Thanks you very much!

ADD REPLY
0
Entering edit mode

Hi all, what if we have more than two vcf files to compare, say 10 vcf files ?

I appreciate your help, thank you.

ADD REPLY
0
Entering edit mode

For intersecting more than two vcf files, you can use -n parameter (https://samtools.github.io/bcftools/bcftools.html#isec):

bcftools isec -n=3 A.vcf.gz B.vcf.gz C.vcf.gz
ADD REPLY
27
Entering edit mode
9.7 years ago

Hi,

Ok, for example, you have two files:

1) isec.a.vcf:

##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FILTER=<ID=q10,Description="Quality below 10">
##test=<xx=A,yy=B,zz=C>
##reference=file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##readme=AAAAAA
##readme=BBBBBB
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    A
1    3062915    .    GTTT    G    1806    q10    DP=35;DP4=1,2,3,4;AN=2;AC=1    GT:GQ:DP:GL    0/1:409:35:-20,-5,-20
1    3062915    .    G    T    1806    q10    DP=35;DP4=1,2,3,4;AN=2;AC=1    GT:GQ:DP:GL    0/1:409:35:-20,-5,-20
1    3106154    .    CAAA    C    1792    PASS    DP=32;AN=2;AC=1    GT:GQ:DP    0/1:245:32
1    3106154    .    C    T,CT    1792    PASS    DP=32;AN=2;AC=1    GT:GQ:DP    0/1:245:32
1    3157410    .    GA    G    628    q10    DP=21;AN=2;AC=2    GT:GQ:DP    1/1:21:21
1    3162006    .    GAA    G    1016    PASS    DP=22;AN=2;AC=1    GT:GQ:DP    0/1:212:22
1    3177144    .    GT    G    727    PASS    DP=30;AN=2;AC=1    GT:GQ:DP    0/1:150:30
1    3184885    .    TAAAA    TA,T    246    PASS    DP=10;AN=2;AC=1,1    GT:GQ:DP    1/2:12:10
2    3199812    .    G    GTT,GT    481    PASS    DP=26;AN=2;AC=1,1    GT:GQ:DP    1/2:322:26
3    3212016    .    CTT    C,CT    565    PASS    DP=26;AN=2;AC=1,1    GT:GQ:DP    1/2:91:26
4    3212016    .    TACACACAC    T    325    PASS    DP=31;AN=2;AC=1    GT:GQ:DP    0/1:325:31
4    3258448    .    TACACACAC    T    325    PASS    DP=31;AN=2;AC=1    GT:GQ:DP    0/1:325:31

2) isec.b.vcf:

##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FILTER=<ID=q20,Description="Mapping quality below 20">
##contig=<ID=1,assembly=b37,length=249250621>
##contig=<ID=2,assembly=b37,length=243199373>
##contig=<ID=3,assembly=b37,length=198022430>
##contig=<ID=4,assembly=b37,length=191154276>
##reference=file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    B
1    3062915    .    G    A    376    q20    DP=14;DP4=1,2,3,4;AN=2;AC=1    GT:GQ:DP:GL    0/1:376:14:-10,0,-10
1    3062915    .    GTTT    GT    376    q20    DP=14;DP4=1,2,3,4;AN=2;AC=1    GT:GQ:DP:GL    0/1:376:14:-10,0,-10
1    3106154    .    C    T    677    PASS    DP=15;AN=2;AC=1    GT:GQ:DP:GL    0/1:277:15:-10,0,-10
1    3106154    .    CAAAA    C    677    PASS    DP=15;AN=2;AC=1    GT:GQ:DP:GL    0/1:277:15:-10,0,-10
1    3157410    .    GA    G    249    PASS    DP=11;AN=2;AC=1    GT:GQ:DP    0/1:49:11
1    3162006    .    GAA    G    663    PASS    DP=19;AN=2;AC=1    GT:GQ:DP    0/1:589:19
1    3177144    .    GT    G    460    PASS    DP=24;AN=2;AC=1    GT:GQ:DP    0/1:236:24
1    3184885    .    TAAA    T    598    PASS    DP=16;AN=2;AC=1    GT:GQ:DP    0/1:435:16
2    3188209    .    GA    G    162    .    DP=15;AN=2;AC=1    GT:GQ:DP    0/1:162:15
3    3199812    .    G    GTT,GT    353    PASS    DP=19;AN=2;AC=1,1    GT:GQ:DP    1/2:188:19
4    3212016    .    CTT    C    677    q20    DP=15;AN=2;AC=1    GT:GQ:DP    0/1:158:15

1. You know, that, first of all you need to bgzip them:

bgzip isec.a.vcf > isec.a.vcf.gz
bgzip isec.b.vcf > isec.a.vcf.gz

2. You should to build indexes for them:

bcftools index isec.a.vcf.gz > isec.a.vcf.gz.csi
bcftools index isec.b.vcf.gz > isec.b.vcf.gz.csi

3. Try to run isec:

bcftools isec isec.a.vcf.gz isec.b.vcf.gz -p dir

In your dir, you should see README and three output files. README tells you a little about these three files. Let's look at outputs for more details:

So, I got 0000.vcf, 0001.vcf and 0002.vcf files:

So, if you look at output files, you'd see that the first file contains variants that unique for the first input vcf file (isec.a.vcf). The second output contains variants unique for the second input file (isec.b.vcf). The third file contains variants intersected for both first and second input files (isec.a.vcf and isec.b.vcf).

Hope, it helps. Sorry for the long text ;)

ADD COMMENT
0
Entering edit mode

Please note that bcftools index does not write to STDOUT, so you don't need the > filename.vcf.gz.csi part. One can simply run bcftools index vcf_file.vcf.gz.

ADD REPLY
5
Entering edit mode
9.7 years ago
iraun 6.2k

I've never use bcftools isec for intersecting two or more vcf files but, have you tried something like this?

bcftools isec -p dir -n=2 A.vcf.gz B.vcf.gz

Using -n=2 argument, in principle, the output will be the common variants for the two input files. I guess you have already read this documentation about bcftools, but just in case that is the link.

If you are not able to get your desired output, you can use vcf-isec (I usually work with this one). It is quite similar to bcftools. In this case the command should be:

vcf-isec -n =2 A.vcf.gz B.vcf.gz | bgzip -c > out.vcf.gz

Hope it helps.

ADD COMMENT

Login before adding your answer.

Traffic: 1924 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6