bcftools-merge for vcf files
2
0
Entering edit mode
5.3 years ago
evelyn ▴ 230

Hello,

I have three vcf files created with bwa and bcftools. I have merged these three files using bcftools-merge command as following:

bcftools merge file1.vcf.gz fle2.vcf.gz file3.vcf.gz > out.vcf

And a snippet for file out.vcf looks like this:

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT      file1.sorted.bam    file2.sorted.bam    file3.sorted.bam
    1   687 .   G   A   42.4147 .   VDB=0.0298006;SGB=-0.556411;MQ0F=0.5;MQ=19;DP=4;DP4=0,0,0,4;AN=2;AC=2   GT:PL   ./.:.   1/1:72,12,0 ./.:.
    1   689 .   G   A   42.4147 .   VDB=0.0298006;SGB=-0.556411;MQ0F=0.5;MQ=19;DP=4;DP4=0,0,0,4;AN=2;AC=2   GT:PL   ./.:.   1/1:72,12,0 ./.:.
    1   701 .   T   A   42.4147 .   VDB=0.0221621;SGB=-0.556411;MQ0F=0.5;MQ=19;DP=4;DP4=0,0,0,4;AN=2;AC=2   GT:PL   ./.:.   1/1:72,12,0 ./.:.
    1   704 .   T   G   42.4147 .   VDB=0.0190094;SGB=-0.556411;MQ0F=0.5;MQ=19;DP=4;DP4=0,0,0,4;AN=2;AC=2   GT:PL   ./.:.   1/1:72,12,0 ./.:.
    1   708 .   C   T,A 20.4535 .   SGB=-0.379885;MQ0F=0;MQ=50;DP=2;DP4=0,0,0,2;AN=4;AC=2,2 GT:PL   1/1:50,3,0,.,.,.    ./.:.   2/2:40,.,.,3,.,0

Here, for example, for position 708, there are two ALT bases (T, A) which means at this position for reference base C, file1.sorted.bam has an ALT base, T; file2.sorted.bam has the same base as a reference that's why it has ./.:. and file3.sorted.bam has an ALT base, A.

So, using bcftools I want to make a file which looks something like this where the individual file columns have their assigned bases (both missing as reference base and an ATL base) so that I can use this vcf for further analysis:

#CHROM  ID      REF ALT     file1.sorted.bam    file2.sorted.bam    file3.sorted.bam
1       687     G   A           G   A   G
1       689     G   A           G   A   G
1       701     T   A           T   A   T
1       704     T   G           T   G   T
1       708     C   T,A         T   C   A

I have looked into other posts where GATK is used for such jobs but it also requires to use GATK to call the variants. For this analysis, I have to use bcftools for calling variants but I can use other software to merge the vcf files to get the required result.

Thank you so much for your help!

snp bcftools • 31k views
ADD COMMENT
0
Entering edit mode

So, using bcftools I want to make a vcf which looks something like this where the individual file columns have their assigned bases (both missing as reference base and an ATL base) so that I can use this vcf for further analysis:

That's not a VCF anymore, and I would strongly recommend to stick to commonly used file formats. That said, if I would have to solve your issue I would use python and the cyvcf2 module.

ADD REPLY
0
Entering edit mode

That was my mistake to say I want a vcf file. You are right it won't be a vcf anymore. I have edited my question. Thank you for the suggestion. I have never used python or the cyvcf2 module. I will try them. Thank you!

ADD REPLY
3
Entering edit mode
5.3 years ago

This will give you not the exact output as requested, but with another python/awk/... script you can do it quite easily:

bcftools +missing2ref out.vcf.vcf | bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%TGT]\n'
1   687 .   G   A   G/G A/A G/G
1   689 .   G   A   G/G A/A G/G
1   701 .   T   A   T/T A/A T/T
1   704 .   T   G   T/T G/G T/T
1   708 .   C   T,A T/T C/C A/A

How would heterozygous position will look like in your desired output?

ADD COMMENT
0
Entering edit mode

Hello @finswimmer,

Thank you for your suggestion! I just found another post which says ./.:. means a missing observation and 0/0 means the same as the reference base. So I think I need the output something like this by replacing ./.:. with - and a 0/0 replaced by reference base :

 #CHROM  ID      REF ALT     file1.sorted.bam    file2.sorted.bam    file3.sorted.bam
    1       687     G   A           -   A   -
    1       689     G   A           -   A   -
    1       701     T   A           -   A   -
    1       704     T   G           -   G   -
    1       708     C   T,A         T   -   A

Thank you for pointing out heterozygous positions. I have pasted an example from the vcf file here:

   1    6440    .   G   T   21.279  . VDB=0.0264014;SGB=-0.511536;RPB=1;MQB=1;MQSB=1;BQB=1;MQ0F=0.25;MQ=15;ICB=1;HOB=0.5;DP=7;DP4=0,2,5,0;AN=4;AC=3 GT:PL   1/1:48,5,0  0/1:35,0,19 ./.:.

Here, for file2, the call is 0/1 and I was thinking to replace heterozygous calls with an N. I did not add this in my question earlier. But is there a way to address this along with homozygous positions?

Thank you for your help!

ADD REPLY
1
Entering edit mode

just found another post which says ./.:. means a missing observation and 0/0 means the same as the reference base.

That is correct. So if you don't want to replace ./. with the reference, then you can omit the bcftools +missing2ref.

Try this to get your desired output:

input.vcf looks like this:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.9+htslib-1.9
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1,length=248956422>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  file1.sorted.bam    file2.sorted.bam    file3.sorted.bam
1   687 .   G   A   42.4147 .   .   GT  ./. 1/1 ./.
1   689 .   G   A   42.4147 .   .   GT  ./. 1/1 ./.
1   701 .   T   A   42.4147 .   .   GT  ./. 1/1 ./.
1   704 .   T   G   42.4147 .   .   GT  ./. 1/1 ./.
1   708 .   C   T,A 20.4535 .   .   GT  1/1 ./. 2/2
1   6440    .   G   T   21.279  .   .   GT  1/1 0/1 ./.

The run:

$ bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT[\t%TGT]\n' merge.vcf | awk -v FS="\t" -v OFS="\t" '{for(i=6;i<=NF;i++) {split($i, gt, "/"); if(gt[1]==".") $i="-"; else if(gt[1]==gt[2]) $i=gt[1]; else $i="N";} print }'
1   687 .   G   A   -   A   -
1   689 .   G   A   -   A   -
1   701 .   T   A   -   A   -
1   704 .   T   G   -   G   -
1   708 .   C   T,A T   -   A
1   6440    .   G   T   T   N   -

The bcftools query give as the chrom, pos, id, ref, alt and the translated genotypes.

The awk script iterates over the genotypes and split them by /. If value 1 is equal to . it replaces the column with -. If value 1 is equal to value 2 it replaces the column with value 1. In all other cases it replaces the column with N.

ADD REPLY
0
Entering edit mode

Thank you so much for your help! It works great. I understood the code as well. However, I am trying to keep the vcf file names to use this file for further analysis. I used -H option from bcftools query and it gives this header:

# [1]CHROM  [2]POS  [3]ID   [4]REF  [5]ALT  N   N   N

It replaced the vcf file names with N. Thank you for your time!

ADD REPLY
0
Entering edit mode

See my answer here.

ADD REPLY

Login before adding your answer.

Traffic: 1778 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6