Generation of incorrect heterozygous calls after left normalization using bcftools
5.2 years ago
nkausthu ▴ 30

I have the following records in one of my gVCF files:

1       3753032 .       GTTTT   G,GT,GTTT,GTTTTT,GTTTTTT,GTTTTTTTT,<NON_REF>

1       10502954        .       CTTTTT  C,CT,CTTT,CTTTT,CTTTTTT,<NON_REF>

1       11272829        .       T       <NON_REF>

1       11272839        .       G       <NON_REF>

1       15978128        .       T       <NON_REF>

1       15978129        .       T       <NON_REF>

1       38332078        .       T    TCA,TCTCA,TCACACACACACACACACA,TC,<NON_REF>

1       67725648        .       GAAAA   G,GAA,GAAA,GAAAAA,GAAAAAA,<NON_REF>

1       72748277        .       ATT     A,AT,ATTT,ATTTT,ATTTTT,ATTTTTT,<NON_REF>

1       150782110       .       CAAAAA  C,CA,CAA,CAAA,CAAAA,<NON_REF>

1       155724315       .       GTT     G,GT,TTT,GTTTTT,GTTTTTT,<NON_REF>

1       158058266       .       CTTTTTT C,CT,CTT,CTTT,CTTTT,CTTTTT,<NON_REF>

1       201082902       .       C       CAA,CAAA,<NON_REF>

1       212618993       .       A       C,<NON_REF>

1       237955682       .       C       CGTGT,CGTGTGT,<NON_REF>

2       27532239        .       CAAA    C,CA,CAA,CAAAAAAAAAAAAAAAAA,<NON_REF>

2       47641559        .       TAAAAAA T,TA,TAA,TAAA,TAAAA,TAAAAA,<NON_REF>

2       100058714       .       CAA     C,CA,CAAA,CAAAA,CAAAAAAA,CAAAAAAAA,<NON_REF>

2       113303450       .       T       <NON_REF>

2       113303451       .       G       <NON_REF>

2       207998878       .       AT      A,ATT,ATTT,ATTTT,ATTTTT,ATTTTTT,<NON_REF>

2       231333532       .       CAAA    C,CAA,CAAAAAAA,<NON_REF>

3       42734487        .       G       <NON_REF>

3       42734750        .       C       A,<NON_REF>

3       42734751        .       C       <NON_REF>

3       47484723        .       TACACACAC       T,TAC,<NON_REF>

After joint genotyping, we ran left normalization with bcftools, and many false heterozygous calls were generated with no reads supporting the alternate allele, for example:

0/1:14,0:42:72:149,0,164

0/1:10,0:35:17:88,0,111

I guess this is due to incorrect splitting of multi-allelic sites. It would be great if anyone could suggest a way to remove these variants from the downstream VCF file.

bcftools left normalization multiallelic sites • 1.3k views

Hello,

Please provide an example dataset that can be used directly for testing.

Thanks!

fin swimmer


I can provide the VCF file after left normalization. Is that sufficient?


Hello,

That's better than nothing, but the input VCF would be more useful. Reduce it to a few example lines that show your problem.

5.2 years ago

Hey,

With multi-allelic sites, I think it is better to do this as a two-step process, and you must also use the reference genome against which the variants were originally called.

So, something like this:

# 1st pipe, splits multi-allelic calls into separate variant calls
# 2nd pipe, left-aligns indels and issues warnings when the REF base in your VCF does not match the base in the supplied FASTA reference genome
bcftools norm -m-any myvariants.vcf | \
  bcftools norm -Ob --check-ref w -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta \
  > myvariants_norm.bcf ;
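
If heterozygous calls with no ALT-supporting reads are still present after that, one option is to mask them rather than drop the whole record. The following is only a rough sketch, not a tested pipeline step: the function name mask_unsupported_hets is hypothetical, and it assumes an uncompressed, already split (biallelic) VCF whose FORMAT column contains GT and AD. The sample fields in the question look like GT:AD:DP:GQ:PL, but that is an assumption. It sets a het genotype to ./. whenever the ALT depth in AD is 0 and leaves everything else untouched.

```shell
# Hypothetical helper (not part of bcftools): mask het genotypes whose
# ALT allele depth is 0. Reads a biallelic VCF on stdin, writes to stdout.
# Assumption: FORMAT contains GT and AD (e.g. GT:AD:DP:GQ:PL).
mask_unsupported_hets() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^#/ { print; next }              # header lines pass through unchanged
    {
      # Find the positions of GT and AD in the FORMAT column (column 9)
      n = split($9, fmt, ":")
      gt = 0; ad = 0
      for (i = 1; i <= n; i++) {
        if (fmt[i] == "GT") gt = i
        if (fmt[i] == "AD") ad = i
      }
      if (gt && ad) {
        for (s = 10; s <= NF; s++) {
          m = split($s, val, ":")
          if (m < ad) continue        # this sample has no AD value
          nd = split(val[ad], depth, ",")
          g = val[gt]
          is_het = (g == "0/1" || g == "0|1" || g == "1/0" || g == "1|0")
          if (is_het && nd >= 2 && depth[2] == 0) {
            val[gt] = "./."           # mask the unsupported het call
            out = val[1]
            for (i = 2; i <= m; i++) out = out ":" val[i]
            $s = out
          }
        }
      }
      print
    }'
}
```

It could be used on the output of the command above along the lines of `bcftools view myvariants_norm.bcf | mask_unsupported_hets > myvariants_masked.vcf` (the output file name is just an example). Masking the genotype keeps the site for samples that do carry the allele; whether masking or removing the record is more appropriate depends on your downstream analysis.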

Kevin
