Hi,
Using GATK Mutect2, we collected several variants in EGFR region.
However, one variants NM_005228.5(EGFR):c.2239_2248delinsC (p.Leu747_Ala750delinsPro) were called as two separated ones. The two variants can not combined even we set the -max-mnp-distance to 20.
7 55242465 . GGAATTAAGA G . .
AS_SB_TABLE=5567,5732|44,36;DP=11794;ECNT=2;MBQ=20,20;MFRL=188,168;MMQ=60,60;MPOS=39;NALOD=3.85;NLOD=1755.82;POPAF=4.52;TLOD=256.20 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB
0|1:2957,80:0.027:3037:1304,28:1306,51:0|1:55242465_GGAATTAAGA_G:55242465:1402,1555,44,36 0|0:8342,0:1.710e-04:8342:3598,0:4195,0:0|1:55242465_GGAATTAAGA_G:55242465:4165,4177,0,07 55242478 . G C . .
AS_SB_TABLE=5360,5529|44,36;DP=11030;ECNT=2;MBQ=20,20;MFRL=189,168;MMQ=60,60;MPOS=43;NALOD=3.93;NLOD=1686.18;POPAF=4.52;TLOD=256.12 GT:AD:AF:DP:F1R2:F2R1:PGT:PID:PS:SB
0|1:2960,80:0.027:3040:1450,28:1427,50:0|1:55242465_GGAATTAAGA_G:55242465:1403,1557,44,36 0|0:7929,0:1.846e-04:7929:3522,0:4086,0:0|1:55242465_GGAATTAAGA_G:55242465:3957,3972,0,0
It will be glad if anyone could help.
Best, Junfeng
See if you can normalize the variants when passing it through
but note that the variant is quite complex (10 deletions followed by a single insertion) and it could have different valid representations
I'd be curious to know whether
bcftools norm
would be able to resolve the issueHi Istvan,
Thanks for help.
bcftools norm did not work. Any other suggestions?
this is a tricky task because it basically asks one to merge neighboring variants that are technically correct even when listed separately. The challenge is not just merging the variants but updating all the other fields.
A quick search on "how to join neighboring variants in a VCF file" gave me this code:
I've asked the ChatGPT, as well it gave me the following:
Merging neighboring mutations in a VCF (Variant Call Format) file can be challenging because this format is designed to present each variant (mutation) individually. However, there are some ways to handle this task using Bioinformatics tools and scripting languages such as Python or R.
Here's a simplified Python script using the pysam library which can be used to merge consecutive single nucleotide polymorphisms (SNPs) from a VCF file. This script simply merges SNPs that are directly next to each other, and it might not cover all possible scenarios or more complex mutations.
This script reads a VCF file and writes the merged variants to stdout. You should redirect the output to a file when running it. Note that you might need to install the pysam package first. You can do this with pip install pysam.
Please note that merging mutations is a complex topic and can have various interpretations depending on the biological or clinical context. This script only provides a very simplistic approach to the problem. If you have complex cases, such as multiple alternative alleles, insertions, deletions, or complex rearrangements, you might want to use more advanced tools designed for this purpose, like bcftools.
Moreover, it's essential to note that merging variants arbitrarily can lead to misinformation or loss of important details, especially if you are dealing with multiallelic sites or complex events. It is always a good idea to discuss with a team or supervisor what kind of merging is biologically appropriate for your specific case.
Hi Istavan,
code from gpt is not correct, since it did not consider distance between the adjacent SNVs, and whether those sequences were matched with reference or not.
Another online code FindAdjacentSNPs.py neglect INDELs based on some reason.
Anyway, it inspired a bit to rewrite the code that can support INDELS.
Best, Junfeng
yes,this is a tricky subject, writing a generic tool for this purpose might be a publishable effort