6.8 years ago
Titus
Hi all,
I'm looking for a tool/method to get the "same VCF format" so that I can load different VCF files into my variant database (I use chromosome/reference/variant as primary keys). My problem concerns the same variant being called in 2 different ways in 2 different sample files. For example:
For the first file I have:
chr3 178952506 . GGT G,GTT .
For the second file I got:
chr3 178952507 . G T .
So how can I transform the first one to fit the second one in terms of position and ref/alt?
Best
Hi, though I am not sure the two variants above are the same, there are a couple of tools to prevent such confusion: 1) GATK Left Align and Trim Variants, 2) VT.
Both aim to arrive at a normalized representation of the variant.
Fig.1 of the VT publication is a good representation of the confusion.
Both of the above tools would additionally split multiallelic variants (like the 1st one you have noted) into biallelic representations.
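To make the splitting and trimming step concrete, here is a minimal Python sketch applied to the record from the question. It is only an illustration and not how GATK or VT are implemented: the function names `trim_alleles` and `split_and_trim` are hypothetical, and the sketch only trims shared bases. It does not left-align indels, which requires the reference sequence.

```python
def trim_alleles(pos, ref, alt):
    """Trim shared bases so REF/ALT are as short as possible (no left-alignment)."""
    # Drop shared trailing bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop shared leading bases, shifting the position to match.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def split_and_trim(chrom, pos, ref, alts):
    """Split a multiallelic record into one trimmed biallelic key per ALT."""
    return [(chrom, *trim_alleles(pos, ref, alt)) for alt in alts.split(",")]


# Record from the first file in the question: chr3 178952506 . GGT G,GTT
for key in split_and_trim("chr3", 178952506, "GGT", "G,GTT"):
    print(key)
```

With the record from the first file, the two keys come out as chr3 178952506 GGT>G and chr3 178952507 G>T, so the second allele ends up with the same position, ref and alt as the record in the second file.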
Thanks for the suggestions. I had already left-aligned it, and this is the result afterwards:
https://ibb.co/iyNqdm
As you can see in IGV, I'm a bit confused: the deletion seems to be covered by forward and reverse reads, and the G>T only by forward reads ... I will check the uniqueness of the sequence in the genome (I think there are a few pseudogenes). Other ideas are welcome :)
And I still don't know how to present this kind of information to my users ...
So nobody has had this kind of issue, even after left alignment?
Hi, a general remark, not sure if it will be useful. Is this targeted/amplicon sequencing data? I have often observed higher noise in such data. I then remove reads that do not have both mates mapped, and also those whose mates map to different chromosomes. Sometimes pseudogenes lead to cross-mapping and false variant calls. Ensuring that both reads map to the same chromosome reduces that scenario.
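The read filter described above could be sketched roughly as follows in Python, working directly on SAM text lines. This is an assumption-laden illustration, not the exact procedure used: `keep_read` is a hypothetical helper, the example records are made up, and the check only makes sense for paired-end data (it would reject every read in a single-end run).

```python
# SAM flag bits, as defined in the SAM specification.
PAIRED, UNMAPPED, MATE_UNMAPPED = 0x1, 0x4, 0x8


def keep_read(sam_line):
    """Keep a read only if both mates are mapped to the same reference."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, rname, rnext = int(fields[1]), fields[2], fields[6]
    if not flag & PAIRED:
        return False  # not part of a pair
    if flag & (UNMAPPED | MATE_UNMAPPED):
        return False  # the read or its mate is unmapped
    # RNEXT is "=" when the mate maps to the same reference as the read.
    return rnext == "=" or rnext == rname


# Made-up records (QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL).
records = [
    "r1\t99\tchr3\t178952480\t60\t50M\t=\t178952600\t170\t*\t*",
    "r2\t73\tchr3\t178952480\t60\t50M\t=\t178952480\t0\t*\t*",
    "r3\t65\tchr3\t178952480\t60\t50M\tchr22\t17052900\t0\t*\t*",
]
kept = [line.split("\t")[0] for line in records if keep_read(line)]
print(kept)
```

Only `r1` passes here: `r2` has an unmapped mate and `r3` has its mate on a different chromosome.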
Yes, it is targeted/amplicon sequencing data. In that case I totally agree; the thing is, I work with single-end data (Ion Torrent Proton/PGM). So in the end, do you exclude all positions like that:
I checked the VEP annotation, and it's wrong if you consider the SNV G to T correct ...
How can you say that those are the same variants? The position is different! Also, your final aim is a bit unclear:
Which SNP database? What do you mean by "the same VCF format" files?
I edited my post to specify that it's my own variant database, populated with the variants called in all my samples. I also put "same VCF format" in quotes.
OK, so in the first part of the example there are 2 variants on the same line: the first variant is a deletion of GT, and the second is an SNV G>T at position 178952507, which corresponds to the second file.
Are you calling these variants yourself with mpileup/bcftools call?