Hello,
I'm trying to annotate variations in NGS data from bacterial artificial chromosomes with respect to the reference sequence.
To do this, I build a map of the BAC (including the vector) and map the NGS reads to this BAC map. I also use a variant caller to find any differences from my reference map. These are then curated by hand until all variations have been called correctly.
What I would like to do is check these variants against known SNP IDs.
To do this, I downloaded dbSNP153 in bigBed format and used the genomic positions of my reference DNA to extract all variants for this region from the database using bigBedToBed. For SNVs the positions are unambiguous, so I can assign rsxxxxx IDs by matching positions with custom scripts (I'm using Python, but that should not matter much).
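A minimal sketch of that position-matching step, in case it helps to see what I mean. It only relies on the first four BED columns (chrom, start, end, name) and assumes my variant positions have already been converted to 1-based genomic coordinates; the function names are illustrative, not my actual script.

```python
def index_snvs(bed_lines):
    """Index single-base BED records by (chrom, 1-based position)."""
    index = {}
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
        if end - start != 1:
            # Not a single base (e.g. an indel record); handled separately.
            continue
        # BED starts are 0-based, so add 1 to get the 1-based position.
        index.setdefault((chrom, start + 1), []).append(name)
    return index


def annotate_snvs(variants, index):
    """variants: iterable of (chrom, 1-based pos). Returns variant -> list of rsIDs."""
    return {variant: index.get(variant, []) for variant in variants}
```

This works as long as one position corresponds to exactly one base, which is why it breaks down for the cases below.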
BUT: for del/ins and variations at homopolymers it is more difficult, since the position can be ambiguous and I might therefore miss the match. Example: an additional T in GATTTTACG could be either GA-T-TTTTACG or GATTTT-T-ACG, which seems to depend on the strand of the reference, as variant callers tend to place such variants at the 5' end. If, however, my reference is on the minus strand relative to the genome reference (hg19 or hg38) (because it was reverse-complemented so that the gene of interest contained in the BAC runs 5'-->3'), the position in dbSNP153 does not match. Additionally, at least for insertions, the variant can be either class "ins" --> Ref "-" Alt "A", or a delins like Ref "GG" Alt "GGA". For homopolymers it seems that the entire homopolymer is usually (or sometimes?) the Ref, and the variant would then be classed as delins.
I'm a little bit lost on how to deal with these ambiguities. Are there any specific rules as to when an insertion/deletion is classed as ins, del or delins? And what are the rules for the genomic positions of such variations?
I have also seen many people here recommending tools like SnpEff, ANNOVAR and Ensembl VEP, but since those need VCF or BED input I assume they would also fail if the positions don't match.
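To make the ambiguity concrete, here is a rough sketch of one thing I could imagine doing: compute the full range over which an indel can be shifted without changing the resulting sequence, and then compare ranges instead of exact positions. All coordinates are 0-based, and the helper names are my own. I do not know whether this corresponds to how dbSNP assigns positions, so please treat it as an assumption.

```python
def insertion_range(ref, pos, ins):
    """Leftmost and rightmost insertion points (0-based, inserted before ref[pos])
    that yield the same sequence as inserting `ins` at `pos`."""
    left, seq = pos, ins
    while left > 0 and ref[left - 1] == seq[-1]:
        seq = seq[-1] + seq[:-1]  # rotate the inserted bases when shifting left
        left -= 1
    right, seq = pos, ins
    while right < len(ref) and ref[right] == seq[0]:
        seq = seq[1:] + seq[0]  # rotate the inserted bases when shifting right
        right += 1
    return left, right


def deletion_range(ref, start, end):
    """Leftmost start and rightmost end over all equivalent placements of
    the deletion ref[start:end] (0-based, half-open)."""
    s, e = start, end
    while s > 0 and ref[s - 1] == ref[e - 1]:
        s, e = s - 1, e - 1
    left = s
    s, e = start, end
    while e < len(ref) and ref[s] == ref[e]:
        s, e = s + 1, e + 1
    return left, e


def revcomp(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def flip_interval(start, end, length):
    """Map a half-open interval to the opposite strand of a sequence of `length`."""
    return length - end, length - start


print(insertion_range("GATTTTACG", 4, "T"))  # (2, 6)
print(deletion_range("GATTTTACG", 3, 4))     # (2, 6)
print(flip_interval(2, 6, 9))                # (3, 7)
```

With this, the extra T in GATTTTACG gets the same range (2, 6) no matter where the caller placed it, and flipping that range gives the run of A's in the reverse complement. What I cannot tell is which point or span within that range a dbSNP record would use.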
Any hint on how to tackle this problem would be highly appreciated!
Many thanks,
Hagen