There is a standard HGVS notation for reporting mutations relative to a transcript or protein reference sequence. The old version is at http://www.hgvs.org/mutnomen/ and the updated version is at http://varnomen.hgvs.org/. Your example codes are in line with at least one of these standards; only this one bothers me:
c.G1514A
I do not remember this style as a standard notation.
For everything else, it is possible to write a script that makes the codes look like HGVS, but it is going to be very hard to make sure that each description is actually correct HGVS notation, because HGVS has specific rules for regions with repeats, for mutations in splice sites and so on. For example, consider CTATATAG on the forward strand of DNA changed to CTATAG. What should the notation say: deletion of the first TA, the second TA or the third TA? To the end user and in databases, c.23_24delTA and c.27_28delTA look like different mutations, but as in this example, the resulting sequence is the same. HGVS therefore has to define which one you report (its rule is to describe the change at the most 3' position possible). Because of this, there are two options to consider:
- Check whether your data includes the chromosome, position, reference allele and alternative allele for each mutation, along with the reference assembly used. If the assemblies differ, use the LiftOver tool from UCSC to convert everything to a single assembly, then use a tool that generates HGVS notation from the reference transcripts in your metadata.
- Write a tool that makes your mutation codes look like proper HGVS notation (replace one-letter amino acid codes with three-letter ones, remove the text after _del and so on).
The second approach might be ok, but it is not guaranteed to give you proper HGVS notation for the reasons described above, so I would go with the first option.
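To make the repeat problem concrete, here is a minimal sketch of shifting a deletion as far 3' as it will go, using the CTATATAG example. The function name `shift_deletion_3prime` is made up for illustration, and the positions are 1-based offsets into the toy string rather than real c. coordinates. It only handles simple deletions on the strand the string is written on; a real normaliser also has to deal with strand, introns and reference sequence versions.

```python
from typing import Tuple


def shift_deletion_3prime(seq: str, start: int, end: int) -> Tuple[int, int]:
    """Shift a deletion of seq[start..end] (1-based, inclusive) as far 3' as possible.

    Deleting the returned range gives the same sequence as deleting the input range.
    """
    # Work with 0-based, half-open coordinates internally.
    s, e = start - 1, end
    # The deleted block can slide one base to the right whenever its first base
    # equals the base immediately after the block.
    while e < len(seq) and seq[s] == seq[e]:
        s += 1
        e += 1
    return s + 1, e


seq = "CTATATAG"
for start, end in [(2, 3), (4, 5), (6, 7)]:
    result = seq[: start - 1] + seq[end:]
    print(start, end, result, shift_deletion_3prime(seq, start, end))
```

All three deletions produce CTATAG, and all three shift to the same range (6, 7), which is why a single agreed rule is needed before two descriptions can be compared as strings.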
If you do not have the original chromosome, position, reference and alternative allele data for the mutations, you can try to guesstimate them, but this is not easy, and for mutations described at the protein level it may be impossible.
Moreover, your notation refers to specific versions of transcripts and proteins. If the data is old, some of the sequences have most likely been updated since, and this can change the notation (although that is a rare event).
Once you have them standardised, what would the next step be? These are the HGVS notations. Do you know which genes they map to? If you have something like 5:g.140532T>C or NM_153681.2:c.7C>T or ENST00000285667.3:c.1047_1048insC or NP_000020.1:p.Met268Thr, you can use the Variant Effect Predictor to get them all either as genomic coordinates or known IDs such as rsXXXXX.
I don't think it's necessary to use the Variant Effect Predictor. I believe the hgvs Python package is the somewhat official code for parsing HGVS (http://hgvs.readthedocs.io/).
This is a very good tool; however, to use it the mutations have to be in the HGVS mutnomen standard, and this might not be the case. Also, c.G1514A is not HGVS standard as far as I remember.
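A code like c.G1514A can usually be rewritten mechanically into the substitution form c.1514G>A before it is passed to a parser. The sketch below is one possible way to do that, and it also expands one-letter amino acid codes. The names `normalise_legacy`, `LEGACY_CDNA` and `LEGACY_PROT` are made up for illustration, and the patterns only cover simple substitutions. This is purely a syntactic clean-up: it does not check that the reference base or residue matches the transcript, so the output still needs validating against the reference sequence.

```python
import re

# One-letter to three-letter amino acid codes; "*" is the stop codon.
AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}

# Assumed legacy shapes: c.<ref><pos><alt> and p.<ref><pos><alt>.
LEGACY_CDNA = re.compile(r"^c\.([ACGT])(\d+)([ACGT])$")
LEGACY_PROT = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*])$")


def normalise_legacy(code: str) -> str:
    """Rewrite simple legacy substitution codes; return anything else unchanged."""
    m = LEGACY_CDNA.match(code)
    if m:
        ref, pos, alt = m.groups()
        return f"c.{pos}{ref}>{alt}"
    m = LEGACY_PROT.match(code)
    if m and m.group(1) in AA3 and m.group(3) in AA3:
        ref, pos, alt = m.groups()
        return f"p.{AA3[ref]}{pos}{AA3[alt]}"
    # Unrecognised or already-standard descriptions are left alone.
    return code


for code in ["c.G1514A", "p.M268T", "c.7C>T", "c.23_24delTA"]:
    print(code, "->", normalise_legacy(code))
```

Returning unrecognised input unchanged keeps the function safe to run over a mixed column, so anything that still fails to parse afterwards can be reviewed by hand.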
Yes, I already have all the associated metadata (genes, transcripts, etc.), but I'm trying to get all these IDs into a consistent format to standardize my dataset.