Hi,
When exploring datasets from published papers, I find, more often than I'd like, that the genes present in the dataset and the annotation file version mentioned in the materials & methods do not match.
Usually I end up with a list of genes that are present in the data but do not exist in the annotation version that was supposedly used. Then it is my turn to enter a spiral of testing different Ensembl/Gencode versions until I find the version that misses the fewest genes (spoiler: there is never a perfect match).
Is there an online tool that lets you enter a list of gene names or ids, and returns a list of annotation versions where they are present?
Edit: I've found this tool to search for synonymous gene names, https://www.genenames.org/tools/multi-symbol-checker/ , which addresses another common problem here. However, it doesn't tell you when the name change happened, so you cannot tell which older annotation versions to start digging through.
being able to effectively map between annotations of different kinds is among the most necessary skill sets in bioinformatics. i'd start with a comprehensive resource, e.g. https://biostar.myshopify.com/
Thanks for your comment.
However, I fail to see which of these courses can help me identify which version of the annotation files contains a deprecated gene symbol or alias, or which source was really used and then misreported in the materials and methods of the article. Could you point me to the linked course that can help with this kind of issue?
What I am facing right now is a dataset with ~3800 genes supposedly annotated against Gencode v33. The dataset contains 20 genes using a deprecated alias or symbol not present in the annotation file (I could match them to the symbols in the annotation file using https://www.genenames.org/tools/multi-symbol-checker/ , Ensembl, and Google), and 10 more genes that cannot be resolved.
For example:
The dataset contains an entry for the DUXAP10 gene. This gene is not present in Gencode v33.
The current version of Ensembl lists LNMAT1 as an alias for DUXAP10.
LNMAT1 is also not present in Gencode v33.
https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:32188 lists LNMAT1 as an alias of DUXAP9. Ensembl lists LNMAT1 as an alias of both DUXAP9 and DUXAP10.
DUXAP9 is indeed present in Gencode v33.
I downloaded newer versions of Gencode, and DUXAP10 only appears from v35 on.
However, using Gencode v35, there are now 46 genes from the dataset that do not match the annotation. Some of them are the ones with old aliases, but 15 are genes that have been removed from Gencode since v33 (their gene_id has been removed from Ensembl; its last appearance was in v100).
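The version-by-version check described above can be roughly sketched in Python. This is only a minimal sketch: it assumes the annotation files are GTFs whose ninth column carries `gene_name "..."` attributes (as Gencode and Ensembl GTFs do), and the function names, the version labels, and the toy lines below are made up for illustration, not real annotation data.

```python
import re

# GTF attribute column looks like: gene_id "G1"; gene_name "GENE_A"; ...
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def gene_names_from_gtf_lines(lines):
    """Collect gene_name values from the 'gene' feature rows of a GTF."""
    names = set()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_NAME_RE.search(fields[8])
        if match:
            names.add(match.group(1))
    return names


def version_span(symbol, names_by_version):
    """Return (first, last) version containing symbol, or None if it never appears."""
    hits = sorted(v for v, names in names_by_version.items() if symbol in names)
    return (hits[0], hits[-1]) if hits else None


# Toy stand-ins for two annotation releases (hypothetical, not real data).
TOY_OLD = [
    "##description: toy annotation\n",
    'chr1\tTOY\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A";\n',
    'chr1\tTOY\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A"; transcript_id "T1";\n',
    'chr1\tTOY\tgene\t200\t300\t.\t-\t.\tgene_id "G2"; gene_name "GENE_OLD";\n',
]
TOY_NEW = [
    'chr1\tTOY\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A";\n',
    'chr1\tTOY\tgene\t400\t500\t.\t+\t.\tgene_id "G3"; gene_name "GENE_NEW";\n',
]

names_by_version = {
    1: gene_names_from_gtf_lines(TOY_OLD),
    2: gene_names_from_gtf_lines(TOY_NEW),
}

for symbol in ["GENE_A", "GENE_OLD", "GENE_NEW", "GENE_MISSING"]:
    print(symbol, version_span(symbol, names_by_version))
```

For real files, the same function can be fed an open handle, e.g. `gzip.open(path, "rt")` for a downloaded `.gtf.gz`, with one entry in `names_by_version` per release you have on disk.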
How can I "discover" the actual source (or combination of sources) of the annotation file used to produce this dataset?
hi again, you are very right that my answer doesn't get you very far in solving what is admittedly a thorny problem. not so much thorny i guess, as just, "rote" and to some degree time intensive.
you are on the right track. what you need to do is curate a superset of the possible annotations and then map what you can to what you can, converging ultimately on a single up-to-date annotation set.
based on your answer, it seems like you are doing this yourself. this is how most people start, but generally speaking it is a better use of time to draw on other resources that have already done this.
consider, for instance, packages like AnnotationHub, GO.db, etc. that already have pre-populated tables with these gene annotations by version for Gencode, GO, official gene symbol, HUGO, and so on. another good resource is the UCSC table browser: if you spend enough time doing various things on there, eventually you'll see it is a very powerful resource for problems like this.
anyway, once you have what you consider to be a plausible superset of possible annotation Dbs, simply run all against all. the smoking gun is if one annotation Db actually gets all of them; but if this doesn't happen you have recourse...
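a minimal sketch of the "all against all" step, in Python. it assumes each candidate annotation has already been reduced to a set of gene symbols (or ids); the function names, the labels, and the toy sets are hypothetical placeholders, not real Gencode content.

```python
def rank_annotations(dataset_genes, annotations):
    """Rank candidate annotations by how many dataset genes they fail to contain."""
    dataset_genes = set(dataset_genes)
    report = []
    for label, names in annotations.items():
        missing = dataset_genes - set(names)
        report.append((label, len(missing), sorted(missing)))
    # Fewest missing genes first; ties broken by label for stable output.
    report.sort(key=lambda row: (row[1], row[0]))
    return report


def unresolved_everywhere(dataset_genes, annotations):
    """Genes that none of the candidate annotations contain."""
    seen = set()
    for names in annotations.values():
        seen.update(names)
    return sorted(set(dataset_genes) - seen)


# Toy example (hypothetical symbols and labels).
dataset = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
candidates = {
    "toy_v1": {"GENE_A", "GENE_B"},
    "toy_v2": {"GENE_A", "GENE_B", "GENE_C"},
    "toy_v3": {"GENE_B"},
}

ranking = rank_annotations(dataset, candidates)
for label, n_missing, missing in ranking:
    print(label, n_missing, missing)

print("in no candidate:", unresolved_everywhere(dataset, candidates))
```

a candidate with zero missing genes is the smoking gun. otherwise the genes left in `unresolved_everywhere` are the ones worth chasing by hand through alias tables, since no version in the superset explains them.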
that help?