Dear all, I am working with a list of Ensembl accession codes for a desired group of proteins.
I have downloaded the protein annotations for the genome assembly GRCh38.
I fetched the genomic coordinates from the UniProtKB API service using the Ensembl accession codes. The service returns protein annotation records that include the coordinates I need.
However, I would like to obtain the same coordinates by parsing the GRCh38 data locally instead of querying an online database. I think I found a way that uses the protein sequence FASTA file and the GTF annotation file for the GRCh38 genome assembly. Starting from the Ensembl protein IDs (in the FASTA sequences), it should be possible to find the corresponding Ensembl gene IDs in the GTF annotations and then, in the same annotations, the desired genomic coordinates. However, the GTF annotation file was last updated on 19-Mar-2021, while the protein sequences in FASTA format date from 27-Mar-2021 (today is 19-Sep-2021).
This discrepancy made me doubt which source holds the most up-to-date information.
Now I am wondering:
If I query UniProtKB through an API service, is it possible to find protein annotations not yet included in the GTF annotation set for a specific genome assembly (in this case GRCh38 of 27-Mar-2021)? In other words, could protein annotations fetched from UniProtKB be more up to date than the 27-Mar-2021 GTF annotations for GRCh38?
Moreover:
Is it possible that UniProtKB stores protein accessions that have a corresponding entry in the Ensembl database (cross-reference section of the UniProt web pages) but are not yet included in the GRCh38 GTF annotations downloadable through the Ensembl FTP service? (I mean the GTF file Homo_sapiens.GRCh38.104.chr.gtf.gz, in this repository http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/)
I am asking because, if I am interested in the latest protein accessions and annotations, I think I should take into account the number of new accessions and annotations that may be submitted each month. In light of this, if the online databases, for instance, update their content more frequently than the genome assembly, I will go for the API querying strategy.
Thanks for your answers.
Thank you for the explanation. I appreciated your figure very much. Sometimes it is very difficult to understand how the databases are connected to each other.
Do you have any suggestions for resources that explain the relationships between the different annotations?
Regarding my use of the GRCh38 assembly, I realized that I was not clear enough. I downloaded the FASTA and GTF files from this FTP page on Ensembl: https://www.ensembl.org/info/data/ftp/index.html.
On the Human row, I selected 'Protein sequence (FASTA)' and 'Gene sets' (in GTF format).
Because both file names contain GRCh38, I assumed that it was the reference genome assembly for the annotations and protein sequences stored in the files. For this reason, I sometimes mistakenly refer to the assembly when I mean the protein annotations or sequences.
Please let me know if I am wrong to think of the GRCh38 assembly in this way.
It is true that the annotation is inextricably tied to the assembly and makes no sense without it, so the assembly should be quoted. But in terms of reproducibility, the release number is more relevant. The annotation is constantly updated, whereas the assembly stays the same. For someone (including yourself) to be able to match your work to the annotation you worked with, you should include the Ensembl release number.
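One way to keep both pieces of information with your results is to read them off the GTF file name, which for the file mentioned above follows a `Species.Assembly.Release.…` pattern. The snippet below is only a rough sketch: `parse_ensembl_name` is a hypothetical helper, and it assumes the assembly name itself contains no dots. Some Ensembl files carry no release number in their name, in which case the release comes back as `None` and has to be noted from the download directory instead.

```python
import re

# Species, assembly, then an optional all-digit release field.
_NAME = re.compile(
    r"^(?P<species>[^.]+)\.(?P<assembly>[^.]+)(?:\.(?P<release>\d+))?\."
)


def parse_ensembl_name(filename):
    """Split an Ensembl-style file name into species, assembly and release.

    Returns a dict; 'release' is an int, or None when the name has no
    numeric release field. Raises ValueError if the name does not match.
    """
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised file name: {filename!r}")
    release = match.group("release")
    return {
        "species": match.group("species"),
        "assembly": match.group("assembly"),
        "release": int(release) if release is not None else None,
    }


info = parse_ensembl_name("Homo_sapiens.GRCh38.104.chr.gtf.gz")
print(f"{info['assembly']}, Ensembl release {info['release']}")
```

Writing that one line (assembly plus release) into your methods or output header is usually enough for someone to fetch the same annotation later.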