Are there any major differences between the GRCh38 (NCBI) and hg38 (UCSC) databases, aside from the fact that GRCh38 uses a 1-based coordinate system, while UCSC uses a 0-based coordinate system? Are there any pros/cons to using one vs. the other? And I am guessing that any identifier conversion software (e.g., BioMart) should choose one database over the other? Also, where does Ensembl come into play? Is the Ensembl database just a subset of the GRCh38 (NCBI) database? Any clarification would be greatly appreciated.
I see. Thank you for your answer. So, right now I am using the Ensembl and UniProt databases. Would there be any reason to include the UCSC database if I am working with an identifier conversion tool? For example, say I am trying to map Ensembl transcript (ENST) identifiers to UniProt. Would I get any different mappings converting directly from ENST->UniProt (both the Ensembl and UniProt databases have data files that do so) than converting from ENST->UCSC->UniProt?
You might get more ambiguous mappings going via UCSC (or not, it's hard to say).
Okay. So, in general, do you think it would be wise to stick only with the Ensembl database and not mix the two (Ensembl and UCSC) in identifier conversion software?
Yeah, you'll normally just have more headaches by mixing the two, and Ensembl IDs are typically among the more widely supported.
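To illustrate why routing through an intermediate ID space can add ambiguity, here is a minimal sketch that chains two one-to-many mappings. All identifiers (`ENST_A`, `uc_1`, `P_1`, and so on) are made-up placeholders, not real records, and the mapping contents are assumed purely for illustration.

```python
from collections import defaultdict

# Hypothetical placeholder identifiers, not real database records.
enst_to_uniprot = {"ENST_A": {"P_1"}}                       # direct mapping
enst_to_ucsc = {"ENST_A": {"uc_1", "uc_2"}}                 # first hop
ucsc_to_uniprot = {"uc_1": {"P_1"}, "uc_2": {"P_1", "P_2"}}  # second hop


def compose(first, second):
    """Chain two one-to-many mappings: key -> intermediate -> target."""
    out = defaultdict(set)
    for key, mids in first.items():
        for mid in mids:
            # Every intermediate ID contributes all of its targets.
            out[key] |= second.get(mid, set())
    return dict(out)


via_ucsc = compose(enst_to_ucsc, ucsc_to_uniprot)

print("direct:  ", sorted(enst_to_uniprot["ENST_A"]))
print("via UCSC:", sorted(via_ucsc["ENST_A"]))
```

In this toy case the indirect route returns a superset of the direct mapping, because each intermediate ID brings along every target it is linked to. Whether that happens with real data depends on the files used.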
There is no need to map IDs between resources yourself: Ensembl has good cross-references to many other databases, including UniProt. You can access those either via BioMart or with the API.
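As a rough sketch of the BioMart route, the snippet below only builds the XML query and request URL for an ENST-to-UniProt lookup; it does not contact any server. The endpoint URL, the dataset name `hsapiens_gene_ensembl`, and the attribute name `uniprotswissprot` are assumptions here and should be checked against the current BioMart interface. `ENST_PLACEHOLDER` is a stand-in, not a real transcript ID.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Assumed BioMart endpoint; verify against the Ensembl documentation.
BIOMART_URL = "https://www.ensembl.org/biomart/martservice"


def build_query(transcript_ids, dataset="hsapiens_gene_ensembl"):
    """Return a BioMart XML query asking for UniProt cross-references."""
    ids = quoteattr(",".join(transcript_ids))  # escaped and quoted value
    # Attribute names are assumptions; check them in the BioMart web UI.
    attributes = ["ensembl_transcript_id", "uniprotswissprot"]
    attr_xml = "".join(f'<Attribute name="{a}"/>' for a in attributes)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<!DOCTYPE Query>"
        '<Query virtualSchemaName="default" formatter="TSV" '
        'header="0" uniqueRows="1">'
        f'<Dataset name="{dataset}" interface="default">'
        f'<Filter name="ensembl_transcript_id" value={ids}/>'
        f"{attr_xml}"
        "</Dataset></Query>"
    )


def build_url(transcript_ids):
    """Return the GET URL that would submit the query to BioMart."""
    return BIOMART_URL + "?" + urlencode({"query": build_query(transcript_ids)})


# Placeholder ID only; substitute real ENST identifiers before use.
print(build_url(["ENST_PLACEHOLDER"]))
```

The resulting URL can be fetched with any HTTP client. The response would be tab-separated rows of transcript ID and UniProt accession, assuming the attribute names above are valid for the chosen dataset.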
The UCSC Genome Browser just released an "NCBI RefSeq" track that is based entirely on coordinates and alignments provided by the RefSeq group. These new tracks should avoid issues such as genes mapping to multiple locations. You can read more about it on our website: https://genome.ucsc.edu/goldenPath/newsarch.html#030317.
Matthew Speir
UCSC Genome Bioinformatics Group