GRCh37/38(NCBI) vs hg19/hg38(UCSC)
10.2 years ago
pwg46 ▴ 540

Are there any major differences between the GRCh38 (NCBI) and hg38 (UCSC) databases, aside from the fact that GRCh38 uses a 1-based coordinate system while UCSC uses a 0-based one? Are there any pros/cons to using one vs. the other? I am guessing that any identifier conversion software (e.g., BioMart) should choose one database over the other? Also, where does Ensembl come into play? Is the Ensembl database just a subset of the GRCh38 (NCBI) database? Any clarification would be greatly appreciated.

ncbi ucsc grch38 hg38 • 81k views
10.2 years ago

GRCh37/hg19 and GRCh38/hg38 are genome builds, not annotations; an annotation describes where features are located in a given genome build. The actual sequences you'll get from NCBI/UCSC/Ensembl will be essentially identical, but their annotations will differ and (importantly) are updated at different frequencies. NCBI's annotation is the RefSeq dataset (the "refGene" track in UCSC), which is essentially a subset of the UCSC and Ensembl annotations. UCSC's annotations are kind of a mess: you'll find genes with the same ID on multiple strands and multiple chromosomes, which makes them a bit useless. Ensembl's annotations typically contain more features than UCSC's (so a bit more noise), but they're otherwise much better put together (e.g., you'll never find a gene ID on different strands or different chromosomes) and their IDs are typically easier to map to other things (e.g., gene names, GO and pathway memberships). Ensembl also updates its annotation fairly often and versions everything nicely, so it's quite convenient to report which version you used in a paper (reproducibility is always a good thing). Given the choice, use the Ensembl annotation.
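
If you want to check an annotation for the "same ID on multiple strands/chromosomes" problem yourself, something like the following would do it. This is only a minimal sketch: the function name `find_ambiguous_gene_ids` and the gene IDs in the example are hypothetical, and it assumes you have already parsed the annotation into `(gene_id, chromosome, strand)` tuples.

```python
from collections import defaultdict


def find_ambiguous_gene_ids(records):
    """Return gene IDs that appear on more than one (chromosome, strand) pair.

    `records` is any iterable of (gene_id, chromosome, strand) tuples,
    e.g. one tuple per transcript row of an annotation table.
    """
    placements = defaultdict(set)
    for gene_id, chrom, strand in records:
        placements[gene_id].add((chrom, strand))
    # Keep only the IDs with more than one distinct placement.
    return {
        gene_id: sorted(places)
        for gene_id, places in placements.items()
        if len(places) > 1
    }


# Hypothetical rows for illustration only (not real annotation data).
example_records = [
    ("geneA", "chr1", "+"),
    ("geneA", "chr1", "+"),
    ("geneB", "chr1", "+"),
    ("geneB", "chr5", "-"),
]
print(find_ambiguous_gene_ids(example_records))
```

An empty result means every ID has a single chromosome and strand, which is what you'd expect from the Ensembl annotation.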

BTW, don't forget that the various sources can use different names for chromosomes (e.g., chr1 in UCSC is just 1 in Ensembl), so don't mix and match them.
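
If you do have to combine files from both sources, convert the names explicitly first. Here is a rough sketch that only handles the primary chromosomes and the mitochondrion (chrM in UCSC, MT in Ensembl); the function names are made up for illustration. Unplaced contigs and alternate loci are named quite differently between the two, so the sketch refuses to guess at those rather than silently mangling them.

```python
# Primary chromosome names without any prefix, as used by Ensembl.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}


def ucsc_to_ensembl(name):
    """Convert a UCSC-style name (chr1, chrX, chrM) to Ensembl style (1, X, MT)."""
    if name == "chrM":
        return "MT"
    if name.startswith("chr") and name[3:] in PRIMARY:
        return name[3:]
    raise ValueError(f"no simple Ensembl equivalent for {name!r}")


def ensembl_to_ucsc(name):
    """Convert an Ensembl-style name (1, X, MT) to UCSC style (chr1, chrX, chrM)."""
    if name == "MT":
        return "chrM"
    if name in PRIMARY:
        return "chr" + name
    raise ValueError(f"no simple UCSC equivalent for {name!r}")


print(ucsc_to_ensembl("chr1"), ucsc_to_ensembl("chrM"), ensembl_to_ucsc("X"))
```

Raising an error on anything outside the primary set is deliberate: for contigs you would need a proper lookup table rather than string manipulation.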


I see. Thank you for your answer. Right now I am using the Ensembl and UniProt databases. Would there be any reason to include the UCSC database if I am working on an identifier conversion tool? E.g., say I am trying to map Ensembl transcript (ENST) identifiers to UniProt. Would I get different mappings converting directly from ENST->UniProt (both the Ensembl and UniProt databases have data files that do so) than converting from ENST->UCSC->UniProt?


You might get more ambiguous mappings going via UCSC (or not, it's hard to say).


Okay. So, in general, do you think it would be wise for identifier conversion software to stick with the Ensembl database only and not mix the two (Ensembl and UCSC)?


Yeah, you'll normally just have more headaches by mixing the two, and Ensembl IDs are typically among the better supported.


There's no need to map IDs between resources yourself: Ensembl has good cross-references to many other databases, including UniProt. You can access them either via BioMart or with the API.


The UCSC Genome Browser just released an "NCBI RefSeq" track that is based entirely on coordinates and alignments provided by the RefSeq group. These new tracks should avoid the issue of genes mapping to multiple locations, etc. You can read more about it on our website: https://genome.ucsc.edu/goldenPath/newsarch.html#030317.

Matthew Speir
UCSC Genome Bioinformatics Group

10.1 years ago
Denise CS ★ 5.2k

In addition to BioMart and the Perl API, you can also use the Ensembl REST API to map Ensembl IDs to cross-reference entries and vice versa.
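
As a rough illustration, here is a sketch of an ID lookup against the `xrefs/id` endpoint on rest.ensembl.org using only the Python standard library. The helper names are hypothetical. It also assumes (please check the current REST documentation) that the optional `external_db` and `all_levels` parameters behave as described in the comments, and that each returned record carries `dbname` and `primary_id` fields with UniProt entries labelled with a `dbname` starting with "Uniprot".

```python
import json
import urllib.parse
import urllib.request

SERVER = "https://rest.ensembl.org"


def build_xref_url(stable_id, external_db=None, all_levels=False):
    """Build the xrefs/id URL for an Ensembl stable ID (gene, transcript, ...)."""
    query = {"content-type": "application/json"}
    if external_db:
        # Assumed optional filter on the external database name.
        query["external_db"] = external_db
    if all_levels:
        # Assumed flag: also return xrefs of linked features
        # (e.g. the translation of a transcript).
        query["all_levels"] = "1"
    path = "/xrefs/id/" + urllib.parse.quote(stable_id)
    return SERVER + path + "?" + urllib.parse.urlencode(query)


def fetch_xrefs(stable_id, external_db=None, all_levels=False):
    """Query the REST server (needs network access) and return the parsed JSON list."""
    url = build_xref_url(stable_id, external_db, all_levels)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def uniprot_accessions(xref_records):
    """Pick the primary IDs of records whose dbname looks like a UniProt source."""
    return sorted({
        record["primary_id"]
        for record in xref_records
        if record.get("dbname", "").lower().startswith("uniprot")
    })


# Usage (uncomment to run against the live server; replace the placeholder
# with a real ENST/ENSG stable ID):
# records = fetch_xrefs("YOUR_ENSEMBL_ID", all_levels=True)
# print(uniprot_accessions(records))
```

The URL building and the filtering are kept separate from the network call so they can be checked offline. For bulk conversions, BioMart or the downloadable mapping files will be kinder to the server than one request per ID.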
