Question

NCBI vs Ensembl - Ortholog genes Information

0

15 months ago

José ▴ 10

Hi, I am currently working on a project and I have been extracting some genetic sequences from whole genomes using the exonerate protein2genome model.

The problem is that I found contradictory information regarding some of the sequences I obtained. For example, Ensembl says that there are no orthologous sequences for a gene in Dasypus novemcinctus; on the other hand, NCBI has that sequence for Dasypus and considers it an ortholog of the human gene. Additionally, I obtained a perfect sequence with a high raw score in exonerate. So, it seems to me that the sequence exists in Dasypus novemcinctus and it's an ortholog of the human gene. How can I be certain of this?

Another strange thing is that in Ensembl some species almost never have orthologs, like Nothoprocta perdicaria, which is extremely strange, although it is a Paleognath.

What is the best practice for these cases?

Thank you

Genomics Orthologs NCBI Ensembl • 2.7k views

3

15 months ago

b.contreras.moreira ▴ 460

Hi José, orthology inference is tricky and that's why the https://questfororthologs.org consortium has been comparing graph-based and tree-based methodologies for 10+ years.

At Ensembl, the pipeline to call orthologues is called Ensembl Compara and it is tree-based, as described at https://www.ensembl.org/info/genome/compara/homology_method.html. In the latest quest benchmark (https://academic.oup.com/nar/article/50/W1/W623/6584783) the performance of Compara was globally summarized as "Recall over precision". For a gene of interest you can actually check the table of orthologues and see what kind of Gene Order Conservation (GOC) or Whole Genome Alignment (WGA) evidence supports each orthologous gene. See for instance http://www.ensembl.org/Dasypus_novemcinctus/Gene/Compara_Ortholog?db=core;g=ENSDNOG00000016414;r=JH569934.1:90494-92039;t=ENSDNOT00000016414
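If you would rather pull the same orthologue table programmatically, here is a minimal Python sketch against the Ensembl REST API. It assumes the `homology/id/:species/:id` endpoint and the `data -> homologies -> target` JSON layout as I understand them from the REST documentation (older releases omitted the species from the path). The helper names are made up for illustration, and the GOC/WGA scores shown in the browser may not be part of the response, so treat it as a starting point and check the current REST docs.

```python
import json
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def homology_url(species, gene_id, target_species=None):
    """Build the REST URL that asks for orthologues of one gene."""
    params = ["type=orthologues"]
    if target_species:
        params.append(f"target_species={target_species}")
    return f"{REST_SERVER}/homology/id/{species}/{gene_id}?" + ";".join(params)


def fetch_json(url):
    """Download and decode one JSON response (needs network access)."""
    request = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def summarize_orthologues(payload):
    """Flatten a homology response into (species, gene id, type, % identity) rows.

    Assumed layout: {"data": [{"homologies": [{"type": ..., "target": {...}}]}]}
    """
    rows = []
    for entry in payload.get("data", []):
        for homology in entry.get("homologies", []):
            target = homology.get("target", {})
            rows.append(
                (
                    target.get("species"),
                    target.get("id"),
                    homology.get("type"),
                    target.get("perc_id"),
                )
            )
    return rows


# Usage (commented out because it needs network access):
# url = homology_url("dasypus_novemcinctus", "ENSDNOG00000016414", "homo_sapiens")
# for row in summarize_orthologues(fetch_json(url)):
#     print(row)
```

An empty list for a given target species only tells you that Compara made no call there; it does not tell you whether the gene is absent from the genome or just from the annotation.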

According to https://www.ncbi.nlm.nih.gov/kis/info/how-are-orthologs-calculated, the NCBI approach to call orthologues seems to be graph-based, as "the reference genome is searched best and near-best matches based on protein sequence similarity. Candidates are further analyzed for nucleotide sequence similarity across all exons (including UTRs), and an additional 2kb sequence on either side of the gene". Importantly, it also leverages "microsynteny within the local genomic neighborhood (+/- 10 genes)".
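To make the microsynteny idea a bit more concrete, here is a toy Python sketch. It is not NCBI's implementation; it just counts how many neighbours of a reference gene (within a window of genes on either side) have an already-accepted orthologue sitting near the candidate gene in the other genome. The gene names, gene orders, and the `known_pairs` mapping are all hypothetical.

```python
def neighbours(gene_order, gene, window=10):
    """Return the genes located up to `window` positions on either side of `gene`."""
    i = gene_order.index(gene)
    start = max(0, i - window)
    return set(gene_order[start:i] + gene_order[i + 1 : i + 1 + window])


def microsynteny_support(ref_order, tgt_order, ref_gene, tgt_gene, known_pairs, window=10):
    """Count reference neighbours whose known orthologue is a neighbour of the candidate."""
    ref_nb = neighbours(ref_order, ref_gene, window)
    tgt_nb = neighbours(tgt_order, tgt_gene, window)
    return sum(1 for gene in ref_nb if known_pairs.get(gene) in tgt_nb)


# Hypothetical gene orders along one reference and one target scaffold.
ref_order = ["A", "B", "C", "D", "E"]
tgt_syntenic = ["a", "b", "x", "d", "e"]
tgt_elsewhere = ["x", "q", "r", "s", "a", "b", "d", "e"]
known_pairs = {"A": "a", "B": "b", "D": "d", "E": "e"}

# Candidate pair C/x: compare the support in the two target arrangements.
print(microsynteny_support(ref_order, tgt_syntenic, "C", "x", known_pairs, window=2))
print(microsynteny_support(ref_order, tgt_elsewhere, "C", "x", known_pairs, window=2))
```

A candidate with good sequence similarity but no supporting neighbours is the kind of call worth inspecting by hand.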

If you have a small number of genes of interest, my advice would be to check the evidence supporting the orthology calls, whatever the method, and take them if the evidence checks out.

0

Thanks for the answer and the advice.

ADD REPLY • link 15 months ago by José ▴ 10

1

15 months ago

liorglic ★ 1.5k

Orthology analyses are always a bit complicated and it is expected that when you use different DBs you would get different results. This is because the DBs you mentioned differ in both the annotations they contain and the way they perform orthology clustering.
There is no simple answer regarding what you should do. It really depends on what you are trying to achieve in the next steps. You can probably choose one of the following approaches:

Choose either ENSEMBL or NCBI and stick with it
Try combining them, that is - if no ortholog is found in NCBI, take from ENSEMBL and vice versa
Cross-check - take an ortholog only if it appears in both NCBI and ENSEMBL

Again, the choice should depend on the next steps and on the specific results - how many data points you need vs. how many you get with each method.
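The three approaches listed above can be sketched in a few lines of Python, which also makes it easy to see how many genes each one leaves you with. This is only an illustration: the gene names and ortholog IDs are placeholders, and it assumes you have already built one table per database mapping each query gene to an ortholog ID (or `None` when there is no call).

```python
def pick_orthologs(ncbi, ensembl, strategy):
    """Return {gene: {source: ortholog_id}} for genes that pass the chosen strategy."""
    picked = {}
    for gene in sorted(set(ncbi) | set(ensembl)):
        n, e = ncbi.get(gene), ensembl.get(gene)
        if strategy == "ncbi_only":
            calls = {"NCBI": n} if n else {}
        elif strategy == "ensembl_only":
            calls = {"Ensembl": e} if e else {}
        elif strategy == "either":
            # Fall back to the other database when one has no call.
            # When both have a call, this sketch arbitrarily keeps NCBI.
            calls = {"NCBI": n} if n else ({"Ensembl": e} if e else {})
        elif strategy == "both":
            # Cross-check: keep the gene only if both databases report an ortholog.
            calls = {"NCBI": n, "Ensembl": e} if n and e else {}
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        if calls:
            picked[gene] = calls
    return picked


# Placeholder tables: query gene -> ortholog ID in the target species (None = no call).
ncbi_calls = {"GENE_A": "ncbi_1", "GENE_B": "ncbi_2", "GENE_C": None}
ensembl_calls = {"GENE_A": "ens_1", "GENE_B": None, "GENE_C": "ens_3"}

for strategy in ("ncbi_only", "ensembl_only", "either", "both"):
    result = pick_orthologs(ncbi_calls, ensembl_calls, strategy)
    print(strategy, len(result), sorted(result))
```

Note that the two databases use different identifiers, so the cross-check here only tests that both report some ortholog; confirming that the two IDs refer to the same locus would need coordinates on a shared assembly.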

0

Thank you so much for your answer and advice.

ADD REPLY • link 15 months ago by José ▴ 10

score 4 · Accepted Answer · 2024-05-29

I believe the issue you're seeing may stem from the annotation sets more than from differences in orthology methods (although both can certainly contribute). While a species may have annotation in both NCBI RefSeq (the data used for NCBI orthologs) and Ensembl, the two may use different assemblies due to preferences, timing, or both. On top of that, different annotation pipelines may miss some genes that are present in the assembly. Taken together, differences are easier to interpret where the data can be matched up, whereas presence/absence differences may be either biological or technical. Plus, there are many species with annotation available in only one of the databases.

For Dasypus novemcinctus, there's a difference in assemblies:

NCBI RefSeq: https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=9361&annotated_only=true&refseq_annotation=true GCA_030445035.1 / GCF_030445035.1 = mDasNov1.hap2, a recent high-quality assembly from VGP annotated for RefSeq in July 2023.
Ensembl full: https://useast.ensembl.org/Dasypus_novemcinctus/Info/Index GCA_000208655.2 = Dasnov3.0, a much older assembly from Baylor with a contig N50 of only 26 kb; the gene build was in 2013, last updated in 2016.

So while the species is in both resources, Ensembl could be missing genes because of gaps in the old Dasnov3.0 assembly, or differences in annotation methodology. Running BUSCO on the annotation sets, it looks like the Ensembl annotation scores as 87.6% complete with the eutheria_odb10 set, compared to 91.4% for the prior RefSeq annotation on the same Dasnov3.0 assembly (so a bit better, but both are poor). The current RefSeq annotation on mDasNov1.hap2 scores 96.9% complete. At face value, that 9.3-point gap (96.9% - 87.6%) could mean that roughly 9% of genes are found in the current RefSeq set but missing in Ensembl.

For Nothoprocta perdicaria,

Both are using GCA_003342845.1 = notPer1 from 2019, so it's not an assembly difference. But we found BUSCO with the aves_odb10 set reports the Ensembl annotation as 83.0% complete compared to 98.4% for RefSeq. UniProt has similar stats for the Ensembl annotation (https://www.uniprot.org/proteomes/UP000694420), so this isn't some anomaly of our BUSCO data. I don't see any BAM coverage files under: https://ftp.ensembl.org/pub/release-112/bamcov/ so it might be that the Ensembl annotation for Nothoprocta perdicaria was generated without using RNA-seq. Someone at Ensembl might be able to confirm that. That could easily result in the BUSCO difference.
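Putting the BUSCO completeness figures quoted in this answer side by side, a few lines of Python give the gap between each annotation and the best one for the same species. The percentages are the ones stated above; the dictionary layout and the function name are just for illustration, and the gaps are percentage points of BUSCO genes, which is only a rough proxy for the fraction of all genes affected.

```python
# BUSCO "complete" percentages as quoted in this answer
# (eutheria_odb10 for Dasypus, aves_odb10 for Nothoprocta).
busco_complete = {
    "Dasypus novemcinctus": {
        "Ensembl (Dasnov3.0)": 87.6,
        "RefSeq, prior (Dasnov3.0)": 91.4,
        "RefSeq, current (mDasNov1.hap2)": 96.9,
    },
    "Nothoprocta perdicaria": {
        "Ensembl (notPer1)": 83.0,
        "RefSeq (notPer1)": 98.4,
    },
}


def gap_to_best(scores):
    """Percentage-point gap between each annotation and the best-scoring one."""
    best = max(scores.values())
    return {name: round(best - value, 1) for name, value in scores.items()}


for species, scores in busco_complete.items():
    print(species)
    for name, gap in gap_to_best(scores).items():
        print(f"  {name}: {gap} points below the best annotation")
```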

Both NCBI and Ensembl browsers allow you to visualize both annotation sets when available on the same assembly, so reviewing that data can help to understand where differences may be coming from.