Question

20,000 Ensembl gene IDs with external gene name in hg19 but not in hg38

4.2 years ago

loughrae ▴ 90

Hi all,

After querying Ensembl (hg38) for all genes on chromosomes 1-Y, I found that about a third of the gene IDs are lacking an external gene name. I thought this might be poorly-annotated genes that weren't called by other groups and was thinking of excluding them from my analysis, but then I discovered that the hg19 version has no missing gene names.

hg38

library(biomaRt)
library(dplyr)

ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
eight <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'chromosome_name', 'start_position', 'end_position', 'ensembl_peptide_id'), mart = ensembl)
eight %>% filter(chromosome_name %in% c(1:22, 'X', 'Y')) %>% distinct(ensembl_gene_id, .keep_all = TRUE) %>% group_by(external_gene_name == '') %>% count()

==> 40,554 gene IDs with a gene name, 20,014 with no gene name

hg19

ensembl37 <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', host = 'grch37.ensembl.org')
seven <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'chromosome_name', 'start_position', 'end_position', 'ensembl_peptide_id'), mart = ensembl37)
seven %>% filter(chromosome_name %in% c(1:22, 'X', 'Y')) %>% distinct(ensembl_gene_id, .keep_all = TRUE) %>% group_by(external_gene_name == '') %>% count()

==> 57,736 gene IDs with a gene name, 0 with no gene name.

Similarly, if you count the external_gene_name values for hg38, the most frequent one is '' (20,014 gene IDs), while for hg19 it's Y-RNA.

I also checked a specific gene ID that's listed as having no external_gene_name in hg38 (ENSG00000121388) and found that it does have one in hg19. Also, that gene has no peptide_id in hg38 but does have one in hg19.

Why does hg38 have loads of blank external_gene_names while hg19 has none?

ensembl biomart • 2.6k views

score 2 · Accepted Answer · 2021-05-06

4.2 years ago

Emily 24k

All clone-based gene names, like RP11-408E5.4, were retired in Ensembl release 104.
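
If you would rather keep these genes than drop them, one option is to fall back to the stable Ensembl gene ID wherever external_gene_name is blank. The snippet below is only a minimal sketch of that idea on a toy data frame standing in for a getBM() result; the rows, the `label` column and the `unnamed` and `name_counts` variables are illustrative assumptions, not part of biomaRt or of the original query output.

```r
# Toy stand-in for a getBM() result; the rows are illustrative, not real query output.
genes <- data.frame(
  ensembl_gene_id    = c("ENSG00000141510", "ENSG00000139618", "ENSG00000121388"),
  external_gene_name = c("TP53", "BRCA2", ""),
  stringsAsFactors   = FALSE
)

# Flag genes whose external name is blank (or NA).
unnamed <- is.na(genes$external_gene_name) | genes$external_gene_name == ""

# Count named vs. unnamed genes.
name_counts <- table(factor(unnamed, levels = c(FALSE, TRUE),
                            labels = c("named", "unnamed")))
print(name_counts)

# Use the stable Ensembl ID as the display label when no name is available,
# so that no gene is silently dropped from downstream analysis.
genes$label <- ifelse(unnamed, genes$ensembl_gene_id, genes$external_gene_name)
print(genes)
```

The same two steps should carry over to the real `eight` data frame from the question, since it has the same ensembl_gene_id and external_gene_name columns.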

Thanks Emily! Would those genes be of lower annotation quality/less confident or would you recommend treating them the same as any other gene?

Reply by loughrae, 4.2 years ago

They're definitely less well characterised and less well studied.

Genes get assigned "proper" gene names by HGNC only if they are annotated in both Ensembl and RefSeq, so these genes are likely to have been annotated only by us. Our gene annotation is a lot more comprehensive, whereas RefSeq tends to be more conservative. There is certainly evidence that these regions are transcribed, or we would not have annotated them, but how important and functional they are may be up for debate.