Hi all,
After querying Ensembl (hg38) for all genes on chromosomes 1-22, X and Y, I found that about a third of the gene IDs lack an external gene name. I thought these might be poorly annotated genes that weren't called by other groups, and I was thinking of excluding them from my analysis, but then I discovered that the hg19 version has no missing gene names.
hg38
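library(biomaRt)  # useEnsembl(), getBM()
library(dplyr)    # %>%, filter(), distinct(), group_by(), count()
ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")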
eight <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'chromosome_name', 'start_position', 'end_position', 'ensembl_peptide_id'), mart = ensembl)
eight %>% filter(chromosome_name %in% c(1:22, 'X', 'Y')) %>% distinct(ensembl_gene_id, .keep_all = TRUE) %>% group_by(external_gene_name == '') %>% count()
==> 40,554 gene IDs with a gene name, 20,014 with no gene name
hg19
grch37 <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', host = 'grch37.ensembl.org')
seven <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'chromosome_name', 'start_position', 'end_position', 'ensembl_peptide_id'), mart = grch37)
seven %>% filter(chromosome_name %in% c(1:22, 'X', 'Y')) %>% distinct(ensembl_gene_id, .keep_all = TRUE) %>% group_by(external_gene_name == '') %>% count()
==> 57,736 gene IDs with a gene name, 0 with no gene name.
Similarly, if you count gene IDs per external_gene_name, the most frequent value for hg38 is '' (20,014 gene IDs), while for hg19 it's Y-RNA.
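For anyone who wants to reproduce that per-name count, here is a minimal sketch of how it could be done. It runs on a small made-up data frame standing in for the getBM() result, so the gene IDs and names below are purely illustrative and not real Ensembl records.

```r
library(dplyr)

# Toy stand-in for a getBM() result; IDs and names are invented for illustration.
genes <- data.frame(
  ensembl_gene_id    = c("ID_1", "ID_2", "ID_3", "ID_4", "ID_5", "ID_5"),
  external_gene_name = c("GENE1", "", "Y_RNA", "", "", ""),
  stringsAsFactors   = FALSE
)

# Keep one row per gene ID, then count gene IDs per name, most frequent first.
name_counts <- genes %>%
  distinct(ensembl_gene_id, .keep_all = TRUE) %>%
  count(external_gene_name, sort = TRUE, name = "n_gene_ids")

print(name_counts)
```

On a real query, `genes` would be replaced by the filtered `eight` or `seven` data frame from above.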
I also checked a specific gene ID that's listed as having no external_gene_name in hg38 (ENSG00000121388) and found that it does have one in hg19. Also, that gene has no peptide_id in hg38 but does have one in hg19.
Why does hg38 have loads of blank external_gene_names while hg19 has none?
Thanks Emily! Would those genes be of lower annotation quality/less confident or would you recommend treating them the same as any other gene?
They're definitely less well characterised and less well studied.
Genes get assigned "proper" gene names by HGNC only if they are annotated in both Ensembl and RefSeq, so these genes are likely to have been annotated only by us. Our gene annotation is a lot more comprehensive, whereas RefSeq tends to be more conservative. There is certainly evidence that these regions are transcribed, or we would not have annotated them, but how important and functional they are may be up for debate.
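If the unnamed genes are kept rather than dropped, one possible way to handle them is to flag them and fall back to the Ensembl gene ID as a label. This is only a sketch on made-up data: the column names `has_name` and `gene_label` are hypothetical, and whether to keep or exclude these genes is an analysis decision this snippet does not settle.

```r
library(dplyr)

# Toy stand-in for a getBM() result; IDs and names are invented for illustration.
genes <- data.frame(
  ensembl_gene_id    = c("ID_1", "ID_2", "ID_3"),
  external_gene_name = c("GENE1", "", NA),
  stringsAsFactors   = FALSE
)

labelled <- genes %>%
  mutate(
    # TRUE only when a non-empty, non-missing name is present.
    has_name   = !is.na(external_gene_name) & external_gene_name != "",
    # Use the gene name where available, otherwise the Ensembl gene ID.
    gene_label = if_else(has_name, external_gene_name, ensembl_gene_id)
  )

print(labelled)
```

Keeping the `has_name` flag makes it easy to run the analysis both with and without the unnamed genes and compare the results.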
Thank you!