Translating gene names to Entrez IDs
3.8 years ago
Julian ▴ 10

Hi all,

I have many gene names that I'm trying to map to Entrez IDs.

Right now I use Entrez.esearch from Biopython to query them one by one, but this takes a long time for 30,000 gene names, and ideally I would like it to be faster. I assume it would be faster if I could query all 30,000 at once instead of making 30,000 separate queries.

This is my current implementation:

from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address

for line in f:
    fields = [item.strip('"') for item in line.strip().split()]
    gene = fields[0]
    # Search NCBI Gene for an existing gene ID
    gene_id = None
    handle = Entrez.esearch(db="gene", term="Homo sapiens[orgn] AND " + gene + "[Gene Name]")
    record = Entrez.read(handle)
    handle.close()
    # IdList is empty when the search finds nothing
    if record["IdList"]:
        gene_id = record["IdList"][0]

This works, but I would like a faster solution. Is there a better way to approach this?
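For illustration, batching could look roughly like the sketch below. It only builds the search terms and does not contact NCBI. The helper names chunked and build_term and the batch size are assumptions made for this example, not part of Biopython. Each resulting term could be passed to Entrez.esearch with a larger retmax, but the returned IdList does not say which symbol produced which ID, so a follow-up step (for example Entrez.esummary) would still be needed to pair them up.

```python
from typing import Iterator, List


def chunked(symbols: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` symbols."""
    for start in range(0, len(symbols), size):
        yield symbols[start:start + size]


def build_term(symbols: List[str], organism: str = "Homo sapiens") -> str:
    """Combine several symbols into one esearch term joined with OR."""
    alternatives = " OR ".join(f"{symbol}[Gene Name]" for symbol in symbols)
    return f"{organism}[orgn] AND ({alternatives})"


# A tiny batch size is used here only to keep the printed terms readable.
symbols = ["BRCA1", "BRCA2", "TP53", "EGFR", "TNF"]
terms = [build_term(batch) for batch in chunked(symbols, size=2)]
for term in terms:
    # Each term is a candidate for Entrez.esearch(db="gene", term=term, retmax=...)
    print(term)
```

With 30,000 symbols, a batch size in the low hundreds would cut the number of requests by about two orders of magnitude, although the best size is something to find by experiment.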

Kind regards, Julian

database annotation biopython gene python • 7.0k views

The correct term for the identifiers you're calling "gene names" is HGNC Gene Symbols. HGNC also has "Gene Names", which are more like descriptions than one-word symbols.

3.8 years ago
GenoMax 148k

You can use Entrez Direct (EDirect) on the command line instead of going through Python:

$ esearch -db gene -query "human [orgn]" | efetch -format tabular | awk -F "\t" '{OFS="\t"}{print $6,$3}'
Symbol  GeneID
LOC120893160    120893160
LOC120893158    120893158
LOC120893156    120893156
LOC120893154    120893154
LOC120893152    120893152
LOC120893150    120893150
LOC120893148    120893148
LOC120893146    120893146
LOC120893144    120893144

To get only live entries that don't have LOC in the symbol (drop everything from grep onward to get them all):

$ esearch -db gene -query "human [orgn]" | efetch -format tabular | awk -F "\t" '{OFS="\t"}{if ($5 == "live") print $6,$3}' | grep -v "LOC" | head -10
SLC17A6-DT  120883619
TTC12-DT    120883617
TPBGL-AS1   120883615
PATL1-DT    120883613
TP53    7157
IGSF22-AS1  120883618
EMSY-DT 120883616
CCDC90B-AS1 120883614
EGFR    1956
TNF 7124
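
If the two-column output is saved to a file, a script can load it once and look up every symbol locally instead of querying per gene. The following is a minimal sketch under that assumption. The function name parse_symbol_table is made up for this example, and the sample rows are copied from the output above.

```python
from typing import Dict, Iterable


def parse_symbol_table(lines: Iterable[str]) -> Dict[str, str]:
    """Build a symbol -> GeneID dict from two-column, tab-separated lines."""
    table: Dict[str, str] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or fields[0] == "Symbol":
            continue  # skip the header row and malformed rows
        symbol, gene_id = fields
        table.setdefault(symbol, gene_id)  # keep the first ID seen per symbol
    return table


# Sample rows copied from the output above; a real run would read the saved file.
sample = "Symbol\tGeneID\nTP53\t7157\nEGFR\t1956\nTNF\t7124\n"
symbol_to_id = parse_symbol_table(sample.splitlines())

# Symbols missing from the table map to None rather than raising an error.
wanted = ["TP53", "EGFR", "NOT_A_GENE"]
mapped = {gene: symbol_to_id.get(gene) for gene in wanted}
print(mapped)
```

Keeping unmatched symbols as None makes it easy to list what still needs manual checking, for example outdated or alias symbols.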

Thank you so much, I was able to call this from within the script and get the results I wanted.

3.8 years ago

Hi,

Adding this more as a reference answer for future users.

genes <- c('BRCA1', 'BRCA2', 'BRCC3', 'ATM', 'TP53')

1, via org.Hs.eg.db

require(org.Hs.eg.db)
mapIds(
  org.Hs.eg.db,
  keys = genes,
  column = 'ENTREZID',
  keytype = 'SYMBOL')

BRCA1   BRCA2   BRCC3     ATM    TP53 
"672"   "675" "79184"   "472"  "7157" 


select(
  org.Hs.eg.db,
  keys = genes,
  columns = c('SYMBOL', 'ENTREZID', 'ENSEMBL'),
  keytype = 'SYMBOL')

  SYMBOL ENTREZID         ENSEMBL
1  BRCA1      672 ENSG00000012048
2  BRCA2      675 ENSG00000139618
3  BRCC3    79184 ENSG00000185515
4    ATM      472 ENSG00000149311
5   TP53     7157 ENSG00000141510

2, via biomaRt

require(biomaRt)
ensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')

annot <- getBM(
  attributes = c(
    'hgnc_symbol',
    'external_gene_name',
    'ensembl_gene_id',
    'entrezgene_id',
    'gene_biotype'),
  filters = 'external_gene_name',
  values = genes,
  mart = ensembl)

# left join, so every input gene is kept even if biomaRt returned no match
annot <- merge(
  x = as.data.frame(genes),
  y = annot,
  by.x = 'genes',
  by.y = 'external_gene_name',
  all.x = TRUE)

annot
  genes hgnc_symbol ensembl_gene_id entrezgene_id   gene_biotype
1   ATM         ATM ENSG00000149311           472 protein_coding
2 BRCA1       BRCA1 ENSG00000012048           672 protein_coding
3 BRCA2       BRCA2 ENSG00000139618           675 protein_coding
4 BRCC3       BRCC3 ENSG00000185515         79184 protein_coding
5  TP53        TP53 ENSG00000141510          7157 protein_coding

Kevin


Hi Kevin, this helped a lot! I was in a situation where some keys did not match between my TxDb and org.db objects, and I made it work using the mapIds() approach. I can't thank you enough. Take care.

3.8 years ago
vkkodali_ncbi ★ 3.8k

Adding NCBI Datasets as another option. Specifically, go to the Data Tables part of it and load all of the gene names, either by copying and pasting the list or by uploading a text file. You can then download either a gene-centric table or a transcript-centric table that can be opened in Excel, imported into R, etc., as well as sequence data if needed.
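
For the upload route, the symbol list first has to be pulled out of the original input. Below is a minimal sketch of that step, assuming the input looks like the file in the question (whitespace-separated, symbol in the first column, optionally quoted) and that one symbol per line is an acceptable upload format. The function name extract_symbols, the sample rows, and the output file name are all made up for this example.

```python
from typing import Iterable, List


def extract_symbols(lines: Iterable[str]) -> List[str]:
    """Return the unique first-column values in first-seen order."""
    seen = set()
    symbols = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        symbol = fields[0].strip('"')
        if symbol and symbol not in seen:
            seen.add(symbol)
            symbols.append(symbol)
    return symbols


# Made-up rows in the shape the question describes (quoted first column).
rows = ['"TP53" 12.5', '"EGFR" 3.1', '', '"TP53" 8.0']
symbols = extract_symbols(rows)

# Hypothetical output file name; writes one symbol per line.
with open("gene_symbols.txt", "w") as out:
    out.write("\n".join(symbols) + "\n")
```

Deduplicating before the upload keeps the list short and makes the downloaded table easier to join back to the original data.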
