Question

Getting genes' sequences by querying gene symbols/names

3.0 years ago

langziv ▴ 70

Hi.

I've been looking for a human genome API for this purpose. I found the NCBI Datasets API, but there seems to be a problem with the package. I also found the Ensembl API but could only find a way to query it with gene IDs, so I started looking for an API to convert gene symbols to IDs. I found an R package, but I'm using Python.

Is there an API for retrieving human sequences by querying gene symbols, or is there an API for converting gene symbols to IDs?

I'm also using a Linux system, in case you know relevant Linux tools.

Thanks!

gene-sequence gene-id human gene-symbol • 2.5k views

updated 3.0 years ago by MirianT_NCBI ▴ 760 • written 3.0 years ago by langziv ▴ 70

score 1 · Answer 1 · 2021-11-18

3.0 years ago

Emily 24k

You need to use the lookup/symbol endpoint in the Ensembl REST API together with the sequence/id endpoint. Script around them in your favourite programming language to run the two calls automatically.

Great. Thanks. How could I have missed it?

Reply by langziv ▴ 70, 3.0 years ago

Emily_Ensembl, on second thought: do you mean that there's a way to get gene sequences by querying with gene symbols directly, or that I need to get the IDs first and then use them to get the sequences (which is what I'm doing now)? I'm not sure what you meant by "use the lookup/symbol endpoint in the Ensembl REST API with the sequence/id endpoint".

If there's a way to get the sequences in one call, that would be better, obviously.

Reply by langziv ▴ 70, 3.0 years ago

Use the output of lookup/symbol as the input for sequence/id. Exercise 5.1 of this online course does exactly that. You can copy/paste the code from the sample answers in either R or Python3.
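For readers who want to see the chaining spelled out, here is a minimal Python sketch of the two calls. It is not the course's sample answer. It assumes the public server at https://rest.ensembl.org, the species name homo_sapiens, and that the JSON responses carry the stable ID under "id" and the sequence under "seq"; the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Ensembl REST server.
SERVER = "https://rest.ensembl.org"


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def symbol_to_id(symbol, species="homo_sapiens", fetch=fetch_json):
    """Step 1: lookup/symbol turns a gene symbol into an Ensembl stable ID."""
    path = f"/lookup/symbol/{species}/{urllib.parse.quote(symbol)}"
    return fetch(f"{SERVER}{path}?content-type=application/json")["id"]


def id_to_sequence(stable_id, fetch=fetch_json):
    """Step 2: sequence/id returns the sequence for that stable ID."""
    path = f"/sequence/id/{urllib.parse.quote(stable_id)}"
    return fetch(f"{SERVER}{path}?content-type=application/json")["seq"]


def symbol_to_sequence(symbol, species="homo_sapiens", fetch=fetch_json):
    """Chain the two endpoints: symbol -> stable ID -> sequence."""
    return id_to_sequence(symbol_to_id(symbol, species, fetch), fetch)
```

Calling symbol_to_sequence("TP53") would make two requests, one per endpoint. The fetch parameter is only there so the HTTP layer can be swapped out (for example in tests or to add rate limiting).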

Reply by Emily 24k, 3.0 years ago

score 1 · Answer 2 · 2021-11-18

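Using Entrez Direct (EDirect).

The command below prints only the FASTA headers. Remove the final grep ">" to get the individual coding sequences as well.
Change fasta_cds_na to fasta to get the full RefSeq transcript sequences.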

$ esearch -db gene -query "TP53 [gene] AND human [ORGN]" | elink -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta_cds_na | grep ">"
>lcl|NM_001126115.2_cds_NP_001119587.1_1 [gene=TP53] [db_xref=CCDS:CCDS73966.1] [protein=cellular tumor antigen p53 isoform d] [protein_id=NP_001119587.1] [location=30..815] [gbkey=CDS]
>lcl|NM_001126116.2_cds_NP_001119588.1_1 [gene=TP53] [db_xref=CCDS:CCDS73968.1] [protein=cellular tumor antigen p53 isoform e] [protein_id=NP_001119588.1] [location=30..659] [gbkey=CDS]
>lcl|NM_001126114.3_cds_NP_001119586.1_1 [gene=TP53] [db_xref=CCDS:CCDS45606.1] [protein=cellular tumor antigen p53 isoform b] [protein_id=NP_001119586.1] [location=143..1168] [gbkey=CDS]
>lcl|NM_001276697.3_cds_NP_001263626.1_1 [gene=TP53] [db_xref=CCDS:CCDS73963.1] [protein=cellular tumor antigen p53 isoform j] [protein_id=NP_001263626.1] [location=111..815] [gbkey=CDS]
>lcl|NM_001276696.3_cds_NP_001263625.1_1 [gene=TP53] [db_xref=CCDS:CCDS73971.1] [protein=cellular tumor antigen p53 isoform i] [protein_id=NP_001263625.1] [location=260..1168] [gbkey=CDS]
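Since the asker is working in Python, here is a small sketch for pulling the bracketed fields out of header lines like the ones above. It assumes every field has the form [key=value] and that values contain no closing bracket; parse_cds_header is a made-up helper name, not part of EDirect.

```python
import re

# Matches one "[key=value]" field in a fasta_cds_na header line.
FIELD = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_cds_header(line):
    """Split one header line into its record ID and a dict of fields."""
    # The record ID is everything between ">" and the first whitespace.
    record_id = line[1:].split(None, 1)[0]
    fields = dict(FIELD.findall(line))
    return record_id, fields
```

For the first header above, fields["protein_id"] would be NP_001119587.1 and fields["location"] would be 30..815, which makes it easy to pick one isoform out of the list.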

score 0 · Answer 3 · 2021-12-09

The NCBI Datasets command-line tool lets you download a list of genes by symbol from reference genomes. To download a list of human genes, put the symbols in a file (list.txt here) and use the command below:

 datasets download gene symbol --inputfile list.txt --taxon human --filename human_genes.zip
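If you would rather drive this from Python, a rough sketch is below. It only writes the input file and shells out to the same command shown above. It assumes the datasets binary is on your PATH and that the input file takes one symbol per line; the helper names are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def write_symbol_list(symbols, path="list.txt"):
    """Write gene symbols one per line (assumed input format), dropping blanks and repeats."""
    unique = list(dict.fromkeys(s.strip() for s in symbols if s.strip()))
    Path(path).write_text("\n".join(unique) + "\n")
    return unique


def datasets_command(list_path="list.txt", out_zip="human_genes.zip"):
    """Build the same command line as shown above, as an argument list."""
    return [
        "datasets", "download", "gene", "symbol",
        "--inputfile", str(list_path),
        "--taxon", "human",
        "--filename", str(out_zip),
    ]


def download_genes(symbols, list_path="list.txt", out_zip="human_genes.zip"):
    """Write the symbol list, then run the datasets CLI on it."""
    if shutil.which("datasets") is None:
        raise RuntimeError("the NCBI datasets CLI was not found on PATH")
    write_symbol_list(symbols, list_path)
    subprocess.run(datasets_command(list_path, out_zip), check=True)
    return out_zip
```

For example, download_genes(["TP53", "BRCA2"]) would write list.txt and then run the command, leaving human_genes.zip in the working directory.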

When you say that there's a problem with the NCBI Datasets API, do you mean the Python package? Would you mind describing the problem you found?

Thanks, and feel free to reach out if you have any other questions :)