Getting genes' sequences by querying gene symbols/names
3.0 years ago
langziv ▴ 70

Hi.

I've been looking for a human genome API for this purpose. I found the NCBI Datasets API, but there seems to be a problem with its package. I also found the Ensembl API, but I could only find a way to query it with gene IDs, so I started looking for an API to convert gene symbols to IDs. I found an R package, but I'm using Python.

Is there an API for retrieving human sequences by querying gene symbols, or is there an API for converting gene symbols to IDs?

I'm also using a Linux system, in case you know relevant Linux tools.

Thanks!

gene-sequence gene-id human gene-symbol • 2.5k views
3.0 years ago
Emily 24k

You need to use the lookup/symbol endpoint in the Ensembl REST API together with the sequence/id endpoint. Wrap the two calls in a script in your favourite programming language to run them automatically.


Great. Thanks. How could I have missed it?


Emily_Ensembl On second thought, do you mean that there's a way to get gene sequences by querying with gene symbols directly, or that I need to get the IDs first and then use them to get the sequences (which is what I'm doing now)? I'm not sure what you meant by "use the lookup/symbol endpoint in the Ensembl REST API with the sequence/id endpoint".

If there's a way to get the sequences in one call, that would obviously be better.


Use the output of lookup/symbol as the input for sequence/id. Exercise 5.1 of this online course does exactly that. You can copy/paste the code from the sample answers in either R or Python3.
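
For reference, here is a minimal Python 3 sketch of that two-step chain (it is not the course's sample answer). It assumes the public server at https://rest.ensembl.org and uses only the standard library. The names `lookup_url`, `sequence_url`, `http_get` and `symbol_to_fasta` are made up for this illustration, and there is no handling of rate limits or unknown symbols.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Ensembl REST server.
SERVER = "https://rest.ensembl.org"


def lookup_url(symbol, species="homo_sapiens"):
    """Build the lookup/symbol URL for one gene symbol."""
    return f"{SERVER}/lookup/symbol/{species}/{urllib.parse.quote(symbol)}"


def sequence_url(stable_id, seq_type="genomic"):
    """Build the sequence/id URL for one Ensembl stable ID."""
    return f"{SERVER}/sequence/id/{urllib.parse.quote(stable_id)}?type={seq_type}"


def http_get(url, content_type):
    """Fetch a URL, asking the server for the given response format."""
    request = urllib.request.Request(url, headers={"Content-Type": content_type})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


def symbol_to_fasta(symbol, get=http_get):
    """Step 1: symbol -> stable ID. Step 2: stable ID -> FASTA text."""
    gene = json.loads(get(lookup_url(symbol), "application/json"))
    return get(sequence_url(gene["id"]), "text/x-fasta")
```

Calling `print(symbol_to_fasta("TP53"))` should print the gene's genomic sequence in FASTA format. It is still two HTTP requests per gene; the `get` argument is only there so the function can be tried without network access.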

3.0 years ago
GenoMax 147k

Using EntrezDirect.

Remove the trailing grep ">" to get the sequences themselves rather than only the headers (one CDS per RefSeq transcript).
Change fasta_cds_na to fasta to get the full transcript sequence.

$ esearch -db gene -query "TP53 [gene] AND human [ORGN]" | elink -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta_cds_na | grep ">"
>lcl|NM_001126115.2_cds_NP_001119587.1_1 [gene=TP53] [db_xref=CCDS:CCDS73966.1] [protein=cellular tumor antigen p53 isoform d] [protein_id=NP_001119587.1] [location=30..815] [gbkey=CDS]
>lcl|NM_001126116.2_cds_NP_001119588.1_1 [gene=TP53] [db_xref=CCDS:CCDS73968.1] [protein=cellular tumor antigen p53 isoform e] [protein_id=NP_001119588.1] [location=30..659] [gbkey=CDS]
>lcl|NM_001126114.3_cds_NP_001119586.1_1 [gene=TP53] [db_xref=CCDS:CCDS45606.1] [protein=cellular tumor antigen p53 isoform b] [protein_id=NP_001119586.1] [location=143..1168] [gbkey=CDS]
>lcl|NM_001276697.3_cds_NP_001263626.1_1 [gene=TP53] [db_xref=CCDS:CCDS73963.1] [protein=cellular tumor antigen p53 isoform j] [protein_id=NP_001263626.1] [location=111..815] [gbkey=CDS]
>lcl|NM_001276696.3_cds_NP_001263625.1_1 [gene=TP53] [db_xref=CCDS:CCDS73971.1] [protein=cellular tumor antigen p53 isoform i] [protein_id=NP_001263625.1] [location=260..1168] [gbkey=CDS]
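
Since the original poster is working in Python, here is a small sketch for pulling the bracketed key=value fields out of header lines like the ones above. `parse_cds_header` is a hypothetical helper, and it assumes that field values never contain a closing square bracket, which holds for the headers shown here but is not guaranteed in general.

```python
import re

# Matches one "[key=value]" field in an efetch fasta_cds_na header line.
FIELD = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_cds_header(line):
    """Return the header fields as a dict, with the sequence ID under 'id'."""
    fields = dict(FIELD.findall(line))
    # The sequence ID is everything between '>' and the first whitespace.
    fields["id"] = line.lstrip(">").split(None, 1)[0]
    return fields
```

For the first header above, `parse_cds_header(line)["location"]` gives `"30..815"` and `["protein_id"]` gives `"NP_001119587.1"`.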

Thanks.
I couldn't find information on the -target and -name parameters for elink. Also, can you query multiple genes? Something like
esearch -db gene -query "TP53 A2M [gene] AND human [ORGN]"


NCBI has an online book for EntrezDirect. It is not the easiest package to understand, but once you get the basics it can be a powerful query tool.

Since you want specific information about each gene, run the queries one at a time. Use a for loop to walk through a file of gene symbols, one per line.

$ more id
TP53
A2M
TTN

$ for i in `cat id`; do esearch -db gene -query "${i} [gene] AND human [ORGN]" | elink -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta_cds_na ; done
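
If you would rather drive the same loop from Python, a rough sketch with `subprocess` could look like this. It assumes the EntrezDirect tools are installed and on your PATH; `build_pipeline` and `fetch_cds` are hypothetical names, and the pipeline string is just the command above with the gene symbol substituted in.

```python
import shlex
import subprocess


def build_pipeline(symbol, organism="human", fmt="fasta_cds_na"):
    """Return the esearch | elink | efetch shell pipeline for one gene symbol."""
    query = f"{symbol} [gene] AND {organism} [ORGN]"
    return (
        f"esearch -db gene -query {shlex.quote(query)} "
        "| elink -target nuccore -name gene_nuccore_refseqrna "
        f"| efetch -format {shlex.quote(fmt)}"
    )


def fetch_cds(symbols, run=subprocess.run):
    """Run the pipeline once per symbol and collect the FASTA output."""
    results = {}
    for symbol in symbols:
        completed = run(
            build_pipeline(symbol),
            shell=True,                # the pipeline uses shell pipes
            stdin=subprocess.DEVNULL,  # keep the tools from waiting on our stdin
            capture_output=True,
            text=True,
            check=True,                # raise if the pipeline exits non-zero
        )
        results[symbol] = completed.stdout
    return results
```

With a file like `id` above, `fetch_cds(open("id").read().split())` returns a dict mapping each symbol to its FASTA text.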
3.0 years ago
MirianT_NCBI ▴ 760

The NCBI Datasets command-line tool lets you download a list of genes by symbol from reference genomes. To download a list of human genes, with one symbol per line in list.txt, you can use the command below:

 datasets download gene symbol --inputfile list.txt --taxon human --filename human_genes.zip
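
Since you mentioned Python, here is a hedged sketch for reading sequences out of the downloaded zip with the standard library. It assumes the archive contains FASTA files ending in .fna or .faa; the exact layout depends on the datasets version and on which files you asked it to include, so check the archive contents first. `read_fasta_from_zip` is a hypothetical helper name.

```python
import io
import zipfile


def read_fasta_from_zip(zip_source, suffixes=(".fna", ".faa")):
    """Map each FASTA header found in the archive to its full sequence."""
    records = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.endswith(suffixes):
                continue  # skip reports, READMEs and other non-FASTA members
            header = None
            with archive.open(name) as handle:
                for raw in io.TextIOWrapper(handle, encoding="utf-8"):
                    line = raw.strip()
                    if line.startswith(">"):
                        header = line[1:]
                        records[header] = []
                    elif header is not None and line:
                        records[header].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {header: "".join(parts) for header, parts in records.items()}
```

For example, `read_fasta_from_zip("human_genes.zip")` would return a dict of header to sequence for every FASTA record in the archive.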

When you say that there's a problem with the NCBI Datasets API, do you mean the Python package? Would you mind describing the problem you found?

Thanks, and feel free to reach out if you have any other questions :)
