I've been looking for a human genome API for this purpose. I found the NCBI Datasets API, but there seems to be a problem with this package. I also found an Ensembl API, but could only find a way to query it with gene IDs, so I started looking for an API to convert gene symbols to IDs. I found an R package, but I'm using Python.
Is there an API for retrieving human sequences by querying gene symbols, or is there an API for converting gene symbols to IDs?
I'm also using a Linux system, in case you know relevant Linux tools.
You need to use the lookup/symbol endpoint in the Ensembl REST API together with the sequence/id endpoint. Write a script in your favourite programming language to automate the two calls.
Emily_Ensembl On second thought, do you mean that there's a way to get gene sequences by querying with gene symbols, or that I need to get the IDs first and then use them to get the sequences (which is what I'm doing now)? I'm not sure what you meant by "use the lookup/symbol endpoint in the Ensembl REST API with the sequence/id endpoint".
If there's a way to get the sequences in one call it will be better, obviously.
Use the output of lookup/symbol as the input for sequence/id. Exercise 5.1 of this online course does exactly that. You can copy/paste the code from the sample answers in either R or Python3.
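For anyone who wants the two-step chaining spelled out, here is a minimal Python 3 sketch using only the standard library. It is not the course's sample answer. The function names (http_get, symbol_to_id, id_to_fasta, symbol_to_fasta) and the injectable fetch parameter are made up for illustration, and the defaults (homo_sapiens, type=genomic, FASTA output) are assumptions you may want to change.

```python
import json
import urllib.parse
import urllib.request

SERVER = "https://rest.ensembl.org"


def http_get(url, content_type):
    """Fetch a URL and return the response body as text."""
    request = urllib.request.Request(url, headers={"Content-Type": content_type})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def symbol_to_id(symbol, species="homo_sapiens", fetch=http_get):
    """Step 1: lookup/symbol turns a gene symbol into an Ensembl stable ID."""
    url = f"{SERVER}/lookup/symbol/{species}/{urllib.parse.quote(symbol)}"
    return json.loads(fetch(url, "application/json"))["id"]


def id_to_fasta(stable_id, seq_type="genomic", fetch=http_get):
    """Step 2: sequence/id returns the sequence for that stable ID as FASTA."""
    url = f"{SERVER}/sequence/id/{stable_id}?type={seq_type}"
    return fetch(url, "text/x-fasta")


def symbol_to_fasta(symbol, species="homo_sapiens", seq_type="genomic", fetch=http_get):
    """Chain the two calls: symbol -> stable ID -> FASTA text."""
    stable_id = symbol_to_id(symbol, species, fetch)
    return id_to_fasta(stable_id, seq_type, fetch)
```

Calling symbol_to_fasta("TP53") makes two requests, so it is still two calls under the hood, just wrapped in one function. The fetch argument is only there so the HTTP layer can be swapped out, for example for testing without network access.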
Thanks.
I couldn't find information on the -target and -name parameters for elink. Also, can you query multiple genes? Something like esearch -db gene -query "TP53 A2M [gene] AND human [ORGN]"
NCBI has an online book for Entrez Direct. It is not the easiest package to understand, but once you get the basics it can be a powerful query tool.
Since you are interested in retrieving specific information about each query, you will want to run the queries individually. Use a for loop to walk through a list of gene symbols, one per line in a file (here called id).
$ more id
TP53
A2M
TTN
$ for i in $(cat id); do esearch -db gene -query "${i} [gene] AND human [ORGN]" | elink -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta_cds_na; done
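Since the question mentions Python, the same per-symbol loop can be driven from a script. This is only a sketch: it assumes the Entrez Direct tools (esearch, elink, efetch) are installed and on your PATH, and the function names build_pipeline, run_pipeline and fetch_cds_for_symbols are made up for illustration.

```python
import subprocess


def build_pipeline(symbol):
    """Return the three Entrez Direct commands of the shell loop for one symbol."""
    return [
        ["esearch", "-db", "gene", "-query", f"{symbol} [gene] AND human [ORGN]"],
        ["elink", "-target", "nuccore", "-name", "gene_nuccore_refseqrna"],
        ["efetch", "-format", "fasta_cds_na"],
    ]


def run_pipeline(commands):
    """Run the commands in order, feeding each one's stdout to the next one's stdin."""
    data = ""  # the first command gets empty stdin, like reading from /dev/null
    for command in commands:
        result = subprocess.run(
            command, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data


def fetch_cds_for_symbols(path):
    """Read one gene symbol per line from path and return {symbol: FASTA text}."""
    with open(path) as handle:
        symbols = [line.strip() for line in handle if line.strip()]
    return {symbol: run_pipeline(build_pipeline(symbol)) for symbol in symbols}
```

With the id file from above, fetch_cds_for_symbols("id") would return one FASTA string per symbol. Because check=True is set, a failing command raises subprocess.CalledProcessError instead of silently returning an empty result.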
The NCBI Datasets command line tool allows the user to download a list of genes by symbol from reference genomes. If you want to download a list of human genes, you can use this command below:
datasets download gene symbol --inputfile list.txt --taxon human --filename human_genes.zip
When you say that there's a problem with the NCBI Datasets API, do you mean the Python package? Would you mind describing the problem you found?
Thanks, and feel free to reach out if you have any other questions :)
Great, thanks. How could I have missed it?