Question

Journey from gene id to gene sequence

0

22 months ago

Shweta • 0

How can I download the gene sequences for a list of 2,500 gene IDs?

NCBI Gene-id • 844 views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 22 months ago by Shweta • 0

5

Don't take this the wrong way, but you have been posting single-line questions for some time. This suggests that you are not putting enough effort or thought into the question at hand.

Based on the tags you added, it appears that you want to get this information from NCBI, but simply saying "gene id" does not tell us which IDs you are working with. Unless that information is included, it is difficult for people to provide answers. Please edit the original question and add some examples. Tell us what you have tried so far.

ADD REPLY • link 22 months ago by GenoMax 148k

score 0 · Answer 1 · 2023-02-27

0

22 months ago

Dave Carlson ★ 2.1k

If you have Ensembl IDs, you could consider using gget:

https://github.com/pachterlab/gget

score 0 · Answer 2 · 2023-02-27

Hi Shweta,
If you are referring to NCBI Gene IDs, you can use NCBI Datasets for that task. To download only gene sequences, you can use the following command:

datasets download gene gene-id --inputfile mylist.txt --include gene

This command will download a zip archive (ncbi_dataset.zip) with the gene sequences of the gene-ids in your list (mylist.txt in the example) plus metadata information about the genes as a JSON-Lines file (data_report.jsonl). Unzipping into a new folder will produce this result:

unzip ncbi_dataset.zip -d mygenes
Archive:  ncbi_dataset.zip
  inflating: mygenes/README.md       
  inflating: mygenes/ncbi_dataset/data/gene.fna  
  inflating: mygenes/ncbi_dataset/data/data_report.jsonl  
  inflating: mygenes/ncbi_dataset/data/dataset_catalog.json
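
As a quick sanity check after unzipping, one could compare the number of FASTA records with the number of IDs requested. This is only a sketch: `count_fasta_records` and `count_ids` are hypothetical helper names, the paths follow the listing above, and the two counts need not match exactly (an ID may be missing from the result, or a gene may yield more than one record).

```shell
#!/usr/bin/env bash

# Hypothetical helper: count FASTA records (header lines start with '>').
count_fasta_records() {
  grep -c '^>' "$1"
}

# Hypothetical helper: count non-empty lines in an ID list.
count_ids() {
  grep -c . "$1"
}

# Paths taken from the example above; the report runs only if both files exist.
fasta=mygenes/ncbi_dataset/data/gene.fna
list=mylist.txt
if [ -f "$fasta" ] && [ -f "$list" ]; then
  echo "requested: $(count_ids "$list"), records: $(count_fasta_records "$fasta")"
fi
```

If the numbers differ, the metadata in data_report.jsonl is a reasonable place to look for which IDs were actually returned.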

One potential issue is that all genes will be in the same FASTA file (gene.fna). If you want each gene as a separate FASTA, you can loop over the list and download each gene-id as its own data package:

# Read one gene ID per line and download each as its own zip archive
while read -r GENEID; do
  datasets download gene gene-id "${GENEID}" --include gene --filename "${GENEID}.zip"
done < mylist.txt
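
An alternative to 2,500 separate downloads might be to split the combined gene.fna locally. The sketch below assumes that each FASTA header carries a tag of the form `[GeneID=<number>]`; please check your own gene.fna before relying on it. The function name `split_fasta_by_geneid` and the output directory are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: split a combined FASTA into one file per gene ID.
# ASSUMPTION: each header line contains a "[GeneID=<number>]" tag.
# Records without such a tag go to unassigned.fna.
# Files are opened in append mode, so use a fresh output directory per run.
split_fasta_by_geneid() {
  local fasta=$1 outdir=$2
  mkdir -p "$outdir"
  awk -v outdir="$outdir" '
    /^>/ {
      if (out != "") close(out)   # keep the number of open files low
      if (match($0, /\[GeneID=[0-9]+\]/)) {
        # "[GeneID=" is 8 characters long; drop it and the closing "]"
        id = substr($0, RSTART + 8, RLENGTH - 9)
        out = outdir "/" id ".fna"
      } else {
        out = outdir "/unassigned.fna"
      }
    }
    out != "" { print >> out }
  ' "$fasta"
}

# Example call (paths from the unzip step above):
#   split_fasta_by_geneid mygenes/ncbi_dataset/data/gene.fna per_gene
```

Because output is appended, several records that share a gene ID end up in the same per-gene file.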

I hope it helps!