Hi,
You can use NCBI Datasets for this task.
Using the NCBI Datasets gene option, you can download using a list of accessions (you mentioned hundreds of genes). Assumptions:
- You want to have each CDS as a separate FASTA file (instead of CDS from different genes all in the same FASTA). I"m using a loop for that
- You're working from a new folder (let's call it all-genes)
I'm using a txt file with two gene accessions (one per line):
cat list.txt
NM_021804.3
NM_001354526.1
Here's the example:
cat list.txt | while read GENE; do
datasets download gene accession "$GENE" --include cds --filename "$GENE".zip;
done
By default, when a user requests a download by gene accession, NCBI Datasets assumes that the user wants all sequences under the same gene-id. For example: for the accession you provided (NM_001354526.1), there are 68 sequences in the CDS FASTA. If you want to restrict your download to only the accession you searched for, you can add the the flag -fasta-filter
to the download command. Like this:
cat list.txt | while read GENE; do
datasets download gene accession "$GENE" --include cds --fasta-filter "$GENE" --filename "$GENE".zip;
done
After downloading the gene data packages, you can unzip them and you will find this folder structure:
for f in *.zip; do unzip $f -d ${f/.zip/}; done
Archive: NM_001354526.1.zip
inflating: NM_001354526.1/README.md
inflating: NM_001354526.1/ncbi_dataset/data/cds.fna
inflating: NM_001354526.1/ncbi_dataset/data/data_report.jsonl
inflating: NM_001354526.1/ncbi_dataset/data/dataset_catalog.json
Archive: NM_021804.3.zip
inflating: NM_021804.3/README.md
inflating: NM_021804.3/ncbi_dataset/data/cds.fna
inflating: NM_021804.3/ncbi_dataset/data/data_report.jsonl
inflating: NM_021804.3/ncbi_dataset/data/dataset_catalog.json
For each gene package, you have the same folder structure. You will find the CDS FASTA in the folder ncbi_dataset/data. The CDS FASTA header has the following format:
>NM_021804.3:307-2724 ACE2 [organism=Homo sapiens] [GeneID=59272] [transcript=2] [region=cds]
To extract the metadata information you mentioned (chromosome, organism, etc), you can use the command datasets summary
, which outputs a JSON report to the screen that you can parse using jq
:
cat list.txt | while read GENE; do
datasets summary gene accession "$GENE" | jq -r '[.reports[]
| .query[],.gene.taxname,.gene.chromosomes[],.gene.symbol,.gene.gene_id, .gene.description]
| @csv' >> gene-info.csv;
done
I hope it helps. Please feel free to ask any questions or let me know of any issues.
How about this (sequence truncated for space) :
this does not include organism name in the header though.
You could run two commands to get the organism name and the perhaps write a script to incorporate it into the header.
hey GenoMax, thanks for your response! So sorry that I was delayed in writing back, the cluster I use was down for a bit so I moved on.. However, I did try what you suggested. Not sure if it's user error, but I wasn't able to get the docsum command to work. Using gpc as the format seemed to work best for me.
Mirian's suggestion hit the nail on the head, so no script writing for me - woohoo! thanks again for your comment though! :)