Question

Downloading bulk coding sequences from NCBI that include the organism name in the sequence header?


18 months ago

Em ▴ 20

Hi everyone!

I've tried searching for a possible solution to this for a little bit but haven't had any luck... However, I know there has to be someone out there with one!

I'm interested in downloading only the coding sequences for hundreds of genes in a FASTA nucleotide format from NCBI. I know it's easy to just download the coding sequences by using the 'send to:' feature. However, I need more information in the FASTA header, such as the source organism name and the chromosome.

I've tried using the following command for NCBI EDirect. I was able to retrieve the organism name, but I'm extracting the entire gene sequence - not the coding sequence:

$ efetch -db nuccore -id NM_001354526.1 -format gpc | xtract -pattern INSDSeq -element INSDSeq_organism INSDSeq_sequence

It seems that efetch is likely the best option, but I'm having a bit of difficulty with the syntax.

Thanks so much in advance, and please let me know if more context is needed.

efetch edirect ncbi • 2.2k views

updated 18 months ago by GenoMax 147k • written 18 months ago by Em ▴ 20


How about this (sequence truncated for space):

$ efetch -db nuccore -id NM_001354526.1 -format fasta_cds_na
>lcl|NM_001354526.1_cds_NP_001341455.1_1 [gene=UBE3A] [db_xref=CCDS:CCDS32177.1] [protein=ubiquitin-protein ligase E3A isoform 1] [protein_id=NP_001341455.1] [location=663..3221] [gbkey=CDS]
ATGAAGCGAGCAGCTGCAAAGCATCTAATAGAACGCTACTACCACCAGTTAACTGAGGGCTGTGGAAATG
AAGCCTGCACGAATGAGTTTTGTGCTTCCTGTCCAACTTTTCTTCGTATGGATAATAATGCAGCAGCTAT
TAAAGCCCTCGAGCTTTATAAGATTAATGCAAAACTCTGTGATCCTCATCCCTCCAAGAAAGGAGCAAGC
TCAGCTTACCTTGAGAACTCGAAAGGTGCCCCCAACAACTCCTGCTCTGAGATAAAAATGAACAAGAAAG
GCGCTAGAATTGATTTTAAAGATGTGACTTACTTAACAGAAGAGAAGGTATATGAAATTCTTGAATTATG
TAGAGAAAGAGAGGATTATTCCCCTTTAATCCGTGTTATTGGAAGAGTTTTTTCTAGTGCTGAGGCATTG

This does not include the organism name in the header, though.

You could run two commands to get the organism name and then perhaps write a script to incorporate it into the header.

$ efetch -db nuccore -id NM_001354526.1 -format docsum | xtract -pattern DocumentSummary -element Organism; efetch -db nuccore -id NM_001354526.1 -format fasta_cds_na
Homo sapiens
>lcl|NM_001354526.1_cds_NP_001341455.1_1 [gene=UBE3A] [db_xref=CCDS:CCDS32177.1] [protein=ubiquitin-protein ligase E3A isoform 1] [protein_id=NP_001341455.1] [location=663..3221] [gbkey=CDS]
ATGAAGCGAGCAGCTGCAAAGCATCTAATAGAACGCTACTACCACCAGTTAACTGAGGGCTGTGGAAATG
AAGCCTGCACGAATGAGTTTTGTGCTTCCTGTCCAACTTTTCTTCGTATGGATAATAATGCAGCAGCTAT
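The "script" step could look something like the following. This is only a minimal sketch: add_organism_to_headers is a hypothetical helper name (not part of EDirect), and it assumes the organism name has already been captured into a shell variable from the docsum command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an EDirect command): read FASTA on stdin and
# append "[organism=NAME]" to every header line; sequence lines pass through.
add_organism_to_headers() {
    awk -v org="$1" '/^>/ { print $0 " [organism=" org "]"; next } { print }'
}

# Example usage (needs EDirect and network access, so it is left commented out):
# org=$(efetch -db nuccore -id NM_001354526.1 -format docsum \
#         | xtract -pattern DocumentSummary -element Organism)
# efetch -db nuccore -id NM_001354526.1 -format fasta_cds_na \
#         | add_organism_to_headers "$org"
```

The tag is appended rather than inserted so the existing bracketed fields in the header stay where downstream tools expect them.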

18 months ago by GenoMax 147k


Hey GenoMax, thanks for your response! So sorry that I was delayed in writing back; the cluster I use was down for a bit, so I moved on. However, I did try what you suggested. Not sure if it's user error, but I wasn't able to get the docsum command to work. Using gpc as the format seemed to work best for me.

$ efetch -db nuccore -id NM_001354526.1 -format gpc | xtract -pattern INSDSeq -element INSDSeq_organism INSDSeq_accession-version; efetch -db nuccore -id NM_001354526.1 -format fasta_cds_na

Mirian's suggestion hit the nail on the head, so no script writing for me - woohoo! Thanks again for your comment though! :)

18 months ago by Em ▴ 20

score 5 · Accepted Answer · 2023-05-23

Hi,

You can use NCBI Datasets for this task.

Using the NCBI Datasets gene option, you can download using a list of accessions (you mentioned hundreds of genes). Assumptions:

You want each CDS as a separate FASTA file (instead of CDS from different genes all in the same FASTA). I'm using a loop for that.
You're working from a new folder (let's call it all-genes).

I'm using a txt file with two gene accessions (one per line):

cat list.txt
NM_021804.3
NM_001354526.1

Here's the example:

while read -r GENE; do
  datasets download gene accession "$GENE" --include cds --filename "$GENE".zip
done < list.txt

By default, when a user requests a download by gene accession, NCBI Datasets assumes that the user wants all sequences under the same gene-id. For example, for the accession you provided (NM_001354526.1), there are 68 sequences in the CDS FASTA. If you want to restrict your download to only the accession you searched for, you can add the flag --fasta-filter to the download command, like this:

while read -r GENE; do
  datasets download gene accession "$GENE" --include cds --fasta-filter "$GENE" --filename "$GENE".zip
done < list.txt

After downloading the gene data packages, you can unzip them and you will find this folder structure:

for f in *.zip; do unzip "$f" -d "${f%.zip}"; done
Archive:  NM_001354526.1.zip
  inflating: NM_001354526.1/README.md  
  inflating: NM_001354526.1/ncbi_dataset/data/cds.fna  
  inflating: NM_001354526.1/ncbi_dataset/data/data_report.jsonl  
  inflating: NM_001354526.1/ncbi_dataset/data/dataset_catalog.json  
Archive:  NM_021804.3.zip
  inflating: NM_021804.3/README.md   
  inflating: NM_021804.3/ncbi_dataset/data/cds.fna  
  inflating: NM_021804.3/ncbi_dataset/data/data_report.jsonl  
  inflating: NM_021804.3/ncbi_dataset/data/dataset_catalog.json

For each gene package, you have the same folder structure. You will find the CDS FASTA in the folder ncbi_dataset/data. The CDS FASTA header has the following format:
>NM_021804.3:307-2724 ACE2 [organism=Homo sapiens] [GeneID=59272] [transcript=2] [region=cds]
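If you want a quick check that every downloaded CDS carries the organism you expect, the header format above is easy to parse. Here is a rough sketch; list_cds_organisms is a hypothetical helper name, and it assumes the headers follow exactly the layout shown above (accession, then a colon and the CDS range, with an [organism=...] tag somewhere on the line).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a Datasets CDS FASTA on stdin and print
# "accession<TAB>organism" for each header line. Sequence lines are ignored.
list_cds_organisms() {
    awk '/^>/ {
        acc = substr($1, 2)          # drop the leading ">"
        sub(/:.*/, "", acc)          # drop the ":start-end" CDS range
        org = ""
        if (match($0, /\[organism=[^]]*\]/))
            org = substr($0, RSTART + 10, RLENGTH - 11)   # text inside the tag
        print acc "\t" org
    }'
}

# Example usage over all unzipped packages (paths as in the listing above):
# cat */ncbi_dataset/data/cds.fna | list_cds_organisms
```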

To extract the metadata information you mentioned (chromosome, organism, etc), you can use the command datasets summary, which outputs a JSON report to the screen that you can parse using jq:

while read -r GENE; do
  datasets summary gene accession "$GENE" | jq -r '[.reports[]
  | .query[],.gene.taxname,.gene.chromosomes[],.gene.symbol,.gene.gene_id,.gene.description]
  | @csv' >> gene-info.csv
done < list.txt
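Since the CDS header already has the organism but not the chromosome, you could fold the chromosome from the summary back into the FASTA headers. The following is just a sketch under a few assumptions: add_chromosome_to_headers and lookup.tsv are hypothetical names, and the lookup is a two-column, tab-separated file (accession, chromosome) that you would build yourself, for example with jq's @tsv output using the same fields as in the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append "[chromosome=...]" to FASTA headers read from stdin.
# $1 is a lookup file (assumed format: accession<TAB>chromosome, one pair per line).
# Note: the lookup file must not be empty, or the NR == FNR trick below misfires.
add_chromosome_to_headers() {
    awk -F '\t' '
        NR == FNR { chrom[$1] = $2; next }   # first input: load the lookup table
        /^>/ {
            acc = $0
            sub(/^>/, "", acc)               # drop the leading ">"
            sub(/[: ].*/, "", acc)           # keep only the accession
            if (acc in chrom) $0 = $0 " [chromosome=" chrom[acc] "]"
        }
        { print }
    ' "$1" -
}

# Possible way to build the lookup (assumes the JSON fields used above; untested sketch):
# datasets summary gene accession "$GENE" \
#   | jq -r '.reports[] | [.query[0], .gene.chromosomes[0]] | @tsv' >> lookup.tsv
# add_chromosome_to_headers lookup.tsv < "$GENE"/ncbi_dataset/data/cds.fna
```

Headers whose accession is missing from the lookup are passed through unchanged, so nothing is lost if the summary call failed for a gene.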

I hope it helps. Please feel free to ask any questions or let me know of any issues.