Downloading bulk coding sequences from NCBI that include the organism name in the sequence header?
1
0
Entering edit mode
18 months ago
Em ▴ 20

Hi everyone!

I've tried searching for a possible solution to this for a little bit but haven't had any luck... However, I know there has to be someone out there with one!

I'm interested in downloading only the coding sequences for hundreds of genes in a FASTA nucleotide format from NCBI. I know it's easy to just download the coding sequences by using the 'send to:' feature. However, I need more information in the FASTA header, such as the source organism name and the chromosome.

I've tried using the following command for NCBI EDirect. I was able to retrieve the organism name, but I'm extracting the entire gene sequence - not the coding sequence:

$ efetch -db nuccore -id NM_001354526.1 -format gpc | xtract -pattern INSDSeq -element INSDSeq_organism INSDSeq_sequence

It seems that efetch is likely the best option, but I'm having a bit of difficulties with the syntax.

Thanks so much in advance, and please let me know if more context is needed.

efetch edirect ncbi • 2.2k views
ADD COMMENT
0
Entering edit mode

How about this (sequence truncated for space) :

$ efetch -db nuccore -id NM_001354526.1 -format fasta_cds_na
>lcl|NM_001354526.1_cds_NP_001341455.1_1 [gene=UBE3A] [db_xref=CCDS:CCDS32177.1] [protein=ubiquitin-protein ligase E3A isoform 1] [protein_id=NP_001341455.1] [location=663..3221] [gbkey=CDS]
ATGAAGCGAGCAGCTGCAAAGCATCTAATAGAACGCTACTACCACCAGTTAACTGAGGGCTGTGGAAATG
AAGCCTGCACGAATGAGTTTTGTGCTTCCTGTCCAACTTTTCTTCGTATGGATAATAATGCAGCAGCTAT
TAAAGCCCTCGAGCTTTATAAGATTAATGCAAAACTCTGTGATCCTCATCCCTCCAAGAAAGGAGCAAGC
TCAGCTTACCTTGAGAACTCGAAAGGTGCCCCCAACAACTCCTGCTCTGAGATAAAAATGAACAAGAAAG
GCGCTAGAATTGATTTTAAAGATGTGACTTACTTAACAGAAGAGAAGGTATATGAAATTCTTGAATTATG
TAGAGAAAGAGAGGATTATTCCCCTTTAATCCGTGTTATTGGAAGAGTTTTTTCTAGTGCTGAGGCATTG

this does not include organism name in the header though.

You could run two commands to get the organism name and the perhaps write a script to incorporate it into the header.

$ efetch -db nuccore -id NM_001354526.1 -format docsum | xtract -pattern DocumentSummary -element Organism; efetch -db nuccore -id NM_001354526.1 -format fasta_cds_na
Homo sapiens
>lcl|NM_001354526.1_cds_NP_001341455.1_1 [gene=UBE3A] [db_xref=CCDS:CCDS32177.1] [protein=ubiquitin-protein ligase E3A isoform 1] [protein_id=NP_001341455.1] [location=663..3221] [gbkey=CDS]
ATGAAGCGAGCAGCTGCAAAGCATCTAATAGAACGCTACTACCACCAGTTAACTGAGGGCTGTGGAAATG
AAGCCTGCACGAATGAGTTTTGTGCTTCCTGTCCAACTTTTCTTCGTATGGATAATAATGCAGCAGCTAT
ADD REPLY
0
Entering edit mode

hey GenoMax, thanks for your response! So sorry that I was delayed in writing back, the cluster I use was down for a bit so I moved on.. However, I did try what you suggested. Not sure if it's user error, but I wasn't able to get the docsum command to work. Using gpc as the format seemed to work best for me.

$ efetch -db nuccore -id NM_001354526.1 -format gpc | xtract -pattern INSDSeq -element INSDSeq_organism INSDSeq_accession-version; efetch -db nuccore -id NM_001354526.1 -format fasta_cds_na

Mirian's suggestion hit the nail on the head, so no script writing for me - woohoo! thanks again for your comment though! :)

ADD REPLY
5
Entering edit mode
18 months ago
MirianT_NCBI ▴ 760

Hi,

You can use NCBI Datasets for this task.

Using the NCBI Datasets gene option, you can download using a list of accessions (you mentioned hundreds of genes). Assumptions:

  • You want to have each CDS as a separate FASTA file (instead of CDS from different genes all in the same FASTA). I"m using a loop for that
  • You're working from a new folder (let's call it all-genes)

I'm using a txt file with two gene accessions (one per line):

cat list.txt
NM_021804.3
NM_001354526.1

Here's the example:

cat list.txt | while read GENE; do
datasets download gene accession "$GENE" --include cds --filename "$GENE".zip;
done

By default, when a user requests a download by gene accession, NCBI Datasets assumes that the user wants all sequences under the same gene-id. For example: for the accession you provided (NM_001354526.1), there are 68 sequences in the CDS FASTA. If you want to restrict your download to only the accession you searched for, you can add the the flag -fasta-filter to the download command. Like this:

cat list.txt | while read GENE; do
datasets download gene accession "$GENE" --include cds --fasta-filter "$GENE" --filename "$GENE".zip;
done

After downloading the gene data packages, you can unzip them and you will find this folder structure:

for f in *.zip; do unzip $f -d ${f/.zip/}; done
Archive:  NM_001354526.1.zip
  inflating: NM_001354526.1/README.md  
  inflating: NM_001354526.1/ncbi_dataset/data/cds.fna  
  inflating: NM_001354526.1/ncbi_dataset/data/data_report.jsonl  
  inflating: NM_001354526.1/ncbi_dataset/data/dataset_catalog.json  
Archive:  NM_021804.3.zip
  inflating: NM_021804.3/README.md   
  inflating: NM_021804.3/ncbi_dataset/data/cds.fna  
  inflating: NM_021804.3/ncbi_dataset/data/data_report.jsonl  
  inflating: NM_021804.3/ncbi_dataset/data/dataset_catalog.json

For each gene package, you have the same folder structure. You will find the CDS FASTA in the folder ncbi_dataset/data. The CDS FASTA header has the following format:
>NM_021804.3:307-2724 ACE2 [organism=Homo sapiens] [GeneID=59272] [transcript=2] [region=cds]

To extract the metadata information you mentioned (chromosome, organism, etc), you can use the command datasets summary, which outputs a JSON report to the screen that you can parse using jq:

cat list.txt | while read GENE; do 
datasets summary gene accession "$GENE" | jq -r '[.reports[] 
| .query[],.gene.taxname,.gene.chromosomes[],.gene.symbol,.gene.gene_id, .gene.description] 
| @csv' >> gene-info.csv; 
done

I hope it helps. Please feel free to ask any questions or let me know of any issues.

ADD COMMENT
2
Entering edit mode

Thanks Mirian! This is definitely the solution that I was looking for, I appreciate your help! :-)

ADD REPLY
0
Entering edit mode

Apologies for this followup question but this seems like a good place to ask.

The design philosophy behind creating these "download packages" may be useful when user wants more than one piece of information. On the other hand if a user wants a single file (",fna") why not simply make that available directly with perhaps an additional command line switch. This creation of folders and the addition hassle of parsing that comes with that seems like an added burden for datasets.

ADD REPLY
0
Entering edit mode

Thanks for the question. We acknowledge that certain users may perceive the file structure as a hassle (for this example in particular, I suggested a loop to download each gene as a separate data package, which is not truly necessary since datasets can accept a list of genes for download). However, it is important to package sequence and metadata together to enhance attribution and facilitate reusability. Hence, every data package downloaded through NCBI Datasets includes a basic metadata report for this purpose. We're under constant development and always looking for feedback to improve our tools. If you have any other thoughts on NCBI Datasets, we would love to hear that.

ADD REPLY

Login before adding your answer.

Traffic: 2053 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6