Question

Downloading Reference Proteomes Using OTU identifiers

3.6 years ago

katieostrouchov ▴ 30

Hi,

I am trying to download reference proteomes for a list of amplicon OTUs to generate pseudo-metagenomes. I have tried a couple of tools, ncbi_datasets and edirect, but neither has let me download only the reference proteome sequences. Eventually I will need to run this in parallel to extract and combine many FASTA files.

With the conda ncbi_datasets package, how do I filter the download so that I obtain only the protein.faa file? Is there a general --exclude option, or some other way to obtain only the protein sequences?

Secondly, is --refseq the correct database to search, and how does it compare to UniParc? Would it be better to obtain the taxids with ncbi_datasets and then download the proteomes through the UniProt API? I do not know how to proceed. My code and output are below.

Here is the datasets code I have been using, starting with the proteome(s) for a single OTU:

datasets download genome taxon "Bacteroides thetaiotaomicron" --exclude-gff3 --exclude-rna --exclude-seq --exclude-genomic-cds --refseq --reference --dehydrated --filename Bacteriodesthetaiomicron.zip
unzip Bacteriodesthetaiomicron.zip -d Bacteriodesthetaiomicron
datasets rehydrate --directory Bacteriodesthetaiomicron

Here is the result:

Found 2 files for rehydration
Completed 2 of 2 [================================================] 100%
Downloading: Bacteriodesthetaiomicron/ncbi_dataset/data/GCF_014131755.1/protein.faa    2.23MB done
Downloading: Bacteriodesthetaiomicron/ncbi_dataset/data/GCF_014131755.1/sequence_report.jsonl    424B done

(ncbi_datasets) % ls

Bacteriodesthetaiomicron/     Bacteriodesthetaiomicron.zip

(ncbi_datasets) % cd Bacteriodesthetaiomicron/

(ncbi_datasets)  Bacteriodesthetaiomicron % ls

README.md     ncbi_dataset/

(ncbi_datasets) Bacteriodesthetaiomicron % cd ncbi_dataset 

(ncbi_datasets) ncbi_dataset % ls

data/      fetch.txt

(ncbi_datasets) ncbi_dataset % cd data 

(ncbi_datasets) data % ls

GCF_014131755.1/            assembly_data_report.jsonl  dataset_catalog.json

(ncbi_datasets) data % cd GCF_014131755.1 

(ncbi_datasets) GCF_014131755.1 % ls

protein.faa            sequence_report.jsonl

genome datasets proteome command-line 16S • 1.0k views

updated 3.6 years ago by GenoMax 153k • written 3.6 years ago by katieostrouchov ▴ 30

score 2 · Accepted Answer · 2022-01-06

how would I filter to only obtain the protein.faa file?

Isn't that what you obtained above? The folder hierarchy the data is downloaded into is part of how datasets works.

Since the datasets tool gives every downloaded file the generic name protein.faa, you will want to rename the files using the instructions here: NCBI datasets bulk protein fasta download
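As a rough illustration of the renaming step (not necessarily what the linked thread does), the sketch below copies each protein.faa out of a rehydrated directory and names the copy after its accession folder, which is the only place the accession appears in the layout shown in the question. The function name collect_proteins and the output directory are made up for this example; only standard shell utilities are used.

```shell
#!/usr/bin/env bash
# Sketch: gather every protein.faa from a rehydrated datasets directory,
# naming each copy after its accession folder (e.g. GCF_014131755.1.faa).
# collect_proteins and its two arguments are illustrative names, not part
# of the datasets tool.
collect_proteins() {
    local src_dir="$1"   # directory that was passed to `datasets rehydrate --directory`
    local out_dir="$2"   # destination for the renamed FASTA files
    local faa accession

    mkdir -p "$out_dir"
    for faa in "$src_dir"/ncbi_dataset/data/*/protein.faa; do
        [ -e "$faa" ] || continue                     # skip if the glob matched nothing
        accession=$(basename "$(dirname "$faa")")     # folder name, e.g. GCF_014131755.1
        cp "$faa" "$out_dir/${accession}.faa"
    done
}

# Example call, using the directory name from the question:
# collect_proteins Bacteriodesthetaiomicron proteomes
```

Copying rather than moving leaves the original download intact, so the step can be rerun safely. The renamed files can then be concatenated or processed in parallel without overwriting one another.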

That thread (NCBI datasets bulk protein fasta download) also covers alternative tools to NCBI datasets.