Hi,
I am trying to download reference proteomes for a list of amplicon OTUs in order to generate pseudo-metagenomes. I have tried two tools, ncbi_datasets and edirect, but have not managed to download only the reference proteome sequences. Eventually I will need to run this in parallel to extract and combine many FASTA files.
Regarding the conda ncbi_datasets package, how can I filter the download so that I only obtain the protein.faa file? Is there a general --exclude option for this command, or some other way to obtain only the protein sequences?
Secondly, is --refseq the correct database to search, and how does it compare to UniParc? Would it be better to obtain the taxids with ncbi_datasets and then download the proteomes through the UniProt API? I do not know how to proceed. Below is my code and output.
Here is the example datasets code I have been using, starting with the proteome(s) of a single OTU:
datasets download genome taxon "Bacteroides thetaiotaomicron" --exclude-gff3 --exclude-rna --exclude-seq --exclude-genomic-cds --refseq --reference --dehydrated --filename Bacteriodesthetaiomicron.zip
unzip Bacteriodesthetaiomicron.zip -d Bacteriodesthetaiomicron
datasets rehydrate --directory Bacteriodesthetaiomicron
Here is the result:
Found 2 files for rehydration
Completed 2 of 2 [================================================] 100%
Downloading: Bacteriodesthetaiomicron/ncbi_dataset/data/GCF_014131755.1/protein.faa 2.23MB done
Downloading: Bacteriodesthetaiomicron/ncbi_dataset/data/GCF_014131755.1/sequence_report.jsonl 424B done
(ncbi_datasets) % ls
Bacteriodesthetaiomicron/ Bacteriodesthetaiomicron.zip
(ncbi_datasets) % cd Bacteriodesthetaiomicron/
(ncbi_datasets) Bacteriodesthetaiomicron % ls
README.md ncbi_dataset/
(ncbi_datasets) Bacteriodesthetaiomicron % cd ncbi_dataset
(ncbi_datasets) ncbi_dataset % ls
data/ fetch.txt
(ncbi_datasets) ncbi_dataset % cd data
(ncbi_datasets) data % ls
GCF_014131755.1/ assembly_data_report.jsonl dataset_catalog.json
(ncbi_datasets) data % cd GCF_014131755.1
(ncbi_datasets) GCF_014131755.1 % ls
protein.faa sequence_report.jsonl
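For the eventual batch step, here is a minimal sketch of how the single-taxon run above might be scaled up. It is not a tested pipeline. The function names `print_download_cmds` and `combine_proteins`, the input file of taxon names (one per line), and the output file name are all hypothetical. The datasets flags are copied unchanged from the command above, so they are assumed to match the installed datasets version. The second function assumes every rehydrated package has the same `ncbi_dataset/data/<accession>/protein.faa` layout shown in the listing.

```shell
# Dry run: print one download/unzip/rehydrate command line per taxon.
# Nothing is downloaded here; flags are copied from the single-taxon example.
print_download_cmds() (
    list_file="$1"   # hypothetical input: one taxon name per line
    while IFS= read -r taxon || [ -n "$taxon" ]; do
        [ -z "$taxon" ] && continue                      # skip blank lines
        name=$(printf '%s' "$taxon" | tr ' ' '_')        # spaces -> underscores for file names
        printf 'datasets download genome taxon "%s" --exclude-gff3 --exclude-rna --exclude-seq --exclude-genomic-cds --refseq --reference --dehydrated --filename %s.zip && unzip %s.zip -d %s && datasets rehydrate --directory %s\n' \
            "$taxon" "$name" "$name" "$name" "$name"
    done < "$list_file"
)

# Concatenate every protein.faa found under <package>/ncbi_dataset/data/<accession>/
# into one FASTA, prefixing each header with the assembly accession so that
# sequences stay traceable to their source genome.
combine_proteins() (
    pkg_dir="$1"     # e.g. Bacteriodesthetaiomicron
    out_file="$2"    # hypothetical output name, e.g. combined_proteins.faa
    : > "$out_file"  # create or truncate the output file
    for faa in "$pkg_dir"/ncbi_dataset/data/*/protein.faa; do
        [ -f "$faa" ] || continue                        # no match: glob stays literal
        acc=$(basename "$(dirname "$faa")")              # directory name is the accession
        sed "s/^>/>${acc}|/" "$faa" >> "$out_file"
    done
)
```

The printed command lines can be inspected first and then run sequentially by piping them to `sh`, or handed to a job runner such as GNU parallel, which I understand executes each input line as a command when no command is given. Calling `combine_proteins Bacteriodesthetaiomicron combined_proteins.faa` on the package above would then give a single FASTA with headers starting with `>GCF_014131755.1|`.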