Question

NCBI datasets bulk protein fasta download

3.0 years ago

emawhitt ▴ 20

Hi,

I want to download protein fasta files for a set of bird species. I have the genome assembly accessions in a file. I feel like every time I need to bulk download fasta files I've forgotten how I did it last time and the databases have all changed their websites/interfaces. I used the NCBI databases command line to download the files. However, datasets gives each accession its own folder containing "protein.faa". What I want is a single folder with fasta files so I can then use them in OrthoFinder and other programmes. It's essentially useless to have a few hundred folders each containing a file with the same name. Does anyone know the best way to download these files (ideally a way that will stay the best way, so I can use it again in the future), or how to make use of the downloads from datasets? Thank you.

NCBI datasets • 3.0k views

ADD COMMENT • link updated 3.0 years ago by MirianT_NCBI ▴ 760 • written 3.0 years ago by emawhitt ▴ 20

Assuming you are on Linux, if all your downloaded directories are in /home/emawhitt/fasdls then simply run:

mv /home/emawhitt/fasdls/*/*.faa /home/emawhitt/fasdls

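This will move all the .faa files to /home/emawhitt/fasdls. Be aware that files sharing the same name (such as protein.faa in every sub-directory) will collide in the destination, so they need distinct names to all survive the move.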

ADD REPLY • link 3.0 years ago by Dunois ★ 2.8k

Thank you. Is there some way to rename the files too? All the files are called protein.faa.

ADD REPLY • link 3.0 years ago by emawhitt ▴ 20

You could look at:

https://stackoverflow.com/questions/16266930/how-to-rename-files-in-folders-to-foldername-using-batch-file
https://askubuntu.com/questions/746860/rename-a-file-to-parent-directorys-name-in-terminal
https://askubuntu.com/questions/759422/rename-files-adding-their-parent-folder-name

ADD REPLY • link 3.0 years ago by GenoMax 147k

This moves and renames the files in one step (it might be a bit slow, depending on how many files you have). Just replace /home/emawhitt/fasdls, the value currently assigned to MYPATH, with the path to the directory that contains all the protein.faa files (in their respective sub-directories).

MYPATH="/home/emawhitt/fasdls"; cd "${MYPATH}" && find . -mindepth 2 -maxdepth 2 -type f -name "*.faa" -exec sh -c 'DIR=$(basename "$(dirname "$1")"); mv "$1" "./${DIR}_protein.faa"' sh {} \;

MirianT_NCBI's solution below might be a bit faster though.
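If you would rather check the new names before anything is moved, the same move-and-rename can be sketched as a small shell function with a dry-run switch. This is only a sketch: flatten_faa is a made-up helper name (not part of any NCBI tool), and it assumes the layout described above, with each .faa file sitting one level below the directory you pass in.

```shell
#!/bin/sh
# Sketch only: flatten_faa is a hypothetical helper, not an NCBI command.
# Assumes the layout ROOT/<subdir>/*.faa.
# Usage: flatten_faa ROOT [--dry-run]
flatten_faa() {
    root=$1
    mode=${2:-}
    for f in "$root"/*/*.faa; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -f "$f" ] || continue
        dir=$(basename "$(dirname "$f")")      # parent folder name
        dest="$root/${dir}_$(basename "$f")"   # e.g. ROOT/<subdir>_protein.faa
        if [ "$mode" = "--dry-run" ]; then
            # Only print what would happen; nothing is moved.
            printf '%s -> %s\n' "$f" "$dest"
        elif [ -e "$dest" ]; then
            # Never overwrite a file that is already there.
            printf 'skipping %s: %s already exists\n' "$f" "$dest" >&2
        else
            mv "$f" "$dest"
        fi
    done
}
```

Running flatten_faa /home/emawhitt/fasdls --dry-run prints one "old -> new" line per file; calling it again without --dry-run performs the moves.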

ADD REPLY • link 3.0 years ago by Dunois ★ 2.8k

score 3 · Answer 1 · 2021-12-08

3.0 years ago

MirianT_NCBI ▴ 760

From the ncbi_dataset folder, you can run this one-liner:

mkdir -p proteins; for f in data/*/protein.faa; do out=$(echo "$f" | cut -f2 -d'/'); cp "$f" "proteins/${out}.faa"; done

This command creates a folder named proteins and copies each protein.faa file into it, naming each copy after its genome accession (the name of the folder it came from). Let me know if that's helpful. If you need something different, I'll be happy to help with that too.
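With a few hundred accessions it can also help to know which ones ended up without a protein file. Here is a rough sketch of the same copy-and-rename as a function that reports those cases. It assumes the data/<accession>/protein.faa layout used in the one-liner, and that some accession folders may lack protein.faa (for example, if an assembly has no annotation). collect_proteins is a made-up helper name, not a datasets command.

```shell
#!/bin/sh
# Sketch only: collect_proteins is a hypothetical helper, not part of datasets.
# Assumes the layout DATA_DIR/<accession>/protein.faa.
# Usage: collect_proteins DATA_DIR DEST_DIR
collect_proteins() {
    data_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir"
    for d in "$data_dir"/*/; do
        # If there are no sub-directories, the pattern stays literal; skip it.
        [ -d "$d" ] || continue
        acc=$(basename "$d")                   # folder name = accession
        if [ -f "$d/protein.faa" ]; then
            # Copy (not move), so the original download stays intact.
            cp "$d/protein.faa" "$dest_dir/$acc.faa"
        else
            # Report accessions with no protein file on stderr.
            printf 'no protein.faa for %s\n' "$acc" >&2
        fi
    done
}
```

From the ncbi_dataset folder, collect_proteins data proteins should give the same result as the one-liner, plus one message per accession that has no protein.faa.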

ADD COMMENT • link 3.0 years ago by MirianT_NCBI ▴ 760

This worked perfectly. Thank you!

ADD REPLY • link 3.0 years ago by emawhitt ▴ 20

score 0 · Answer 2 · 2021-12-08

3.0 years ago

GenoMax 147k

Not sure what "databases command line" you used, but consider downloading the data using the ncbi-genome-download tool (LINK) or genome_updater (LINK). This should avoid the "all files named protein.faa" issue.

ADD COMMENT • link 3.0 years ago by GenoMax 147k

Thank you, I will take a look at those links. I used the NCBI datasets command-line tool (LINK).

ADD REPLY • link 3.0 years ago by emawhitt ▴ 20