Question

Problem setting up the UniProt database for Diamond BLAST

3.9 years ago

slin023 • 0

Hello, I have run into a problem setting up the UniProt database for Diamond BLAST. I am following the commands from the tutorial [https://blobtoolkit.genomehubs.org/install/, https://github.com/blobtoolkit/blobtools2/issues/6], but they do not seem to work for me:

# extract and concatenate protein FASTA files
touch reference_proteomes.fasta.gz
find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | xargs cat >> reference_proteomes.fasta.gz

I did not receive any error message, but the command just creates a 0-byte "reference_proteomes.fasta.gz", so the output file is empty (see the screenshot): [screenshot]

Here is what the list of "*.fasta.gz" files looks like in the "uniprot" folder: [screenshot]

Any suggestions on how to revise this command based on the file names?

find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | xargs cat >> reference_proteomes.fasta.gz

Please let me know, thank you!

genome Assembly blobtool • 2.1k views

ADD COMMENT • link 3.9 years ago by slin023 • 0

Did you run these commands successfully beforehand?

mkdir -p uniprot

wget -q -O uniprot/reference_proteomes.tar.gz \
  ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/$(curl \
    -vs ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/ 2>&1 | \
    awk '/tar.gz/ {print $9}')

cd uniprot

tar xf reference_proteomes.tar.gz

Are you inside the uniprot folder, and are you using a Linux command line?

ADD REPLY • link 3.9 years ago by antonioggsousa 3.2k

Yes, this runs successfully. The file names you see in the picture are from after I ran tar xf reference_proteomes.tar.gz. The other commands also work for me:

echo "accession\taccession.version\ttaxid\tgi" > reference_proteomes.taxid_map

zcat */*.idmapping.gz | grep "NCBI_TaxID" | awk '{print $1 "\t" $1 "\t" $3 "\t" 0}' >> reference_proteomes.taxid_map
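As a side note on the taxid map step, here is a minimal sketch of the same idea on made-up data. It assumes the three-column, tab-separated idmapping layout (accession, ID type, ID) and a single level of kingdom subfolders; the accession, the folder name, and the scratch directory are hypothetical. It uses printf for the header because bash's builtin echo prints "\t" literally unless it is given -e, and gzip -dc in place of zcat for portability.

```shell
# Scratch directory with one hypothetical idmapping file (names are made up).
demo=$(mktemp -d)
mkdir -p "$demo/Eukaryota"
cd "$demo"
printf 'P12345\tNCBI_TaxID\t9606\nP12345\tGeneID\t1234\n' \
  | gzip > Eukaryota/UP000005640_9606.idmapping.gz

# Write the header with real tab characters.
printf 'accession\taccession.version\ttaxid\tgi\n' > reference_proteomes.taxid_map

# Keep only rows whose second column is NCBI_TaxID and emit the four map columns.
gzip -dc */*.idmapping.gz \
  | awk -F'\t' -v OFS='\t' '$2 == "NCBI_TaxID" {print $1, $1, $3, 0}' \
  >> reference_proteomes.taxid_map

cat reference_proteomes.taxid_map
```

Matching on the second column, rather than grepping the whole line, avoids picking up rows that merely contain the string elsewhere.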

ADD REPLY • link 3.9 years ago by slin023 • 0

Run the following command:

find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | wc -l

This will tell you whether the pipeline finds any files to cat, and how many.
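The same check can be extended to count what survives each stage of the pipeline, which shows which filter is dropping the files. This is only a sketch on a scratch directory: the folder name Eukaryota and the single-level layout are assumptions, and the variable names are hypothetical.

```shell
# Scratch directory with a hypothetical one-level layout.
demo=$(mktemp -d)
mkdir -p "$demo/Eukaryota"
touch "$demo/Eukaryota/UP000326979_1803180.fasta.gz" \
      "$demo/Eukaryota/UP000326979_1803180_DNA.fasta.gz" \
      "$demo/Eukaryota/UP000326979_1803180.idmapping.gz"
cd "$demo"

# Count the entries that remain after each stage.
n_all=$(find . -mindepth 2 | wc -l)
n_fasta=$(find . -mindepth 2 | grep "fasta.gz" | wc -l)
n_protein=$(find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | wc -l)

echo "all: $n_all, fasta.gz: $n_fasta, protein only: $n_protein"
```

If the first count is already 0, the problem lies with find itself (for example the starting directory or the depth limit) rather than with the grep filters.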

ADD REPLY • link 3.9 years ago by antonioggsousa 3.2k

I ran it, and it prints "0".

ADD REPLY • link 3.9 years ago by slin023 • 0

So that is your problem: the pipeline is not finding any fasta.gz files without DNA or additional in their names.

If you look in the subfolders of the uniprot folder (-mindepth 2 only returns entries that are at least one subfolder below the current directory), do you see fasta.gz files without DNA or additional in their names?

From your screenshot it looks like you do, so I don't know why the command is failing...

Run:

find . -mindepth 2 | grep "fasta.gz" | head
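One situation in which this pipeline returns nothing is when the FASTA files sit directly in the current directory rather than in a subfolder, because -mindepth 2 skips everything at depth 1. The sketch below reproduces that on a scratch directory; the folder name and variable names are hypothetical, and whether this is what happened in any given setup is an assumption to verify.

```shell
# Scratch directory: files directly in the top level, plus one empty subfolder.
demo=$(mktemp -d)
cd "$demo"
mkdir -p Eukaryota
touch UP000326979_1803180.fasta.gz UP000326979_1803180_DNA.fasta.gz

# -mindepth 2 ignores files that are directly in the current directory.
deep=$(find . -mindepth 2 | grep "fasta.gz" | wc -l)

# -mindepth 1 includes them, so the protein FASTA file is found.
shallow=$(find . -mindepth 1 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | wc -l)

echo "mindepth 2: $deep, mindepth 1: $shallow"
```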

ADD REPLY • link 3.9 years ago by antonioggsousa 3.2k

Those subfolders are all empty. I moved all of the files (at least 50-60k) out of the subfolders. And yes, there are some fasta.gz files without DNA or additional in their names. Take the files for UP000326979_1803180 as an example:

UP000326979_1803180_DNA.fasta.gz
UP000326979_1803180.fasta.gz
UP000326979_1803180.gene2acc.gz
UP000326979_1803180.idmapping.gz

However, I tried your command and it seems to be working now; the output is no longer empty: [screenshot]. Thank you very much for your help!
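For a flattened layout like the one described here, where the files sit directly in the working directory, a variant of the concatenation step could look like the sketch below. It is not the tutorial's command: the proteome IDs and sequences are made up, and the output file is explicitly excluded because it lives in the same directory and would otherwise match "*.fasta.gz" itself. Concatenated gzip files form a valid multi-member gzip stream, so the result can be decompressed as one file.

```shell
# Scratch directory with three small hypothetical gzip'd FASTA files.
demo=$(mktemp -d)
cd "$demo"
printf '>sp|P1\nMKT\n' | gzip > UP000000001_1.fasta.gz
printf '>sp|P2\nMAA\n' | gzip > UP000000002_2.fasta.gz
printf '>dna\nACGT\n'  | gzip > UP000000001_1_DNA.fasta.gz

out=reference_proteomes.fasta.gz
rm -f "$out"

# Select protein FASTA files in this directory only, skipping DNA, additional,
# and the output file itself; -exec ... + batches long file lists automatically.
find . -maxdepth 1 -type f -name '*.fasta.gz' \
  ! -name '*DNA*' ! -name '*additional*' ! -name "$out" \
  -exec cat {} + > "$out"

# Count the sequences in the combined file.
n_seqs=$(gzip -dc "$out" | grep -c '>')
echo "sequences: $n_seqs"
```

Using find's own -name tests instead of a grep chain also keeps file names with unusual characters intact, since nothing is passed through xargs.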

ADD REPLY • link 3.9 years ago by slin023 • 0