Hello, I have run into a problem building the UniProt database for Diamond BLAST. I have a question about these commands from the tutorial [https://blobtoolkit.genomehubs.org/install/, https://github.com/blobtoolkit/blobtools2/issues/6], which do not seem to be working for me:
# extract and concatenate protein FASTA files
touch reference_proteomes.fasta.gz
find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | xargs cat >> reference_proteomes.fasta.gz
I did not receive any error message, but the command just creates a 0-byte "reference_proteomes.fasta.gz", so the file is effectively empty (see the screencap).
Here is what the list of "*.fasta.gz" files looks like in the "uniprot" folder:
Any suggestions for how to revise this command based on the file names?
find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | xargs cat >> reference_proteomes.fasta.gz
Please let me know, thank you!
Did you run these commands successfully before:
Are you inside the "uniprot" folder, and are you using the Linux command line?

Yes, this runs successfully. The file names you see in the picture are from after I ran
tar xf reference_proteomes.tar.gz
The other commands also work for me:
echo "accession\taccession.version\ttaxid\tgi" > reference_proteomes.taxid_map
zcat */*.idmapping.gz | grep "NCBI_TaxID" | awk '{print $1 "\t" $1 "\t" $3 "\t" 0}' >> reference_proteomes.taxid_map

Run the following command:
find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | wc -l
This will tell you whether you are finding any files to cat next, and how many.

I typed it, and it shows "0".

So, that is your problem. You are not finding any fasta files without "DNA" or "additional" in their names. The command uses -mindepth 2, so it only considers files at least two levels below the current directory, that is, inside the subfolders of the "uniprot" folder. If you search in those subfolders, do you see "fasta.gz" files without "DNA" or "additional" in their names? From the screenshot it seems to me that you do, so I don't know why the command is failing...
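
One possible reason for a count of 0 is that the files sit directly in the current folder rather than inside subfolders, because -mindepth 2 skips anything at depth 1. The sketch below illustrates this in a throwaway directory; the folder name "Eukaryota" and the two file names are made up for the demonstration and are not taken from the real UniProt archive.

```shell
# Throwaway demo directory; the file and folder names are made up.
demo=$(mktemp -d)
mkdir -p "$demo/Eukaryota"
touch "$demo/UP000000001_1.fasta.gz"            # depth 1: directly in the start folder
touch "$demo/Eukaryota/UP000000002_2.fasta.gz"  # depth 2: inside a subfolder

# Same pipeline as in the thread, counted with two different -mindepth values.
# tr strips the padding that some wc implementations add.
deep=$(cd "$demo" && find . -mindepth 2 | grep "fasta.gz" | wc -l | tr -d ' ')
all=$(cd "$demo" && find . -mindepth 1 | grep "fasta.gz" | wc -l | tr -d ' ')
echo "mindepth 2 finds $deep, mindepth 1 finds $all"
```

If the second count is larger than the first on the real data, the files are in the top-level folder and the depth option is what hides them.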
Do:
find . -mindepth 2 | grep "fasta.gz" | head

Those subfolders are all empty. I had moved all of the files (at least 50~60k) out of the subfolders. And yes, there are some "fasta.gz" files without "DNA" or "additional" in their names. Take the UP000326979_1803180_ tax ID as an example:

However, I tried your command and it seems to be working now; the file is not empty anymore. Thank you very much for your help!
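
For anyone whose files have likewise been moved out of the subfolders, a variant of the concatenation command that searches only the current folder might look like the sketch below. This is an assumption about the flattened layout described above, not the tutorial's command: the demo file names and contents are invented, and the extra exclusion of "reference_proteomes.fasta.gz" is added here so that cat never reads the file it is writing. Note that -name tests match only the file name, whereas the grep -v filters in the original command match anywhere in the path.

```shell
# Hypothetical demo data; real UniProt file names and contents will differ.
work=$(mktemp -d)
printf '>p1\nMKV\n' | gzip > "$work/UP000000001_1.fasta.gz"
printf '>p2\nMLA\n' | gzip > "$work/UP000000002_2.fasta.gz"
printf '>d1\nACGT\n' | gzip > "$work/UP000000001_1_DNA.fasta.gz"
printf '>x1\nMQQ\n' | gzip > "$work/UP000000001_1_additional.fasta.gz"

(
  cd "$work" || exit 1
  # -maxdepth 1 keeps the search in the current folder only.
  # The output file is excluded so it is never used as its own input.
  find . -maxdepth 1 -type f -name '*.fasta.gz' \
    ! -name '*DNA*' ! -name '*additional*' \
    ! -name 'reference_proteomes.fasta.gz' -print0 |
    xargs -0 cat > reference_proteomes.fasta.gz
)

# Concatenated gzip members decompress as one continuous stream.
n_records=$(gzip -dc "$work/reference_proteomes.fasta.gz" | grep -c '^>')
echo "protein records: $n_records"
```

Using -print0 with xargs -0 keeps the pipeline safe for unusual file names, and xargs splits a very long file list across several cat calls that all write to the same output.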