What do I need to change in my looped Unix code to run the Perl batch reference-proteome download script from https://www.uniprot.org/help/api_downloading? I am getting "zsh: parse error near `done'".
I am also not sure whether anything needs to change in the Perl code, or whether saving it as a .pl file with the TextEdit application was the right approach.
I will be downloading from a list with thousands of taxids in the future. This is a small example I have tried with no success:
cat > taxids.txt
226186
345219
FILE=taxids.txt
while read line: do
perl apidownload.pl $line
done <$FILE
Not quite sure what this response means. Were there 500 results? There should only have been 1 reference. When I search Proteomes on UniProt and type 226186 or 345219, it finds the species.
I used the "Download the UniProt reference proteomes for all organisms below a given taxonomy node in compressed FASTA format" Perl example.
My output after running the code above:
Failed, got 500 Can't verify SSL peers without knowing which Certificate Authorities to trust for https://www.uniprot.org/proteomes/?query=reference:yes+taxonomy:taxonomy:226186&format=list
Failed, got 500 Can't verify SSL peers without knowing which Certificate Authorities to trust for https://www.uniprot.org/proteomes/?query=reference:yes+taxonomy:taxonomy:345219&format=list
It looks like I need to install the Mozilla::CA certificates on the command line to fix "500 Can't verify SSL peers without knowing which Certificate Authorities to trust". Are these trusted certificates?
Does the UniProt Perl example use EDirect? If so, it looks like I will need an API key, and in the future I will need to request permission for very high volumes, close to 5,000 downloads rather than 10. This is where an API key is set up to gain access: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
Sorry, yes: you need to provide the full query URL. Try:
perl apidownload.pl "https://www.uniprot.org/proteomes/?query=reference:yes+taxonomy:226186"
and install whatever Perl modules are required.
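As for the zsh parse error: in `while read line: do`, the colon after `read line` should be a semicolon. A minimal corrected loop (using the two taxids from your example; this creates the list file itself so it can be run as-is):

```shell
FILE=taxids.txt
printf '226186\n345219\n' > "$FILE"   # same two taxids as in the example above

while read line; do                   # ';' before 'do' -- the ':' is what zsh rejects
  perl apidownload.pl "$line"         # quote the variable in case of stray whitespace
done < "$FILE"
```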
After installing certificates with sudo cpan Mozilla::CA, the following script was successful:
FILE=taxids.txt
while read line; do
mkdir ./${line}
perl apidownload.pl $line > ./${line}
done <$FILE
However, the output from running the .pl script does not end up in the ./${line} folder. What would I need to change? Modifying the Perl script like so did not do the trick:
my $OutputDir = './ARGV[1]';
for my $proteome (split(/\n/, $response_list->content)) {
  my $file = ./$ARGV[1] $proteome . '.fasta.gz';
How about changing the shell script like this:
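A sketch of such a change, assuming apidownload.pl writes its .fasta.gz files into the current working directory: cd into each taxid's folder in a subshell before running the script (note the script is then referenced from the parent directory):

```shell
FILE=taxids.txt
printf '226186\n345219\n' > "$FILE"   # example list from the thread

while read line; do
  mkdir -p "./${line}"                                  # -p: no error if the folder already exists
  ( cd "./${line}" && perl ../apidownload.pl "$line" )  # subshell, so we return to the parent dir
done < "$FILE"
```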
Thank you! This helped me arrive at a solution.
The full query, "https://www.uniprot.org/proteomes/?query=reference:yes+taxonomy:${line}", did not work for the search and returned "Failed, got 400 Bad Request".
However, using $line as the argument was sufficient to obtain the reference proteomes, and only produced a "Redundant argument in sprintf" warning from the Perl code.
For future reference, the following bash code can be used to create individual folders named by the corresponding taxonomy name (which may contain spaces) or taxid, and download the reference proteome(s) from UniProt directly into those folders.
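A sketch of that pattern (not the exact script; it assumes, as above, that apidownload.pl sits in the parent directory and saves into the current one; the quotes and `IFS= read -r` keep folder names with spaces intact, and the list entries here are only placeholders):

```shell
#!/bin/bash
# taxids.txt: one taxid or taxonomy name (names may contain spaces) per line
FILE=taxids.txt
printf '226186\nEscherichia coli\n' > "$FILE"   # hypothetical example entries

while IFS= read -r line; do                     # IFS=/-r: read each line verbatim
  mkdir -p "./${line}"                          # quotes preserve names containing spaces
  ( cd "./${line}" && perl ../apidownload.pl "$line" )
done < "$FILE"
```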
For a full tutorial on how to perform this task, please visit https://github.com/kostrouc/Bioinformatics_Tutorials/