The challenge is to use unix pipes (|) to process it from the compressed files all the way to makeblastdb, via sed and/or awk.
I'll post my one-liner once it is done.
written 9.3 years ago by Eliad • updated 2.1 years ago by Ram
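# Purpose:
#   Read a file in SP (Swiss-Prot) format, write it in FASTA format.
#
# Usage:
#   sp_to_fasta SP_file > FASTA_file
use strict;
use warnings;
use IO::File;
use SWISS::Entry;

# First command-line argument is the input file (scalar access, not a slice).
my $inputfile = $ARGV[0];
my $fh = IO::File->new($inputfile, 'r')
    or die "Cannot open input file $inputfile: $!";

# Read one whole entry at a time: entries are terminated by a "//" line.
$/ = "\n//";
while (<$fh>) {
    s/\r//g;                               # drop Windows carriage returns
    (my $entry_txt = $_) =~ s/^\s+//;      # strip the newline left over from the previous entry
    next unless $entry_txt;                # skip the empty chunk after the last "//"
    $entry_txt .= "\n";
    my $entry = SWISS::Entry->fromText($entry_txt);
    print $entry->toFasta();
}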
You can also try splitting your download, for instance fetching the entries in groups of 500:
numentries=XXXX # Run the query on the website first to see the number of entries
for i in $(seq 0 500 "$numentries"); do
    wget -O "uniprot_$i.fasta" "http://www.uniprot.org/uniprot/?sort=score&desc=&compress=no&query=taxonomy:bacteria&fil=&limit=500&force=no&preview=true&format=fasta&offset=$i"
done
To create a query, just do a search on the UniProt website, click on download->preview and copy the URL.
Thanks, but these are not genbank files.
These are UniProt Knowledgebase database files as described here: http://web.expasy.org/docs/userman.html#convent
I think I'll just parse these myself.
I had a look. They are basically genbank files. Here's what I came up with. Line begins with "ID" - print the second column. Line begins with space - print the entire line. Ignore all other lines. Delete spaces.
It would be more elegant to delete the spaces inside the awk command itself, but whatever.
You might also prefer the "AC" (accession) lines over the "ID" lines as the basis for the FASTA headers. I don't know.
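The original awk command from this comment is not preserved here. As a rough sketch of the rule it describes (not the poster's actual one-liner), something like the following should work, assuming the standard UniProtKB flat-file layout in which sequence lines are indented with spaces. The function name `sp_to_fasta` is only an illustrative choice, and here the spaces are deleted inside awk rather than in a separate step.

```shell
# Sketch only: convert UniProtKB flat-file text on stdin to FASTA on stdout.
# Assumes "ID" lines carry the entry name in column 2 and that only
# sequence lines begin with a space.
sp_to_fasta() {
  awk '
    /^ID / { print ">" $2; next }   # entry name becomes the FASTA header
    /^ /   { gsub(/ /, ""); print } # sequence line: drop all spaces, then print
  '
}
```

It could then sit in the middle of a pipe, for example `gunzip -c uniprot_sprot.dat.gz | sp_to_fasta > uniprot_sprot.fasta` (the file name is a placeholder). Whether the output can be piped straight into makeblastdb, for instance via `-in -` together with explicit `-title` and `-out` options, depends on your BLAST+ version, so treat that part as an assumption to check.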
Thanks!
I polished it a little:
So the whole thing looks like this as a shell script:
Newbie question: How do I accept your second comment as the answer?