How To Blast Hundreds Of Fasta Files Using Local Database And Perl
2
3
Entering edit mode
11.1 years ago

I have to blast a zip file including hundreds of protein fasta files. Since it is impossible to blast them one by one, I plan to use perl to blast in a local database. I am new to perl and look for models online. My question is how to unite a unzip perl with a blast perl.

  1. Is $name in unzip.pl the result of unzipped fasta? what should I do at # Do something here ?
  2. $f in BlastList.pl is the query. How could I change it to connect result of unzip.pl.
  3. Is any other solution better than this one? Thank you

Here is unzip.pl .

my $zipfile = "fasta.zip";
my $u = new IO::Uncompress::Unzip $zipfile
or die "Cannot open $zipfile: $UnzipError";
my $status;
for ($status = 1; $status > 0; $status = $u->nextStream())
{
my $name = $u->getHeaderInfo()->{Name};
warn "Processing member $name\n" ;
my $buff;
while (($status = $u->read($buff)) > 0) {
**# Do something here**
}
last if $status < 0;
}
die "Error processing $zipfile: $!\n"
if $status < 0 ;

Here is the blast model using local database: BlastList.pl.

$DBNAME = "mydb";   
$dirtoget="C:/Blast";
opendir(IMD, $dirtoget) || die("Cannot open directory");
# delete the old "DosBlast.bat" 
$dosfile = "DosBlast.bat";
unlink($dosfile);
# Get the list of the new sequence files to blast
@thefiles= readdir(IMD);
closedir(IMD);
 # Create a new file "DosBlast.bat" 
 open(OUT,">DosBlast.bat") || die "cannot open file for writing: $!";
 foreach $f (@thefiles) 
 {
  unless ( ($f eq ".") || ($f eq "..")  || ($f eq "DosBlast.bat") || ($f eq "BlastList.pl") || ($f eq $DBNAME)||
 ($f eq "BlastOut") || ($f eq "blastall.exe") || ($f eq "formatdb.exe")|| 
 ($f eq "formatdb.log")|| ($f eq "ReadMe.doc")){ 
  @myarray = split(/\./,$f);   # Old file name
  $extension =".txt";         # This is the new file extension
  @newname=@myarray[0].$extension;
   print(OUT "blastp -db $DBNAME -query $f -out @newname -num_descriptions 1");
   } # end of unless
   } # end of foreach
perl blast+ • 7.6k views
ADD COMMENT
3
Entering edit mode

"Since it is impossible to blast them one by one" : why ?

ADD REPLY
0
Entering edit mode

From the looks of this code I think you are really lost. I don't mean that to be negative, but I'm being honest when I say you cannot improve this in any way except taking another approach. Try to simplify things. Take a step back and try to get the blast command working on a single file. When that is working, try to do a simple procedure on the zipped directory like counting the files. If you can accomplish those things, then it should be possible to put things together like Eric Normandeau suggested using a shell command. This way you can build up your command and not worry about also getting the Perl syntax right.

ADD REPLY
0
Entering edit mode

Looks like they are making a BAT file because they are in a windows environment. The perl script is just a way of generating that, which is what they will actually run. Its not a terrible idea, but it would probably just be simpler to, instead of writing out to a bat file, make a direct system call to blastp. But to be honest I haven't done this sort of thing in a Windows setting other than cygwin.

ADD REPLY
0
Entering edit mode

It's not a bad idea, but instead of dealing with two monsters at once, blast and perl, I think it would be advisable to just work out the blast steps since that is the goal. It looks like the Perl aspect of this task is really a stumbling block and it may be better to remove that from the equation for now. At least, until it's clear the sequences can be processed and blasted correctly. Another possibility on windows would be to use bioperl's run package to handle the blast and use bioperl to parse the results.

ADD REPLY
0
Entering edit mode

Thank you all for the suggestions. Frankly, I was lost when I met the problem. What I could do was to blast them one by one manually.

ADD REPLY
4
Entering edit mode
11.1 years ago

Like Pierre suggests, you certainly can blast all the files separately. Also, your approach seems way too convoluted. Why not simply use the blast program directly?

EDIT: These solutions will work under Linux or MacOSX but not Windows.

If you can unzip your sequence archive and have a folder full of fasta files, the following will blast all the files:

for i in *.fasta; do blastp -db /path/do/db -query $i -out $i.blasted; done

This supposes only one folder with all the files with a .fasta extension.

You can do better with gnu parallel if you want (here is Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them by its creator):

find . -iname "*.fasta" | parallel blastp -db /path/do/db -query {} -out {.}.blasted

This still supposes you have .fasta extensions (which you can change in the command) but will blast all sequence files in your directory, even if they are nested in subdirectories. (It may also suppose you don't have spaces in your directory or file names)

I'm sure someone here will know how to do the same without unziping your archive. (Could somebody show us how to do this?)

ADD COMMENT
0
Entering edit mode

This is the approach I would take although it looks like the user is in a Windows environment. Unless he is using cygwin I'm not sure how command-line blast may differ on a straight windows system. But it could probably be wrapped inside a perl script basically how you did above as well.

I think there is no way around unzipping the archive. Also it would be good to know whether the fasta files contain only a single sequence, or multiple sequences. If they are only a single sequence the OP is better off putting them in to one file and running a parallel BLAST from the getgo and parsing the results out as needed afterwards.

ADD REPLY
0
Entering edit mode

I have overlooked the OS in the question. My mistake. I don't know what the solution would be in Windows. If it has to be half as complicated as the OP has started to produce, and if he/she is likely to have to play with sequences like that in his/her project, I think it should be enough incentive for that person to start learning and using Linux for his project.

ADD REPLY
0
Entering edit mode

Yes. my os is windows. I tried the Eric's suggestions. Neither works. I unzipped my sequence archive and have a folder full of fasta files. If I don't use perl and use blastp directly, how blastp -query could recognized hundreds of fasta files in a folder?

By the way, I really think of changing a computer of Linux or Unix system. Any other suggestions?

ADD REPLY
0
Entering edit mode

Install and try Ubuntu or try a Mac computer. Ubuntu is one of the easiest Linux distributions to start with. It is not a limited Linux in any way, only simple to use. This should give you much more power to do what you are trying to do. It will require some learning though, but in the medium and long term it should improve your capacities greatly.

ADD REPLY
0
Entering edit mode

Aside from Eric's suggestion, another option is to run Ubuntu within a VM. This depends on how many resources you have on your system; if it's fairly decent then set up a VM (maybe using virtualbox).

You can also set up and run Ubuntu very quickly on Amazon EC2 (AWS). I thought someone out there had an AMI with everything set up on it, such as CloudBioLinux.

ADD REPLY
0
Entering edit mode

I don't see how a VM is 'aside from my suggestion', still counts as an install to me :) From my perspective of getting into AWS in the last year and not finding it transparent or straight forward, I suspect suggesting to someone with no programming or 'NIX experience to set up and run Ubuntu on EC2, could potentially sound scary.

ADD REPLY
0
Entering edit mode

Eric, apologies, but your suggestion re: Ubuntu above makes no mention of using a VM, and so at least to me this indicated a complete reinstallation or at the very least a dual-boot.

So, from my vantage point I'm not sure why the suggestion of jumping right in and installing Ubuntu over one's current OS wouldn't sound scary to someone who likely has little UNIX/scripting experience, yet using a VM or suggesting AWS (neither of which has a downside beyond some possible cost for AWS) is? Maybe I'm reading too much into this?

Just so it's clear, I'm really not disagreeing with you, I just think the options need to be clarified. I completely agree, they should try Ubuntu (or Fedora or <insert linux="" flavor="" here="">) out. However, based on all the modern options available, I don't think there is an immediate need to jump into the deep end of the pool until they test the waters a bit and feel comfortable enough to manage a full linux install on their own.

ADD REPLY
0
Entering edit mode

Hi Chris, no apologies are needed at all. I was just feeling mischievous and wanted to bend reality a bit by including installing Ununtu inside a VM in the bigger category of installing Ubuntu, just to pretend I had already made the suggestion. :) I agree that jumping into the Linux pool can be scary. The easiest thing to do I think is to use a bootable CD or even better USB stick. Even creating these booting media can be scary for some though. I do not advise replacing one's OS with Linux as a first move either, so it looks like we agree 100% here!

ADD REPLY
0
Entering edit mode

blastp will only handle one file at a time. However, my earlier comments asked about why the files need to be separate fasta files in the first place? Are there more than one sequence per fasta file? Why are they divided into multiple files in the first place if you need to run blast on all of the sequences against the same database? Sometimes there are very good reasons for this of course, but they might not need to be split up in to hundreds of files.

ADD REPLY
0
Entering edit mode
11.1 years ago
DG 7.3k

Out of curiosity are there multiple sequences per fasta file or just one? If there is no compelling reason for having them all as individual files you can put them into one large file and let blastp handle the parallelism itself and just run the blast. If you are using BioPerl you can use that to very easily parse your results on a query by query basis however you like.

ADD COMMENT
0
Entering edit mode

I am new to bioinformatics. How to use bioperl to solve the problem? Thank you.

ADD REPLY
2
Entering edit mode

If you are new to bioinformatics but have no scripting experience (e.g. perl, python, or ruby) or UNIX background then the various bio* won't help (I speak as one of the core developers of bioperl). If you are doing anything high-throughput and plan on doing this on your own, you need to look into taking some short course on the above. Might I recommend Software Carpentry?

ADD REPLY
0
Entering edit mode

I agree. Get your feet wet with tutorials on generic coding with Software Carpentry or Code Academy if you lack experience. Then perhaps try the BioPerl tutorials for processing different types of data. Parsing BLAST output is one of the beginning tutorials if I recall correctly.

ADD REPLY
0
Entering edit mode

Thank you. I 'll try.

ADD REPLY

Login before adding your answer.

Traffic: 4404 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6