Align Sequentially All Files In A Protein Directory
3
1
Entering edit mode
12.8 years ago
Louis ▴ 50

Thank you for reading my question. I am attempting to create multiple MUSCLE alignments for thousands of fasta in a directory.

The input would be: vbro1.fasta, vbro2.fasta... vbro6405.fasta; where a given file commonly contains 20 or more proteins.

The output would be: vbro1.afa, vbro2.afa... vbro6405.afa

My first thought was a for loop in bash like this. My second thought was to run this in biopython's MUSCLE wrapper. Either way, I'm concerned that the job is too big and that I'll need to write a script that will interate through the multifasta in more manageable pieces.

for i in outdir; do echo "muscle -in file${i} -out ${i}.afa -maxiters 1 -diags1 -sv" done

I would greatly appreciate some assistance if someone could point me in the right direction.

multiple • 5.5k views
ADD COMMENT
5
Entering edit mode
12.8 years ago
Andreas ★ 2.5k

As an alternative, here a bash solution, based on your initial attempt:

for f in $(ls your-input-dir/vbro*.fasta); do
  muscle -in $f -out ${f%.fasta}.afa -maxiters 1 -diags1 -sv;
done

If you want to run things in parallel, then you could simply echo the muscle commands, redirect into a file and then use xargs to use the amount of processors you like. For example if your Muscle commands are listed in cmd.txt and you want to run four jobs in parallel:

cat cmd.txt | xargs -I cmd --max-procs=4 bash -c cmd &

Andreas

ADD COMMENT
1
Entering edit mode

Nice! I like to see bash solutions. It is possible to avoid the creation of the command file: find . -name "vbro.fasta" -type f -print0 | xargs -0 --max-procs=4 -L1 -I FILE muscle -in FILE -out FILE.afa -maxiters 1 -diags1 -sv Caveat: output file names are in .fasta.afa

ADD REPLY
0
Entering edit mode

Also and excellent solution that works very well! And thank you for showing me how to run the jobs in parallel.

ADD REPLY
0
Entering edit mode

I must say I learned new stuff about bash thanks to you.

ADD REPLY
3
Entering edit mode
12.8 years ago
Darked89 4.7k

Maybe this:

#!/usr/bin/env python

import glob, os

muscle_options = " -maxiters 1 -diags1 -sv "

for fasta_file in glob.glob("vbro*.fasta"):
    afa_file = fasta_file[:-5] + "afa"
    command = "muscle -in %s -out %s %s" % (fasta_file, afa_file, muscle_options)
    print "executing: " + command
    os.system(command)

Muscle is quite fast. If you need to split your muscle jobs you may tweak "vbro.fasta" and submit 10 jobs ("vbro0.fasta", "vbro*1.fasta" etc.) as a quick and dirty solution.

ADD COMMENT
0
Entering edit mode

Thank you! Python's glob module is totally new to me. I ran this on a subset of 10 fasta and it worked like a charm. I'm inspecting those alignments now and if all look tight I'll run your script on the entire data set. This is especially useful bc I can use versions of this as I move forward TrimAl and RAxML.

ADD REPLY
0
Entering edit mode
6.9 years ago
ATCG ▴ 400

Hi, I have 366 RNAseq samples, reads for each sample are located in individual folders with the following name structure:SAME-DIFFERENT. I would like to write a loop to align all samples using the STAR aligner. Any help is deeply appreciated.

ADD COMMENT

Login before adding your answer.

Traffic: 1794 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6