Question

Align Sequentially All Files In A Protein Directory

1

Entering edit mode

12.8 years ago

Louis ▴ 50

Thank you for reading my question. I am attempting to create multiple MUSCLE alignments for thousands of fasta in a directory.

The input would be: vbro1.fasta, vbro2.fasta... vbro6405.fasta; where a given file commonly contains 20 or more proteins.

The output would be: vbro1.afa, vbro2.afa... vbro6405.afa

My first thought was a for loop in bash like this. My second thought was to run this in biopython's MUSCLE wrapper. Either way, I'm concerned that the job is too big and that I'll need to write a script that will interate through the multifasta in more manageable pieces.

for i in outdir; do echo "muscle -in file${i} -out ${i}.afa -maxiters 1 -diags1 -sv" done

I would greatly appreciate some assistance if someone could point me in the right direction.

multiple • 5.5k views

ADD COMMENT • link updated 6.9 years ago by ATCG ▴ 400 • written 12.8 years ago by Louis ▴ 50

score 5 · Answer 1 · 2012-02-07

5

Entering edit mode

12.8 years ago

Andreas ★ 2.5k

As an alternative, here a bash solution, based on your initial attempt:

for f in $(ls your-input-dir/vbro*.fasta); do
  muscle -in $f -out ${f%.fasta}.afa -maxiters 1 -diags1 -sv;
done

If you want to run things in parallel, then you could simply echo the muscle commands, redirect into a file and then use xargs to use the amount of processors you like. For example if your Muscle commands are listed in cmd.txt and you want to run four jobs in parallel:

cat cmd.txt | xargs -I cmd --max-procs=4 bash -c cmd &

Andreas

ADD COMMENT • link 12.8 years ago by Andreas ★ 2.5k

1

Entering edit mode

Nice! I like to see bash solutions. It is possible to avoid the creation of the command file: find . -name "vbro.fasta" -type f -print0 | xargs -0 --max-procs=4 -L1 -I FILE muscle -in FILE -out FILE.afa -maxiters 1 -diags1 -sv Caveat: output file names are in .fasta.afa

ADD REPLY • link 12.8 years ago by Frédéric Mahé ★ 3.2k

0

Entering edit mode

Also and excellent solution that works very well! And thank you for showing me how to run the jobs in parallel.

ADD REPLY • link 12.8 years ago by Louis ▴ 50

0

Entering edit mode

I must say I learned new stuff about bash thanks to you.

ADD REPLY • link 12.8 years ago by Darked89 4.7k

score 3 · Answer 2 · 2012-02-06

3

Entering edit mode

12.8 years ago

Darked89 4.7k

Maybe this:

#!/usr/bin/env python

import glob, os

muscle_options = " -maxiters 1 -diags1 -sv "

for fasta_file in glob.glob("vbro*.fasta"):
    afa_file = fasta_file[:-5] + "afa"
    command = "muscle -in %s -out %s %s" % (fasta_file, afa_file, muscle_options)
    print "executing: " + command
    os.system(command)

Muscle is quite fast. If you need to split your muscle jobs you may tweak "vbro.fasta" and submit 10 jobs ("vbro0.fasta", "vbro*1.fasta" etc.) as a quick and dirty solution.

ADD COMMENT • link 12.8 years ago by Darked89 4.7k

0

Entering edit mode

Thank you! Python's glob module is totally new to me. I ran this on a subset of 10 fasta and it worked like a charm. I'm inspecting those alignments now and if all look tight I'll run your script on the entire data set. This is especially useful bc I can use versions of this as I move forward TrimAl and RAxML.

ADD REPLY • link 12.8 years ago by Louis ▴ 50

score 0 · Answer 3 · 2018-01-09

0

Entering edit mode

6.9 years ago

ATCG ▴ 400

Hi, I have 366 RNAseq samples, reads for each sample are located in individual folders with the following name structure:SAME-DIFFERENT. I would like to write a loop to align all samples using the STAR aligner. Any help is deeply appreciated.

ADD COMMENT • link 6.9 years ago by ATCG ▴ 400