Question

Multiple sequence alignments - parallelisation

0

15 months ago

Lada ▴ 30

I have a folder with approximately 3,000 FASTA files. Each FASTA file corresponds to one gene (orthogroup) and contains multiple sequences (orthologues from multiple species).

I want to do multiple sequence alignments in ClustalO or MUSCLE for each of these files (genes). I have a script that works for a SINGLE input FASTA file and creates one multiple sequence alignment.

Does anyone know how I can run these thousands of alignments in parallel, or at least submit the job with one script? I really don't want to do it manually for over 3,000 alignment jobs.

Tnx!

MSA phylotranscriptomics muscle clustal • 1.2k views

score 1 · Answer 1 · 2023-08-29

1

15 months ago

GenoMax 147k

Can you clarify if you have access to a high performance compute cluster or you are trying to do this with a standalone machine/server?

Creating multiple jobs via a simple for loop would be a relatively easy task. A generic example of how to do this is here: https://unix.stackexchange.com/questions/536867/looping-through-command-to-submit-several-jobs

Examples of for loops for bash: https://www.cyberciti.biz/faq/bash-for-loop/

Parallelization, where several jobs run at the same time, is a separate consideration. Depending on the hardware you have access to, you can submit the jobs via a job scheduler on an HPC cluster, or you can use GNU Parallel on a single server; see the post "Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them".
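
As a rough sketch of the single-server route: where GNU Parallel is not installed, xargs -P (present in both GNU and BSD xargs) can play a similar role. This swaps in xargs for GNU Parallel on purpose. The function name align_in_parallel, the JOBS default of 4, and the DRY_RUN switch are illustrative assumptions, and the muscle flags are simply the ones used further down this thread.

```shell
#!/bin/bash
# Sketch only: run one alignment per FASTA file, several at a time.
# align_in_parallel, JOBS and DRY_RUN are hypothetical names for this example.

align_in_parallel() {
    local dir=${1:-.} jobs=${JOBS:-4} runner=""
    # With DRY_RUN=1 (the default here) the commands are printed, not executed.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        runner="echo"
    fi
    # -print0 / -0 keep unusual file names intact; -P sets how many run at once.
    find "$dir" -maxdepth 1 -name '*.fa' -print0 |
        xargs -0 -P "$jobs" -I{} sh -c '
            in=$1
            out=${in%.fa}.aln.afa
            $2 muscle -align "$in" -output "$out"
        ' _ {} "$runner"
}

# Rough GNU Parallel equivalent, if it is installed:
#   parallel -j 4 'muscle -align {} -output {.}.aln.afa' ::: *.fa
```

Calling align_in_parallel . prints the commands for the current folder; DRY_RUN=0 align_in_parallel . would actually run them. The order of the output lines is not fixed, because the jobs finish independently.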

0

Hi GenoMax, and sorry for not clarifying my question. Yes, I am submitting the job(s) via a job scheduler (SGE), so I am an HPC user. :) I might have used the term parallelization wrongly; I meant it more in the sense of submitting one job (script) that produces multiple separate MSAs. I read about "for" loops in some previous posts, but I am not familiar with them, so I'll give it a try. Thank you!

ADD REPLY • link 15 months ago by Lada ▴ 30

2

Here is a very simple way of doing this. You will need to figure out how to enclose these commands for submission via SGE. The source files are named Sample_1.fa, Sample_2.fa etc in this example.

Note: echo is included so that the commands are only printed to the screen, not executed. basename strips the .fa extension from each file name obtained from the loop, so you can use the resulting name to create new output file names like this: ${name}.aln.afa.

$ for i in `ls -1 Sample*.fa`; do name=$(basename ${i} .fa); echo muscle -align ${name}.fa -output ${name}.aln.afa; done
muscle -align Sample_1.fa -output Sample_1.aln.afa
muscle -align Sample_2.fa -output Sample_2.aln.afa
muscle -align Sample_3.fa -output Sample_3.aln.afa
muscle -align Sample_4.fa -output Sample_4.aln.afa
muscle -align Sample_5.fa -output Sample_5.aln.afa
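
The same dry run can also be sketched with a plain glob and shell parameter expansion in place of ls and basename. This is an alternative spelling, not a correction of the one-liner above, and the function name print_muscle_commands is made up for this example. The glob hands the file names to the loop directly, which also keeps names containing spaces intact.

```shell
#!/bin/bash
# Sketch only: print one muscle command per Sample*.fa file in a directory.
# print_muscle_commands is a hypothetical helper name.

print_muscle_commands() {
    local dir=${1:-.} f name
    for f in "$dir"/Sample*.fa; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        name=${f##*/}             # drop the directory part
        name=${name%.fa}          # drop the .fa suffix, as basename does above
        echo muscle -align "${name}.fa" -output "${name}.aln.afa"
    done
}
```

Running print_muscle_commands . in the folder with the FASTA files prints the commands; removing echo inside the function would execute them.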

ADD REPLY • link 15 months ago by GenoMax 147k

0

Thank you so much for making this example! It makes sense to me and I got the general idea.

What does ls -1 mean? I know what ls stands for, but why -1?

I wrote this script and submitted the job from the same folder where the script and FASTA files (.fa extension) are. I changed the file name pattern, since my files are named, for example, OG0008990.fa, and added -threads $NSLOTS.

It looks like this:

#!/bin/bash
#$ -N muscle_SCO
#$ -pe *mpisingle 8
#$ -cwd
#$ -l memory=8

module load bioinfo/muscle/5.1

$ for i in `ls -1 OG*.fa`; do name=$(basename ${i} .fa); echo muscle -threads $NSLOTS -align ${name}.fa -output ${name}.aln.afa; done

But I got this error message; it seems that the do token is not recognised.

/opt/sge/default/spool/sl250s-gen8-06-05/job_scripts/1655191: line 11: syntax error near unexpected token `do'
/opt/sge/default/spool/sl250s-gen8-06-05/job_scripts/1655191: line 11: `$ for i in `ls -1 OG*.fa`; do name=$(basename ${i} .fa); echo muscle -threads $NSLOTS -align ${name}.fa -output ${name}.aln.afa; done'

I'll try to figure this out for SGE and post it here when I get it right, but this is a very good start! Thank you!
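
A note for anyone hitting the same message: the leading "$ " on the loop line is the interactive prompt from the earlier example, not part of the command. Inside a script, bash reads $ as a command name, so the do that follows the first semicolon arrives where no loop is open. The sketch below reproduces this with bash -n, which only parses and executes nothing. The helper name check_syntax and the temporary file are illustrative.

```shell
#!/bin/bash
# Sketch only: show that a copied prompt character breaks the loop syntax.
# check_syntax is a hypothetical helper name.

check_syntax() {
    # Write the given line to a temporary script and let bash parse it
    # without running it (-n).
    local script
    script=$(mktemp)
    printf '%s\n' "$1" > "$script"
    if bash -n "$script" 2>/dev/null; then
        echo "ok"
    else
        echo "syntax error"
    fi
    rm -f "$script"
}

check_syntax '$ for i in OG*.fa; do echo "$i"; done'   # prompt character copied in
check_syntax 'for i in OG*.fa; do echo "$i"; done'     # prompt character removed
```

The first call reports a syntax error and the second parses cleanly, so deleting the "$ " at the start of the line is enough to get past this particular error.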

ADD REPLY • link 15 months ago by Lada ▴ 30

0

I forgot to remove echo! Now it works perfectly fine! It will take some time but I already see in the stdout file that muscle is doing its job and I see my alignment files being created. Big thanks, you helped so much!
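
The loop runs the alignments one after another inside a single job. If separately scheduled tasks are wanted instead, an SGE array job is one possible route. What follows is a sketch under assumptions, not a script tested on this cluster: the task range 1-3000 has to match the real number of files, the function name nth_fasta is invented here, and the parallel environment, memory request, and module line from the script above would still need to be added. SGE sets SGE_TASK_ID to the index of each array task.

```shell
#!/bin/bash
#$ -N muscle_array
#$ -cwd
#$ -t 1-3000    # assumption: one task per FASTA file; adjust the upper bound

# Print the name of the n-th OG*.fa file (1-based) in the given directory.
# nth_fasta is a hypothetical helper name.
nth_fasta() {
    local n=$1 dir=${2:-.}
    local files=("$dir"/OG*.fa)
    echo "${files[n-1]}"
}

# Only act when running as an array task, where SGE_TASK_ID is a number.
if [[ ${SGE_TASK_ID:-} =~ ^[0-9]+$ ]]; then
    input=$(nth_fasta "$SGE_TASK_ID")
    # Skip task indices that have no matching file.
    if [ -f "$input" ]; then
        muscle -align "$input" -output "${input%.fa}.aln.afa"
    fi
fi
```

Submitted once with qsub, this would create one task per index, and each task aligns the file at that position in the sorted glob.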

ADD REPLY • link 15 months ago by Lada ▴ 30