How to run a batch of genome assemblies in one go?
8.7 years ago
jerrybug109 ▴ 20

Hi All,

I'm trying to assemble several dozen prokaryotic genomes using SPAdes. My inputs are paired-end Illumina reads (2x125). I've learned how to use the software but am unfamiliar with programming: when it comes to bioinformatics, I just know basic Unix commands and how to navigate and manipulate files and directories on my university's Linux server.

The command in SPAdes I use for a single genome assembly is:

spades.py --careful -1 my_forward.fastq.gz -2 my_reverse.fastq.gz -o /my/output/directory

It seems time-consuming to run each genome assembly one by one. Is there a way to run the entire set of separate genome assemblies in one go, to save time and trouble? Do I need to know Python scripting? I would appreciate your input, thank you!

genome Assembly genomics software • 11k views

Hello,

I am trying to combine multiple read files into one big assembly with SPAdes, so that I get just one scaffold file.

The responses above are helpful for multiple assemblies, but I am just aiming for one.

I appreciate any suggestions that I can get, thanks!


Merge the paired read files into two files.

Given many paired-end read files:

A_1.fq.gz   A_2.fq.gz  B_1.fq.gz  B_2.fq.gz ...

Merge them:

gzip -d -c  *_1.fq.gz | gzip -c > merged_R1.fq.gz
gzip -d -c  *_2.fq.gz | gzip -c > merged_R2.fq.gz

The output names deliberately do not end in _1.fq.gz or _2.fq.gz, so the merged files can never be picked up by the globs themselves, for example on a re-run.

PS: replace gzip with pigz if you have it installed; it is much faster.

PS2: gzip -d -c is equivalent to zcat.

PS3: if your files are already decompressed, just run cat *_1.fq > merged_R1.fq
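
One sanity check worth doing after merging: both merged files should hold the same number of reads, otherwise the pairs are out of sync. A minimal sketch, assuming standard four-line FASTQ records; count_reads and check_pairs are made-up helper names, not part of any tool:

```shell
# count_reads FILE.gz
# Prints the number of reads in a gzipped FASTQ file.
# Assumes every record is exactly four lines (no wrapped sequences).
count_reads() {
    local lines
    lines=$(gzip -d -c "$1" | wc -l)
    echo $(( lines / 4 ))
}

# check_pairs FORWARD.gz REVERSE.gz
# Compares the read counts of the two files and reports the result.
check_pairs() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 read pairs"
    else
        echo "MISMATCH: $n1 forward vs $n2 reverse reads" >&2
        return 1
    fi
}
```

Call check_pairs with your merged forward file first and your merged reverse file second. Equal counts do not prove the reads are in the same order, but unequal counts are a sure sign something went wrong.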


@shenwei356, thanks a lot for your help! :)


You should probably ask this as a separate question rather than as an answer in another thread.


@jrj.healey, noted! This is my first time posting, thanks. :)

8.7 years ago

SPAdes uses 16 threads by default (according to the manual). Are you running it with that default? Is your university server a distributed system (PBS, SLURM, etc.) or just one big server? If you want to run all of your SPAdes jobs on one big shared server, you might want to talk to the system administrators first; they'll get angry if you block the entire thing for days.

If it's one big server and you're OK to go, there are several ways to do it. You can send jobs to the background, one for each assembly, for example in bash:

spades.py --careful -1 my_forward.fastq.gz -2 my_reverse.fastq.gz -o /my/output/directory &

spades.py --careful -1 my_forward2.fastq.gz -2 my_reverse2.fastq.gz -o /my/output/directory2 &

(notice the &)

Then with the "jobs" command you can see all running jobs. With "fg 1", "fg 2", etc. you can bring them back to the foreground, and with CTRL+Z followed by "bg" you can send them to the background again.
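
If you start the jobs from a script instead of typing them interactively, the shell built-in "wait" does the bookkeeping for you: it blocks until a background job finishes and hands back its exit status. A minimal sketch, in which fake_assembly_ok and fake_assembly_fail are made-up stand-ins for your real spades.py commands:

```shell
# Stand-ins for two assemblies; swap in real spades.py commands.
fake_assembly_ok()   { sleep 1; return 0; }
fake_assembly_fail() { sleep 1; return 3; }

# Start both in the background and remember their process IDs.
fake_assembly_ok &
pid1=$!
fake_assembly_fail &
pid2=$!

# wait blocks until the given job finishes and returns its exit status.
status1=0
status2=0
wait "$pid1" || status1=$?
wait "$pid2" || status2=$?

echo "job 1 exit status: $status1"
echo "job 2 exit status: $status2"
```

A non-zero status tells you which assembly needs a second look without scrolling through all the output.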

You can also use a for-loop to start all jobs at once:

for file1 in *R1*fastq
do
    file2=${file1/R1/R2}
    out=${file1%%.fastq}_output
    spades.py --careful -1 "$file1" -2 "$file2" -o "$out" &
done

This will iterate over all files containing "R1" and ending in "fastq", get the name of the second file by replacing R1 with R2, and write the output to a path based on the R1 file name, with ".fastq" cut off and "_output" added.
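
The files in the question end in .fastq.gz, so the pattern and the suffix need a small adjustment for them. To see what the two substitutions do, here they are in bash on a single made-up file name (sampleA_R1.fastq.gz is an assumption, not a file from the question):

```shell
# Hypothetical file name, used only to show the substitutions.
file1="sampleA_R1.fastq.gz"

# ${var/pattern/replacement} replaces the first match of "R1" with "R2".
file2=${file1/R1/R2}

# ${var%%pattern} removes the longest match of the pattern from the end.
out=${file1%%.fastq.gz}_output

echo "$file2"   # sampleA_R2.fastq.gz
echo "$out"     # sampleA_R1_output
```

Only the first "R1" in the name is replaced, so if a sample name itself contains "R1", matching on "_R1" instead is a bit safer. For gzipped reads the loop pattern would become *R1*fastq.gz.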

That's the easiest way, but the jobs stay attached to your current session, so you can't simply log out while they run. For that, have a look at the "screen" command.
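
If "screen" is not available on your server, nohup is another common option: in most setups it lets a job keep running after you log out. A sketch, where run_detached is a made-up helper name and the log file is whatever name you pass in:

```shell
# run_detached LOGFILE COMMAND [ARGS...]
# Runs the command immune to hangups, in the background, with all output
# (stdout and stderr) written to LOGFILE. Prints the job's process ID.
run_detached() {
    local log=$1
    shift
    nohup "$@" > "$log" 2>&1 &
    echo $!
}
```

For example: run_detached assembly1.log spades.py --careful -1 my_forward.fastq.gz -2 my_reverse.fastq.gz -o /my/output/directory. The printed process ID can be checked later with "ps", and the log file shows how far the assembly has come.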


Ah you bring up a good reminder for me - our uni has two servers, one that's shared and one that's not. I'm currently on the shared so I'll see if I can work something out. Thanks for the examples! I'll have a go at it given the chance.


Your first example says how to run those two jobs simultaneously in the background, correct?

If I want to run a series of jobs sequentially instead of simultaneously, would this do it:

( job1 ; job2) &

or more specifically:

(spades.py --careful -1 my_forward.fastq.gz -2 my_reverse.fastq.gz -o /my/output/directory ; spades.py --careful -1 my_forward2.fastq.gz -2 my_reverse2.fastq.gz -o /my/output/directory2) &

It seems that I can avoid the potential to hog the server if I just let things run one by one instead of simultaneously. Thanks!


Yes, if you add "&", each job runs in the background, so they all run at the same time, possibly overloading your server.

If you want to run them sequentially, you can do it your way with ";". Try this example:

echo "hi" ; sleep 2; echo "hello again"

This will print "hi", then sleep for 2 seconds, then print "hello again".

Side note: you can also use "&&":

echo "hi" && sleep 2 && echo "hello again"

This stops the chain as soon as one of the commands returns an error.
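
The difference only shows up when something fails. A small sketch with a made-up helper, step, that prints its name and then returns whatever exit status you give it, standing in for an assembly that succeeds (0) or fails (non-zero):

```shell
# step NAME STATUS: announce the step, then return the given exit status.
step() { echo "running $1"; return "$2"; }

# ";" keeps going after a failure: both steps run.
semicolon_out=$(step assembly_A 1; step assembly_B 0)

# "&&" stops at the first failure: assembly_B never runs.
and_out=$(step assembly_A 1 && step assembly_B 0) || echo "chain stopped early"

echo "with ';' :"
echo "$semicolon_out"
echo "with '&&':"
echo "$and_out"
```

For independent assemblies ";" is usually what you want, since one failed genome should not block the rest. "&&" fits better when a later command needs the output of the one before it.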

You can also run the for loop above without the "&" if you don't want to spell out every command:

for file1 in *R1*fastq
do
    file2=${file1/R1/R2}
    out=${file1%%.fastq}_output
    spades.py --careful -1 "$file1" -2 "$file2" -o "$out"
done

To test it first, put echo in front of the command; this only prints each command without running it:

for file1 in *R1*fastq
do
    file2=${file1/R1/R2}
    out=${file1%%.fastq}_output
    echo spades.py --careful -1 "$file1" -2 "$file2" -o "$out"
done

It's easier to put that into a bash script and run it with "bash run_all_assemblies.sh".
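
A sketch of what run_all_assemblies.sh could contain. The file naming (*_R1.fastq.gz and *_R2.fastq.gz), the DRY_RUN switch and the function name are assumptions for illustration; only the spades.py options come from the question. By default it just prints the commands, so nothing is assembled until you set DRY_RUN=0:

```shell
# run_all_assemblies DIR
# Runs one SPAdes assembly per *_R1.fastq.gz file in DIR, one after another.
# DRY_RUN=1 (the default here) only prints the commands.
run_all_assemblies() {
    local dir=${1:-.} file1 file2 out
    for file1 in "$dir"/*_R1.fastq.gz; do
        # If the glob matched nothing, the literal pattern is left over.
        [ -e "$file1" ] || continue
        file2=${file1%_R1.fastq.gz}_R2.fastq.gz
        out=${file1%_R1.fastq.gz}_output
        if [ ! -e "$file2" ]; then
            echo "skipping $file1: no matching R2 file" >&2
            continue
        fi
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo spades.py --careful -1 "$file1" -2 "$file2" -o "$out"
        else
            spades.py --careful -1 "$file1" -2 "$file2" -o "$out"
        fi
    done
}
```

Add a final line that calls the function on your reads directory, check the printed commands, and then run it for real with DRY_RUN=0 bash run_all_assemblies.sh.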


This is a good idea, thank you.


Probably also worth pointing out: even if you're the sole user of your server, you are still limited by its number of cores. For example, if your server is a 32-core machine and you launch 3 instances of SPAdes with 16 threads each (48 threads in total), all 3 will run slowly as they fight for CPU time, and it will probably end up slower than running the 3 sequentially, assuming it completes at all.
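
To put numbers on that: the number of assemblies worth running side by side is roughly the core count divided by the threads each one uses (SPAdes takes the thread count via -t). A small sketch; max_parallel_jobs is a made-up helper, and nproc (GNU coreutils) is only consulted when you don't pass the core count yourself:

```shell
# max_parallel_jobs THREADS_PER_JOB [CORES]
# Prints how many jobs fit on the machine without oversubscribing it
# (always at least 1). CORES defaults to what nproc reports.
max_parallel_jobs() {
    local threads=$1
    local cores=${2:-$(nproc)}
    local jobs=$(( cores / threads ))
    [ "$jobs" -ge 1 ] || jobs=1
    echo "$jobs"
}

max_parallel_jobs 16 32   # prints 2
max_parallel_jobs 16 8    # prints 1
```

On the 32-core example that means two 16-thread assemblies at a time, or four if each is started with -t 8. This only accounts for CPU cores, not memory.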


@Philipp Bayer Thank you for your example. I was able to apply this to another program (UPARSE fastq_mergepairs) using a directory of over 700 files. This saved me a lot of time.
