How to de novo assemble a large number of bacterial genomes with SPAdes on Linux
4.8 years ago
haomingju • 0

Hi, I am new to sequencing data analysis. When I have the fastq files for just one bacterium, I know how to assemble them with SPAdes, for example: "spades.py --pe1-1 name.fq.gz --pe1-2 name.fq.gz -o spades_test". But I don't know how to handle a large number of samples with one Linux command. For example, when I have 10 samples (name1~name10), I would rather not assemble them one by one by hand. Can you tell me how to do this? Thanks!

assembly • 1.6k views

Type "bash for loop" into Google.


Take a look at bash for loops.

Just putting these commands in a loop is not going to make them run any faster, though. If you have access to a cluster, you could use a for loop to submit 10 parallel SPAdes jobs; otherwise they will run one after the other. A sketch of such a submission loop follows.
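
A minimal sketch, assuming a SLURM scheduler and reads named name<N>_1.fq.gz / name<N>_2.fq.gz for the name1~name10 samples in the question (both assumptions; adjust resources and file names to your setup):

#!/bin/bash
# Submit one SPAdes job per sample to SLURM; each job gets 8 CPUs and 32 GB RAM.
for n in $(seq 1 10); do
    sbatch --job-name="spades_name${n}" --cpus-per-task=8 --mem=32G \
        --wrap="spades.py -1 name${n}_1.fq.gz -2 name${n}_2.fq.gz -t 8 -o spades_name${n}"
done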


Do you have access to an HPC or computing cluster? You should build up your skills and use submission scripts or pipelines to manage this; a sketch of a submission script follows.
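
For illustration, a minimal SLURM job-array script (SLURM itself and the read file names are assumptions here; adapt both to your cluster and data):

#!/bin/bash
#SBATCH --array=1-10          # one array task per sample
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# Each array task assembles one sample, assuming reads named name<N>_1.fq.gz / name<N>_2.fq.gz.
n=${SLURM_ARRAY_TASK_ID}
spades.py -1 "name${n}_1.fq.gz" -2 "name${n}_2.fq.gz" -t 8 -o "spades_name${n}"

Submitted once with sbatch, this runs the ten assemblies in parallel as resources allow.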


"spades.py --pe1-1 name.fq.gz --pe1-2 name.fq.gz -o spades_test" — I guess this says that you have paired-end reads. With a single library you can use -1 and -2 directly. There is also a problem with the naming in the OP: --pe1-1 and --pe1-2 point to the same file, but the forward and reverse reads should be two different files (e.g. name_1.fq.gz and name_2.fq.gz).

4.8 years ago
the_cowa ▴ 40

You can do it with the help of a shell script:

#!/bin/bash
# Assemble every paired-end sample found in a fastq directory.
# Assumes reads are named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz;
# adjust the directory and suffixes to match your own naming scheme.

fol="your_fastq_directory/"

for r1 in "$fol"*_R1.fastq.gz; do
    sample=$(basename "$r1" _R1.fastq.gz)   # strip path and suffix to get the sample name
    r2="${fol}${sample}_R2.fastq.gz"
    echo "Assembling $sample"
    spades.py -1 "$r1" -2 "$r2" -o "${fol}${sample}.out"
done

Although this solution works, it should be avoided. As someone who has spent a lot of time doing such things, I can assure you that you will have to run this command more than once (a lot more, actually): with different parameters, with different datasets, maybe combining two samples (have I removed adapters?), you get the idea. You'll end up hacking this bash script in some unknown location, unsure which version of it you used to generate the results, and when you write your manuscript you'll avoid sharing the code because it's, well, I'll say it: ugly.

What should you do? Make your results disposable. Save the input in a well-documented, backed-up location and use pipelines to run the analysis. You can either use flowcraft for metagenomics assembly or craft your own. I can't stress this enough: learn how to use pipeline management systems like WDL, Nextflow, or Snakemake. Choose one; it doesn't really matter which. A sketch of what this buys you is below.
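
Once a pipeline exists, rerunning every sample with different settings collapses to a single, recorded shell command. An illustrative sketch (the pipeline name and parameters are assumptions; check the documentation of whichever pipeline you choose):

# hypothetical invocation of a published Nextflow assembly pipeline
nextflow run nf-core/bacass --input samples.csv --outdir results -profile docker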
