Question

Megahit wants to overwrite the input directory when it is also used as the output directory

0

7 weeks ago

Jan • 0

I am doing a genome assembly for three samples with Megahit. I want Megahit to save the output files in the respective sample folders, which are stored in the variable PART.

R1_FILES=($(find ${PART} -name "*_R1_merged.fq.gz"))
R2_FILES=($(find ${PART} -name "*_R2_merged.fq.gz"))

megahit -1 "${R1_FILES[0]}" -2 "${R2_FILES[0]}" -o $PART --out-prefix $PART -m 60e9 -t 8

This does not work: Megahit wants to overwrite the input folder, which would delete the input files, whereas I just want the output saved in the same directory. According to the documentation, Megahit should simply create a new folder named "megahit_out" in each of the three sample folders. If I reference a new output folder it works, but then I cannot tell which contig files belong to which sample, and it gets very complicated in the following coding steps, because I want to keep working with the PART variable if possible.

metagenomics megahit slurm • 366 views

ADD COMMENT • link updated 7 weeks ago by Pierre Lindenbaum 164k • written 7 weeks ago by Jan • 0
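
One way to keep each assembly inside its sample folder is to point `-o` at a subdirectory that does not exist yet. As far as I know, MEGAHIT refuses to start when the directory given to `-o` already exists, and it writes the contigs to `OUT_DIR/OUT_PREFIX.contigs.fa`. The following is only a sketch under those assumptions: the temp directory stands in for a real sample folder, the read file names are hypothetical, and the command is printed rather than executed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for a real sample folder (hypothetical layout).
WORK=$(mktemp -d)
PART="${WORK}/sample1"
mkdir -p "${PART}"

# Output goes into a subdirectory of the sample folder that must not exist yet.
OUT_DIR="${PART}/megahit_out"
# --out-prefix is a file-name prefix, not a path, so strip the directory part.
PREFIX=$(basename "${PART}")

if [ -e "${OUT_DIR}" ]; then
    echo "Refusing to reuse existing ${OUT_DIR}" >&2
    exit 1
fi

# Build the command as an array so paths with spaces stay intact.
# The read file names below are placeholders.
CMD=(megahit -1 "${PART}/reads_R1_merged.fq.gz" -2 "${PART}/reads_R2_merged.fq.gz"
     -o "${OUT_DIR}" --out-prefix "${PREFIX}" -m 60e9 -t 8)

# Dry run: print the command; remove 'echo' to actually launch it.
echo "${CMD[@]}"
```

With this layout the contigs for each sample would end up under that sample's own folder, so later steps can still be driven by PART alone.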

0

"${R1_FILES[0]}" -2 "${R2_FILES[0]}"

Hmm... are you sure find will output the R1 and the R2 files in the same order? https://www.baeldung.com/linux/find-default-sorting-order

ADD REPLY • link 7 weeks ago by Pierre Lindenbaum 164k
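
To sidestep the ordering question, the R2 path can be derived from the R1 path instead of coming from a second, independently ordered `find`. A minimal sketch, assuming each sample has mates that differ only in the `_R1_merged` / `_R2_merged` part of the name; the temp directory and the `a_` file names are placeholders:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in sample folder with empty placeholder read files (hypothetical names).
WORK=$(mktemp -d)
PART="${WORK}/sample1"
mkdir -p "${PART}"
touch "${PART}/a_R1_merged.fq.gz" "${PART}/a_R2_merged.fq.gz"

# Take the first R1 file in sorted order so the choice is reproducible.
R1=$(find "${PART}" -name '*_R1_merged.fq.gz' | sort | sed -n '1p')

# Derive the mate's name from R1 rather than trusting the order of a second find.
R2="${R1%_R1_merged.fq.gz}_R2_merged.fq.gz"

# Fail early if the mate is missing.
[ -f "${R2}" ] || { echo "Missing mate for ${R1}" >&2; exit 1; }

echo "R1=${R1}"
echo "R2=${R2}"
```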

0

If I reference a new output folder it works, but then I cannot tell which contig files belong to which sample, and it gets very complicated in the following coding steps, because I want to keep working with the PART variable if possible.

How about just using

(...) -o $PART --out-prefix "${PART}.megahit" -m 60e9 -t 8

and then something like

mv "${PART}.megahit" "${PART}/megahit"

?

ADD REPLY • link 7 weeks ago by Pierre Lindenbaum 164k
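
The "write elsewhere, then move" idea above can be sketched as follows. This assumes the assembler is given a fresh sibling directory named `${PART}.megahit` as its output directory; `fake_megahit` is a hypothetical stand-in that only creates the directory and an empty contig file, so the sketch runs without MEGAHIT installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

WORK=$(mktemp -d)
cd "${WORK}"
PART="sample1"
mkdir -p "${PART}"

# Hypothetical stand-in for the assembler: creates the output directory and
# an empty contig file, as a real run would be expected to.
fake_megahit() {
    local out_dir=$1 prefix=$2
    mkdir "${out_dir}"          # fails if the directory already exists
    : > "${out_dir}/${prefix}.contigs.fa"
}

# Write to a fresh sibling directory first ...
fake_megahit "${PART}.megahit" "${PART}"

# ... then move the finished result into the sample folder.
mv "${PART}.megahit" "${PART}/megahit"

ls "${PART}/megahit"
```

The move only happens after the assembly step has returned successfully, so a failed run never leaves a half-written directory inside the sample folder.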

0

First of all, thanks for your help! I am a bit frustrated right now. Somehow the program does not pick up the correct directories when I use my variable (PART).

cd /working/directory

module load sratoolkit

module load megahit

PART=$(sed -n ${SLURM_ARRAY_TASK_ID}p < metagenomics.run)

cd $TMPDIR

R1_FILES=($(find ${PART} -name "*_R1_merged.fq.gz"))
R2_FILES=($(find ${PART} -name "*_R2_merged.fq.gz"))

megahit -1 "${R1_FILES[0]}" -2 "${R2_FILES[0]}" -o "${PART}"/ --out-prefix "${PART}.megahit" -m 60e9 -t 8

module load conda 

conda activate bbmap-39.09

sed 's/>.*/>'"${PART}"'.contig&/' "$PART/$PART.contigs.fa" > "$PART.contig.fasta"

module load prodigal

prodigal -i $PART.contig.fasta -o $PART.assembly.gff -f gff -a $PART.prot.fasta -d $PART.nucl.fasta -p meta

gzip *.gff

mv $TMPDIR/*fasta.gz /working/directory/contig

mv $TMPDIR/*gff.gz /working/directory/contig

The file "metagenomics.run" just lists my 3 samples, like this:

sample1
sample2
sample3

When I run the script now, I get the following error message:

FileNotFoundError: [Errno 2] No such file or directory: '/working/directory/sample1/\n/working/directory/sample2/\n/working/directory/sample3/'

ADD REPLY • link 7 weeks ago by Jan • 0
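
The path in the error message contains all three entries joined by newlines, which is what PART would hold if `SLURM_ARRAY_TASK_ID` were empty: `sed -n p` prints every line of the file. Whether that is the cause here is an assumption, since the submission command is not shown. A small sketch that reproduces the effect with a temporary copy of the sample list:

```shell
#!/usr/bin/env bash

# Stand-in for metagenomics.run (contents copied from the post).
RUN_FILE=$(mktemp)
printf 'sample1\nsample2\nsample3\n' > "${RUN_FILE}"

# Simulate a job that was not submitted as an array.
unset SLURM_ARRAY_TASK_ID

# With the variable empty, the sed script is just 'p', which prints every line.
ALL=$(sed -n "${SLURM_ARRAY_TASK_ID}p" < "${RUN_FILE}")
echo "lines selected without a task ID: $(printf '%s\n' "${ALL}" | wc -l)"

# With a task ID set, exactly one line is selected.
SLURM_ARRAY_TASK_ID=2
ONE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" < "${RUN_FILE}")
echo "line selected for task 2: ${ONE}"
```

Writing `${SLURM_ARRAY_TASK_ID:?}` in the real script would make bash stop with an error message when the variable is unset or empty, instead of silently selecting every sample.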

1

At the top of the script, add:

set -u
set -e
set -o pipefail

set -x

(see https://gist.github.com/mohanpedala/1e2ff5661761d3abd0385e8223e16425 ), then re-run.

Furthermore, this should be a workflow written with a tool like Snakemake or Nextflow.

ADD REPLY • link 7 weeks ago by Pierre Lindenbaum 164k
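
To illustrate what the suggested flags change, the sketch below runs three tiny child shells and records their exit statuses. `SAMPLE_ID` is a hypothetical variable name used only for the demonstration.

```shell
#!/usr/bin/env bash

# Without 'set -u', an unset variable silently expands to an empty string.
LOOSE_STATUS=0
bash -c 'unset SAMPLE_ID; echo "value: [${SAMPLE_ID}]"' || LOOSE_STATUS=$?

# With 'set -u', the same reference aborts the shell with a non-zero status.
STRICT_STATUS=0
bash -c 'set -u; unset SAMPLE_ID; echo "value: [${SAMPLE_ID}]"' 2>/dev/null || STRICT_STATUS=$?

# With 'set -o pipefail', a failure early in a pipeline is no longer masked
# by a successful last command.
PIPE_STATUS=0
bash -c 'set -o pipefail; false | cat' || PIPE_STATUS=$?

echo "loose=${LOOSE_STATUS} strict=${STRICT_STATUS} pipefail=${PIPE_STATUS}"
```

Adding `set -x` on top of these prints each command after expansion, which makes it easy to see what a variable such as PART actually contains when the assembler is invoked.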