Problems with Megahit wanting to overwrite the Input for the output directory
0
0
Entering edit mode
7 weeks ago
Jan • 0

I am doing a genome assembly for three samples with Megahit. I want Megahit to save the output files in the respective sample folders, which are stored in the variable PART.

R1_FILES=($(find ${PART} -name "*_R1_merged.fq.gz"))
R2_FILES=($(find ${PART} -name "*_R2_merged.fq.gz"))

megahit -1 "${R1_FILES[0]}" -2 "${R2_FILES[0]}" -o $PART --out-prefix $PART -m 60e9 -t 8

This does not work, because Megahit wants to overwrite the input folder deleting the files, while I just want to save the output in the same directory. According to the documentation Megahit should just create a new folder named "megahit_out" in each of the three sample folders. If I reference a new output folder it works, but then I can not tell which contig files belong to which sample and it gets very complicated in the following coding steps, because I want to keep working with the PART variable if possible.

metagenomics megahit slurm • 369 views
ADD COMMENT
0
Entering edit mode

"${R1_FILES[0]}" -2 "${R2_FILES[0]}"

hum... are you sure find will output the R1 and the R2 in the same order ? https://www.baeldung.com/linux/find-default-sorting-order

ADD REPLY
0
Entering edit mode

If I reference a new output folder it works, but then I can not tell which contig files belong to which sample and it gets very complicated in the following coding steps, because I want to keep working with the PART variable if possible.

how about just using

(...) -o $PART --out-prefix "${PART}.megahit" -m 60e9 -t 8

and then something like

mv  "${PART}.megahit"  ${PART}/megahit"

?

ADD REPLY
0
Entering edit mode

Thanks for your help first of all! I am a bit frustrated right now. Somehow the program is not grabbing the correct directories when I am using my variable (PART).

cd /working/directory

module load sratoolkit

module load megahit

PART=$(sed -n ${SLURM_ARRAY_TASK_ID}p < metagenomics.run)

cd $TMPDIR

R1_FILES=($(find ${PART} -name "*_R1_merged.fq.gz"))
R2_FILES=($(find ${PART} -name "*_R2_merged.fq.gz"))

megahit -1 "${R1_FILES[0]}" -2 "${R2_FILES[0]}" -o "${PART}"/ --out-prefix "${PART}.megahit" -m 60e9 -t 8

module load conda 

conda activate bbmap-39.09

sed 's/>.*/>'"${PART}".contig&/' "$PART/$PART.contigs.fa" > "$PART.contig.fasta"

module load prodigal

prodigal -i $PART.contig.fasta -o $PART.assembly.gff -f gff -a $PART.prot.fasta -d $PART.nucl.fasta -p meta

gzip *.gff

mv $TMPDIR/*fasta.gz /working/directory/contig

mv $TMPDIR/*gff.gz /working/directory/contig

The file "metagenomics.run" just references my 3 samples like this

sample1
sample2
sample3

When I run the script now, I get the following error message:

FileNotFoundError: [Errno 2] No such file or directory: '/working/directory/sample1/\n/working/directory/sample2/\n/working/directory/sample3/'

ADD REPLY
1
Entering edit mode

At the top of the script add :

set -u
set -e
set -o pipefail

set -x

( https://gist.github.com/mohanpedala/1e2ff5661761d3abd0385e8223e16425 ) and re-run.

Furthermore, this should be a workflow like Snakemake or nextflow.

ADD REPLY

Login before adding your answer.

Traffic: 1307 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6