If you've got a related genome you might be able to scaffold those 4M contigs and bring the count down a bit. Many of them probably carry only exons rather than full genes, so their utility will be poor. This tool might be useful - https://github.com/malonge/RagTag
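A rough sketch of the command, assuming RagTag is installed and that reference.fa is the related genome (all filenames here are placeholders):

    # Scaffold the fragmented contigs against a related reference genome
    ragtag.py scaffold -t 8 -o ragtag_output reference.fa contigs.fa
    # The scaffolded assembly is written into the ragtag_output/ directory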
In general, though, you might be able to find long-read data for this genotype, which would give you an assembly at least 100x better.... if not, then maybe next time...
You need to use the -p option in your bowtie2 command line to specify a number of threads matching the cores you are requesting. You will also want to ask for more memory explicitly if the default allocation on your cluster is low (using the #SBATCH --mem=NNg option).
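A minimal sketch of the relevant script lines, assuming 8 cores and 32 GB suit your data (both values are placeholders):

    #SBATCH --cpus-per-task=8   # cores requested from SLURM
    #SBATCH --mem=32g           # explicit memory request; size it to your genome

    bowtie2 -p 8 ...            # -p must match --cpus-per-task above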
4 million contigs
Yikes! That is a fragmented assembly.
There is no need to make a SAM file unless you have a specific reason. Pipe the bowtie2 output directly into samtools to produce a sorted, indexed BAM file.
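Something along these lines, with the index and read filenames as placeholders:

    # Align and pipe straight to a coordinate-sorted BAM; no SAM hits the disk
    bowtie2 -p 8 -x genome_index -1 reads_1.fq.gz -2 reads_2.fq.gz \
        | samtools sort -@ 8 -o aligned.sorted.bam -
    samtools index aligned.sorted.bam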
You might want to run seff on your job ID after it has finished. Some SLURM configurations require you to prefix your command with some variant of srun; without srun the job won't use all the CPUs you've requested, leading to slow jobs. That might not be the case on your cluster, though. seff will tell you what percentage of your requested CPUs was actually used.
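For example:

    # After the job finishes (1234567 is a placeholder job ID):
    seff 1234567
    # Check the "CPU Efficiency" line; a value well below 100% suggests
    # the job did not actually use all the cores you requested.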