I have about 30 GB (gzipped fastq) of small RNA (miRNA, etc.) sequencing data that I've been asked to align to the human genome (hg38), reporting for each read every position in the genome where it matches exactly. Bowtie2 has been my go-to mapper for short reads in the past, so I put together a pipeline something like this:
zcat Sample_1.fastq.gz | \
cutadapt -a TGGAATTCTCGGGTGCCAAGG -f fastq --overlap 15 \
--trimmed-only --minimum-length 10 - | \
bowtie2 -x ~/references/hg38/hg38-bt2/hg38 -U - -q --phred33 \
--end-to-end --very-sensitive -a --score-min C,0,-1 -L 10 \
picard-tools SortSam INPUT=/dev/stdin OUTPUT=Sample_1.bam;
picard-tools BuildBamIndex INPUT=Sample_1.bam OUTPUT=Sample_1.bam.bai
However, the Bowtie2 manual carries this warning about the -a
option for reporting all alignments for each read:
Some tools are designed with this reporting mode in mind. Bowtie 2 is not! For very large genomes, this mode is very slow.
Nevertheless, because I already had it set up, I decided to give it a try. Apparently, the statement is quite accurate, because one sample (about 300 MB before trimming) has already been running for 168 CPU-hours without finishing, which is at least an order of magnitude or two longer than I'd typically expect a 300 MB fastq file to require when looking for just one or a few alignments for each read. So, is there either another set of options for bowtie2 or, more likely, another aligner better suited for the task of quickly finding all exact matches for each read?
Try BBMap. You would want to use
ambig=all
option to get all matches. Play with other other parameters (idfilter
) for getting precise matches.This seems like a good option. Maybe you should convert this comment to an answer?