Hello Good People,
I am trying to do a short read alignment of next-generation sequencing reads from human samples. The samples were sequenced in the Illumina machine, the reads are paired-end and the length is 25bp in each read. Each sample has several million to billion reads for each end.
My goal is to perform short-read alignment against the GRCh38 reference genome.
01. Attempt with bwa-mem
At first, I used the bwa mem
command and the output seemed somewhat okay. Especially the insert sizes in the bam file were in the expected range. However, the mapping quality of almost all reads was zero (MAPQ = 0).
02. Attempt with bwa-sampe
From the github page of the bwa tool, I found that, for Illumina paired-end reads shorter than ~70bp it is recommended to use bwa sampe
along with bwa aln
. So, I followed the recommendation and ran a single sample on a high-performance cluster. While bwa aln
has the multi-threading option, the bwa sampe
runs only on a single thread. This constraint makes the command awfully slow and it is taking forever to run on just one sample.
03. Parallelization attempt by splitting reads with seqkit split2
So I made an attempt to run the command on multiple threads. In this case, I chose a relatively smaller genome and sequencing data of saccharomyces cerevisiae, just to do a quick check.
I split the fastq files with seqkit split2
command and used the GNU parallel
command to parallelize the alignment part by bwa sampe
. Then I merged all the bam files with the samtools merge
command. For comparison, I also generated a bam file by the bwa mem
command.
Unfortunately, I found the output of bwa mem
and bwa sampe
very different.
04. Attempt with bowtie2
tool
Then I also tried the bowtie2
tool with default options. But the output is quite different from bwa mem
. Though mapping quality increased but the insert sizes are very different from our expectation. I believe, maybe by optimizing the options bowtie2
can be used for very short read lengths. But, the options seem quite difficult to comprehend.
At this point, my queries are:
- Is splitting fastq files by the
samtools split2
command a reliable option? - Can you recommend to me the optimum combination of options ( or related articles) of the
bowtie2
tool for the alignment of 25-50 bp read length? - What are other reliable tools for the alignment of DNA reads with very short lengths (25-50bp)?
Sorry, for the very long post. I wanted to explain the context before asking the question. Thanks in advance.
bowtie aka bowtie1 was developed when these very short reads were state of the art. Id give that a try and see what comes out. Multithreaded, will scale.
use bowtie1 for short reads https://sourceforge.net/projects/bowtie-bio/files/bowtie/1.3.1/