Alignment tool for very short reads (25bp)
2
3
Entering edit mode
5 months ago

Hello Good People,

I am trying to do a short read alignment of next-generation sequencing reads from human samples. The samples were sequenced in the Illumina machine, the reads are paired-end and the length is 25bp in each read. Each sample has several million to billion reads for each end.

My goal is to perform short-read alignment against the GRCh38 reference genome.

01. Attempt with bwa-mem

At first, I used the bwa mem command and the output seemed somewhat okay. Especially the insert sizes in the bam file were in the expected range. However, the mapping quality of almost all reads was zero (MAPQ = 0).

02. Attempt with bwa-sampe

From the github page of the bwa tool, I found that, for Illumina paired-end reads shorter than ~70bp it is recommended to use bwa sampe along with bwa aln. So, I followed the recommendation and ran a single sample on a high-performance cluster. While bwa aln has the multi-threading option, the bwa sampe runs only on a single thread. This constraint makes the command awfully slow and it is taking forever to run on just one sample.

03. Parallelization attempt by splitting reads with seqkit split2

So I made an attempt to run the command on multiple threads. In this case, I chose a relatively smaller genome and sequencing data of saccharomyces cerevisiae, just to do a quick check.

I split the fastq files with seqkit split2 command and used the GNU parallel command to parallelize the alignment part by bwa sampe. Then I merged all the bam files with the samtools merge command. For comparison, I also generated a bam file by the bwa mem command.

Unfortunately, I found the output of bwa mem and bwa sampe very different.

04. Attempt with bowtie2 tool

Then I also tried the bowtie2 tool with default options. But the output is quite different from bwa mem. Though mapping quality increased but the insert sizes are very different from our expectation. I believe, maybe by optimizing the options bowtie2 can be used for very short read lengths. But, the options seem quite difficult to comprehend.

At this point, my queries are:

  1. Is splitting fastq files by the samtools split2 command a reliable option?
  2. Can you recommend to me the optimum combination of options ( or related articles) of the bowtie2 tool for the alignment of 25-50 bp read length?
  3. What are other reliable tools for the alignment of DNA reads with very short lengths (25-50bp)?

Sorry, for the very long post. I wanted to explain the context before asking the question. Thanks in advance.

mapper aligner bwa alignment bowtie2 • 947 views
ADD COMMENT
1
Entering edit mode

bowtie aka bowtie1 was developed when these very short reads were state of the art. Id give that a try and see what comes out. Multithreaded, will scale.

ADD REPLY
1
Entering edit mode
ADD REPLY
1
Entering edit mode
5 months ago
Jack Tierney ▴ 410

Kudos for writing such a clear and well formatted question. I will give my perspective but don't claim to be an authority. Full disclosure my background is in ribosome profiling data processing/analysis in which we work exclusively with short ribosome protected fragments (~28-32nt) however, paired end sequencing is not possible in Ribo-Seq so I have a more cursory knowledge in that aspect.

I will address your queries in order:

  1. Is splitting fastq files by the samtools split2 command a reliable option?

Firstly, I believe this was just a typo but the tool you want to use prior to alignment is seqkit split2 which is designed to split FASTQ files. My understnading is samtools split will split BAM/SAM files. Splitting read files prior to alignment is commonly done to help parallelise the process. It is important to ensure that paired reads end up in the same file after splitting but seqkit split2 should do this for you.

  1. Can you recommend to me the optimum combination of options ( or related articles) of the bowtie2 tool for the alignment of 25-50 bp read length?

bowtie2 is a versatile aligner and I do not doubt that you can play with the parameters to obtain better alignments. Specifically you would need to shorten the seed length (eg. -L 20) and to specify end-to-end rather than local alignment (--end-to-end) is performed as this is more reliable for short reads. It would also make sense to lower the number of accepted mismatches in the reads to a maximum of 3 (-N 3).

However, the original bowtie although old was specifically designed for reads of this length and does end-to-end by default. It is also remarkably efficient. The release notes for bowtie2 (from 2011 so may not still hold) note that for reads <50bp bowtie1 is "sometimes faster and/or more sensitive". I have not done benchmarking even for our own single-end reads so this is anecdotal.

For both bowties you can specify the max and min insert length which might also help you achieve the results you expect.

  1. What are other reliable tools for the alignment of DNA reads with very short lengths (25-50bp)?

I think you should not need to look outside of bwa, bowtie2 and bowtie to get reliable results. But this benchmarking paper might be of interest if you are not satisfied with these options.

ADD COMMENT
0
Entering edit mode

Thank you too for a very nice and clear suggestion. Sure, I will check the bowtie2 options.

ADD REPLY
1
Entering edit mode
5 months ago

What are other reliable tools for the alignment of DNA reads with very short lengths (25-50bp)?

vmatch

ADD COMMENT
0
Entering edit mode

Thank you very much for your reply.

ADD REPLY

Login before adding your answer.

Traffic: 1768 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6