I frequently run into the problem where my SAM file (obtained from Bowtie2) is butchered. The most common problems are differences in lengths between sequences, CIGAR strings, and quality scores (confirmed by both the output from samtools and Picard's ValidateSamFile). This happens across many different samples and I have spent hours if not days trying to figure out why this happens.
For my sequencing analysis pipelines, I end up having to use bbmap's reformat.sh tool, which can toss bad reads, but this feels like an unsatisfying solution that is just ignoring some greater problem that is occurring.
Has anyone else experienced this? Could the problem be with the fastq files I start out with? Could anyone shed some light on what is going on behind the scenes in Bowtie that is causing this problem? Is having to deal with corrupted SAM reads unavoidable? How do you guys handle this problem?
Below are details of my setup; let me know if you want more information:
- Bowtie2 version: 2.3.3.1
- samtools version: 1.7
- htslib: 1.7
- Linux version 4.4.0-1052-aws (buildd@lgw01-amd64-031) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9) ) #61-Ubuntu SMP Mon Feb 12 23:05:58 UTC 2018
Find some erroneous records, and post them here. If you are using a public reference genome, tell us which. In this case, try to create a test fastq file which reproduces the error, and make it available, along with the commands you used.
No, having to deal with systematic corruption of SAM files is the exception, so we need a reproducible example to see what is happening.
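To pull out candidate records for posting, something like the following might help. It is only a minimal sketch in plain Python, not part of samtools or Picard, and the function names (`cigar_query_length`, `check_record`, `find_bad_records`) are made up for illustration. It assumes an uncompressed, tab-separated SAM file and checks just the two mismatches mentioned in the question: SEQ length vs. QUAL length, and SEQ length vs. the number of read bases implied by the CIGAR string.

```python
import re
import sys

# One CIGAR operation: a count followed by an operator from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operators that consume bases of the read (query) sequence.
QUERY_CONSUMING = set("MIS=X")


def cigar_query_length(cigar):
    """Return the read length implied by a CIGAR string, or None for '*'."""
    if cigar == "*":
        return None
    ops = CIGAR_OP.findall(cigar)
    # Reject strings containing anything other than valid count/operator pairs.
    if not ops or "".join(n + op for n, op in ops) != cigar:
        raise ValueError("malformed CIGAR: %r" % cigar)
    return sum(int(n) for n, op in ops if op in QUERY_CONSUMING)


def check_record(line):
    """Return a list of problem descriptions for one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return ["fewer than 11 mandatory fields"]
    cigar, seq, qual = fields[5], fields[9], fields[10]
    problems = []
    if seq != "*" and qual != "*" and len(seq) != len(qual):
        problems.append("SEQ length %d != QUAL length %d" % (len(seq), len(qual)))
    try:
        qlen = cigar_query_length(cigar)
    except ValueError as err:
        problems.append(str(err))
        qlen = None
    if qlen is not None and seq != "*" and qlen != len(seq):
        problems.append("CIGAR implies %d bases, SEQ has %d" % (qlen, len(seq)))
    return problems


def find_bad_records(lines):
    """Yield (line_number, line, problems) for each inconsistent record."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("@"):  # header lines
            continue
        problems = check_record(line)
        if problems:
            yield number, line.rstrip("\n"), problems


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        for number, line, problems in find_bad_records(handle):
            print("line %d: %s" % (number, "; ".join(problems)))
            print(line)
```

Saved under a name of your choosing and run with the path to the SAM file as its only argument, it prints each suspicious line together with the reason, which makes it easier to trace the read names back to the original fastq records and build a small test case.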
Hello arudhir,
The most interesting part is the exact commands you use, from your raw data up to the step where you notice that the SAM file is corrupted.
Are you using any quality-trimming program after the alignment step? As the CIGAR strings and quality scores are derived from the fastq files during the alignment, I cannot believe that the fastq files are the problem.
fin swimmer