Question

What is corrupting my SAM files?

0

6.6 years ago

arudhir ▴ 10

I frequently run into the problem where my SAM file (obtained from Bowtie2) is corrupted. The most common problem is that the lengths of the sequence, CIGAR string, and quality scores within a record disagree (confirmed by the output of both samtools and Picard's ValidateSamFile). This happens across many different samples, and I have spent hours, if not days, trying to figure out why.

For my sequencing analysis pipelines, I end up using bbmap's reformat.sh tool, which can discard bad reads, but this feels unsatisfying: it just ignores whatever larger problem is occurring.

Has anyone else experienced this? Could the problem be with the fastq files I start out with? Could anyone shed some light on what is going on behind the scenes in Bowtie that is causing this problem? Is having to deal with corrupted SAM reads unavoidable? How do you guys handle this problem?

Below are the details of my setup; let me know if you want more information:

Bowtie2 version: 2.3.3.1
samtools version: 1.7
htslib: 1.7
Linux version 4.4.0-1052-aws (buildd@lgw01-amd64-031) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9) ) #61-Ubuntu SMP Mon Feb 12 23:05:58 UTC 2018

sequence sam bam samtools • 1.9k views

1

Find some erroneous records, and post them here. If you are using a public reference genome, tell us which. In this case, try to create a test fastq file which reproduces the error, and make it available, along with the commands you used.
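One way to pull out such records is sketched below in Python. It is only a minimal sketch, assuming an uncompressed, tab-delimited SAM file; the function names are hypothetical. It flags records where the SEQ length, the QUAL length, and the read length implied by the CIGAR string disagree, which are the symptoms described in the question.

```python
import re

# CIGAR operations that consume bases of the read, per the SAM specification.
QUERY_CONSUMING = set("MIS=X")
CIGAR_RE = re.compile(r"(?:\d+[MIDNSHP=X])+")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def check_record(line):
    """Return a list of problems found in one SAM alignment line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return ["fewer than 11 mandatory fields"]
    cigar, seq, qual = fields[5], fields[9], fields[10]
    problems = []
    # "*" means the field is not stored, so there is nothing to compare.
    if seq != "*" and qual != "*" and len(seq) != len(qual):
        problems.append(f"SEQ length {len(seq)} != QUAL length {len(qual)}")
    if cigar != "*":
        if not CIGAR_RE.fullmatch(cigar):
            problems.append(f"malformed CIGAR {cigar!r}")
        elif seq != "*":
            implied = sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                          if op in QUERY_CONSUMING)
            if implied != len(seq):
                problems.append(f"CIGAR implies {implied} bases, SEQ has {len(seq)}")
    return problems


def find_bad_records(lines):
    """Yield (line_number, record, problems) for every suspicious alignment line."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("@"):  # header lines carry no alignment fields
            continue
        problems = check_record(line)
        if problems:
            yield number, line.rstrip("\n"), problems


# Small in-memory demonstration with made-up records.
demo = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t42\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t0\tchr1\t100\t42\t5M\t*\t0\t0\tACGT\tIIIII",
]
for number, record, problems in find_bad_records(demo):
    print(f"line {number}: {'; '.join(problems)}")
    print(record)
```

To scan a real file, pass an open file handle instead of the demo list, for example `find_bad_records(open("aln.sam"))`, where `aln.sam` is a placeholder name.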

Is having to deal with corrupted SAM reads unavoidable? How do you guys handle this problem?

No, having to deal with systematic corruption of SAM files is the exception, so we need a reproducible example to see what is happening.

ADD REPLY • link 6.6 years ago by h.mon 35k

0

Hello arudhir,

The most interesting part is the exact commands you use, from your raw data up to the step where you notice that the SAM file is corrupted.

Are you using any quality-trimming program after the alignment step? Since the CIGAR strings and quality scores are derived from the fastq files during alignment, I cannot believe that the fastq files are the problem.

fin swimmer

ADD REPLY • link 6.6 years ago by finswimmer 16k

score 0 · Answer 1 · 2018-04-07

Thanks for the answers guys, but I think it's been solved (running my pipeline right now, no issues yet!)

The Bowtie2 update from Dec. 29th, 2017 addressed a problem with corrupted SAM output when using multiple threads.

Which is fantastic, because now I can just pipe everything and save on the unnecessary I/O I was forced to do.
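For anyone wondering what the piping looks like, here is a rough Python sketch. It is an assumption about the setup, not my exact pipeline: it assumes paired-end reads and that `bowtie2` and `samtools` are on the PATH, and the function names, file names, and thread count are placeholders. Bowtie2 writes SAM to standard output when no `-S` file is given, and `samtools sort` reads it from standard input, so no intermediate SAM file is written.

```python
import subprocess


def build_commands(index, reads_1, reads_2, out_bam, threads=4):
    """Return the bowtie2 and samtools argument lists for a piped alignment."""
    # Without -S, bowtie2 writes SAM to standard output.
    bowtie2 = ["bowtie2", "-p", str(threads), "-x", index,
               "-1", reads_1, "-2", reads_2]
    # The trailing "-" tells samtools sort to read from standard input.
    samtools = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return bowtie2, samtools


def align_and_sort(index, reads_1, reads_2, out_bam, threads=4):
    """Run bowtie2 piped into samtools sort and fail loudly if either step fails."""
    bowtie2_cmd, samtools_cmd = build_commands(
        index, reads_1, reads_2, out_bam, threads)
    aligner = subprocess.Popen(bowtie2_cmd, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(samtools_cmd, stdin=aligner.stdout)
    # Close our copy of the pipe so bowtie2 sees SIGPIPE if samtools exits early.
    aligner.stdout.close()
    sorter_rc = sorter.wait()
    aligner_rc = aligner.wait()
    if aligner_rc != 0 or sorter_rc != 0:
        raise RuntimeError(
            f"pipeline failed: bowtie2={aligner_rc}, samtools={sorter_rc}")
```

Checking both exit codes matters here, because a pipe otherwise hides a failed aligner behind a sorter that exits cleanly on truncated input.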