Question

ERROR: SAM file generated by BWA is corrupted and cannot be processed with SAMTOOLS

0

Entering edit mode

5 months ago

Luwell • 0

Hi, I am new to metagenomic data analysis and encountered an issue when removing host sequences from my data. After aligning my reads using bwa mem and generating a SAM file, I tried to process it with samtools, but the file appears to be corrupted.

Below is the workflow I used

#1.Align reads to the reference genome 
bwa mem -t 8 ${index} ${R1} ${R2} > ${SAM}
#2.Remove host sequences and generate clean FastQ files
samtools view -bS ${SAM} | samtools fastq -f 4 -1 ${CLEAN_R1} -2 ${CLEAN_R2}

The SAM file was generated successfully, and the file size looks normal.

enter image description here

However, this resulted in garbled output and errors.

enter image description here

I tried to debug.I inspected the content of the SAM file with the following command:

head -n 560 L1EIJ2304789.aligned.sam

Here is the output.It seems the SAM file contains extra messages (e.g., [M::process] and [M::mem_pestat]) that interfere with its format.

1  [M::bwa_idx_load_from_disk] read 0 ALT contigs
2  @SQ SN:NC_000001.11 LN:248956422
3  @SQ SN:NT_187361.1 LN:175055
4  @SQ SN:NT_187362.1 LN:32032
5  @SQ SN:NT_187363.1 LN:127682
6  @SQ SN:NT_187364.1 LN:66860
...
542 @SQ SN:NT_187583.1 LN:204059
543 @SQ SN:NW_003315936.1 LN:154407
544 @SQ SN:NW_003871073.1[M::process] read 530422 sequences (80000034 bp)...
545 [M::process] read 530342 sequences (80000264 bp)...
546 [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 225, 0, 0)
547 [M::mem_pestat] skip orientation FF as there are not enough pairs
548 [M::mem_pestat] analyzing insert size distribution for orientation FR...
549 [M::mem_pestat] (25, 50, 75) percentile: (23, 23, 24)
550 [M::mem_pestat] low and high boundaries for computing mean and std.dev: (21, 
26)
551 [M::mem_pestat] mean and std.dev: (23.48, 0.68)
552 [M::mem_pestat] low and high boundaries for proper pairs: (20, 27)
553 [M::mem_pestat] skip orientation RF as there are not enough pairs
554 [M::mem_pestat] skip orientation RR as there are not enough pairs
555 [M::mem_process_seqs] Processed 530422 reads in 176.080 CPU sec, 128.994 real 
sec LN:200998
556 @SQ SN:NW_003871074.1 LN:191409
557 @SQ SN:NT_187582.1 LN:67707

I re-downloaded the reference genome and re-generated the BWA index to ensure there were no issues with the reference files. Additionally, I re-ran the bwa mem alignment multiple times and tested different samples to rule out sample-specific problems. Despite these efforts, the issue with the SAM file persists.

I havent been able to find a definitive solution to this issue. How should I address this problem? Have I missed any critical steps in my workflow? Any guidance or suggestions would be greatly appreciated! Thank you in advance.

samtools bwa NGS alignment sam • 997 views

ADD COMMENT • link updated 5 months ago by jkbonfield ★ 1.3k • written 5 months ago by Luwell • 0

0

Entering edit mode

I tried to process it with samtools, but the file appears to be corrupted.

What operation did you try to do (e.g. convert to BAM/sort)? Post the exact error message.

ADD REPLY • link 5 months ago by GenoMax 152k

0

Entering edit mode

Thank you, I have re-edited my question. I aligned the reads using bwa mem and generated the SAM file, but it seems the SAM file contains a lot of extra information. Each sample generates a SAM file with the same issue. I have tried re-running bwa mem, re-downloading the reference genome, and regenerating the index, but the problem persists.

ADD REPLY • link 5 months ago by Luwell • 0

score 2 · Accepted Answer · 2025-02-06

2

Entering edit mode

5 months ago

GenoMax 152k

Problem is because you used a redirect (>) to create the SAM file that seems to have also captured STDERR output in SAM file.

You could pipe the output of bwa directly to the fastq like below:

bwa mem -t 8 ${index} ${R1} ${R2} | samtools fastq -f 4 -1 ${CLEAN_R1} -2 ${CLEAN_R2}

ADD COMMENT • link 5 months ago by GenoMax 152k

0

Entering edit mode

True, there is clearly the bwa log messages in the SAM file, but from above code that should not even happen. Anyway, the file is clearly corrupted so you must start over.

ADD REPLY • link 5 months ago by ATpoint 88k

0

Entering edit mode

Thank you so much. Using the pipe (|) actually solved the problem.I never thought there would be a problem with the redirect.

ADD REPLY • link 5 months ago by Luwell • 0

0

Entering edit mode

It's probably a result of using nohup, or I think some batch schedulers may do this internally and suffer from similar issues of mixing stdout and stderr together. For this reason, you're always better off using a -o output_file type option if the tool permits it (which bwa mem does).

ADD REPLY • link 5 months ago by jkbonfield ★ 1.3k