ERROR: SAM file generated by BWA is corrupted and cannot be processed with SAMTOOLS
1
0
Entering edit mode
9 weeks ago
Luwell • 0

Hi, I am new to metagenomic data analysis and encountered an issue when removing host sequences from my data. After aligning my reads using bwa mem and generating a SAM file, I tried to process it with samtools, but the file appears to be corrupted.

Below is the workflow I used

#1.Align reads to the reference genome 
bwa mem -t 8 ${index} ${R1} ${R2} > ${SAM}
#2.Remove host sequences and generate clean FastQ files
samtools view -bS ${SAM} | samtools fastq -f 4 -1 ${CLEAN_R1} -2 ${CLEAN_R2}

The SAM file was generated successfully, and the file size looks normal.

enter image description here

However, this resulted in garbled output and errors.

enter image description here

I tried to debug.I inspected the content of the SAM file with the following command:

head -n 560 L1EIJ2304789.aligned.sam

Here is the output.It seems the SAM file contains extra messages (e.g., [M::process] and [M::mem_pestat]) that interfere with its format.

1  [M::bwa_idx_load_from_disk] read 0 ALT contigs
2  @SQ SN:NC_000001.11 LN:248956422
3  @SQ SN:NT_187361.1 LN:175055
4  @SQ SN:NT_187362.1 LN:32032
5  @SQ SN:NT_187363.1 LN:127682
6  @SQ SN:NT_187364.1 LN:66860
...
542 @SQ SN:NT_187583.1 LN:204059
543 @SQ SN:NW_003315936.1 LN:154407
544 @SQ SN:NW_003871073.1[M::process] read 530422 sequences (80000034 bp)...
545 [M::process] read 530342 sequences (80000264 bp)...
546 [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 225, 0, 0)
547 [M::mem_pestat] skip orientation FF as there are not enough pairs
548 [M::mem_pestat] analyzing insert size distribution for orientation FR...
549 [M::mem_pestat] (25, 50, 75) percentile: (23, 23, 24)
550 [M::mem_pestat] low and high boundaries for computing mean and std.dev: (21, 
26)
551 [M::mem_pestat] mean and std.dev: (23.48, 0.68)
552 [M::mem_pestat] low and high boundaries for proper pairs: (20, 27)
553 [M::mem_pestat] skip orientation RF as there are not enough pairs
554 [M::mem_pestat] skip orientation RR as there are not enough pairs
555 [M::mem_process_seqs] Processed 530422 reads in 176.080 CPU sec, 128.994 real 
sec LN:200998
556 @SQ SN:NW_003871074.1 LN:191409
557 @SQ SN:NT_187582.1 LN:67707

I re-downloaded the reference genome and re-generated the BWA index to ensure there were no issues with the reference files. Additionally, I re-ran the bwa mem alignment multiple times and tested different samples to rule out sample-specific problems. Despite these efforts, the issue with the SAM file persists.

I havent been able to find a definitive solution to this issue. How should I address this problem? Have I missed any critical steps in my workflow? Any guidance or suggestions would be greatly appreciated! Thank you in advance.

samtools bwa NGS alignment sam • 631 views
ADD COMMENT
0
Entering edit mode

I tried to process it with samtools, but the file appears to be corrupted.

What operation did you try to do (e.g. convert to BAM/sort)? Post the exact error message.

ADD REPLY
0
Entering edit mode

Thank you, I have re-edited my question. I aligned the reads using bwa mem and generated the SAM file, but it seems the SAM file contains a lot of extra information. Each sample generates a SAM file with the same issue. I have tried re-running bwa mem, re-downloading the reference genome, and regenerating the index, but the problem persists.

ADD REPLY
2
Entering edit mode
9 weeks ago
GenoMax 150k

Problem is because you used a redirect (>) to create the SAM file that seems to have also captured STDERR output in SAM file.

You could pipe the output of bwa directly to the fastq like below:

bwa mem -t 8 ${index} ${R1} ${R2} | samtools fastq -f 4 -1 ${CLEAN_R1} -2 ${CLEAN_R2}
ADD COMMENT
0
Entering edit mode

True, there is clearly the bwa log messages in the SAM file, but from above code that should not even happen. Anyway, the file is clearly corrupted so you must start over.

ADD REPLY
0
Entering edit mode

Thank you so much. Using the pipe (|) actually solved the problem.I never thought there would be a problem with the redirect.

ADD REPLY
0
Entering edit mode

It's probably a result of using nohup, or I think some batch schedulers may do this internally and suffer from similar issues of mixing stdout and stderr together. For this reason, you're always better off using a -o output_file type option if the tool permits it (which bwa mem does).

ADD REPLY

Login before adding your answer.

Traffic: 1929 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6