Hi all,
I used bwa to align paired-end reads to a reference. I'm trying to convert the SAM file to a BAM file, but am getting the following error:
[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] parse error at line 25199
[main_samview] truncated file.
It seems the QUAL field is split across two lines. I hit the same error earlier at line 25197. I found that read in the fastq files, and for read 1 the quality line started with three '@' symbols. I removed that read from both fastq files, but it appears there are other offending reads. The quality and sequence lines are the same length in the fastq files, but not in the SAM file.
This is the sequence from read 1 fastq:
@SALLY:355:C2JMJACXX:2:1101:2330:1996 1:N:0:GGCTAC
GTTGTGAAATAATTAAAATGTTGGCATTGATTGTGCATGTTTGTCACGTGCAAGAGGCATGCA
+
:11A===BB,2CF>BFBGC@CFHEC3ACC+<2A21:*:?G@@GD<?CG?F#0?DGHF1-)=CG
This is the sequence from read 2 fastq:
@SALLY:355:C2JMJACXX:2:1101:2330:1996 2:N:0:GGCTAC
GAAGACACCCGGGGTCATCATGGGATCATTCTGGTACTTTTTATGGGACACACGTGAACATCATGTGATCACATGCTGTGCATGCCTCTTGCACGTGACAA
+
BC<DDADDCDCDFB?1?9::?FBDG@?FFFIGIG9BGEFGIIECGF2FCF==BB@AA1?@@??CBDCCCDAAC@;-;;-5:>@A>@ACDACCCDD288?:>
Unfortunately it's not the fastq files. They check out with 0 errors from FastQValidator.
Do you have bioawk? Please also try this code:
I had to compile it quickly. No output for either fastq file, so they look good. I'm using the most recent bwa and samtools. The trouble is that this process worked for other samples, but not for these data. I know there are more offending lines in the SAM file.
I tried the same if statement on the SAM file, changing the -c flag to SAM. The first offending line number is still 25199, and continues to the end of the file.
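For anyone without bioawk, the same kind of check can be sketched in plain Python. This is only a rough illustration, not the code that was posted above: the function name is made up, and it assumes a tab-delimited SAM file with SEQ in column 10 and QUAL in column 11, header lines starting with '@', and '*' meaning "not stored". Lines with fewer than 11 columns are reported too, since a record broken in two by a stray newline would show up that way.

```python
def find_length_mismatches(sam_lines):
    """Return (line_number, seq_len, qual_len) for suspicious SAM records.

    Lines with fewer than the 11 mandatory columns are reported as
    (line_number, None, None).
    """
    bad = []
    for lineno, line in enumerate(sam_lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 11:
            bad.append((lineno, None, None))  # truncated or split record
            continue
        seq, qual = fields[9], fields[10]
        if seq == "*" or qual == "*":
            continue  # SEQ or QUAL not stored, nothing to compare
        if len(seq) != len(qual):
            bad.append((lineno, len(seq), len(qual)))
    return bad


if __name__ == "__main__":
    # Tiny in-memory example; the third record is split over two lines.
    demo = [
        "@HD\tVN:1.0\n",
        "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
        "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tII\n",
        "II\n",
    ]
    for entry in find_length_mismatches(demo):
        print(entry)
```

On a real file you would pass an open file handle instead of the list. Opening it with newline="\n" should keep Python from treating a stray carriage return as a line break, so the line numbers stay comparable to the ones samtools reports.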
I tried Picard's ValidateSamFile, and there are invalid fastq characters. I thought these looked odd when I first looked at the SAM file. They are boxes with letters and numbers in them.
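Those boxes are typically how an editor renders non-printable bytes. A quick way to see exactly which bytes they are is sketched below. It is only an illustration with a made-up function name, and it assumes the SAM convention that every QUAL character must be printable ASCII in the range '!' to '~' (codes 33 to 126).

```python
def find_invalid_qual_chars(qual):
    """Return (position, code_point) for each character outside '!'..'~'."""
    return [
        (pos, ord(ch))
        for pos, ch in enumerate(qual)
        if not 33 <= ord(ch) <= 126
    ]


if __name__ == "__main__":
    # Hypothetical QUAL string containing an ESC byte (code 27) at index 2.
    print(find_invalid_qual_chars("II\x1bI"))
```

Running this over column 11 of the records around the reported line should show whether the odd characters are low control codes, which would point to a quality offset problem rather than a damaged file.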
Maybe the problem was on the mapping step. Did you try running bwa again? Or a different mapper, e.g., bowtie2?
I think I narrowed the problem down to bwa aln. I had the -I flag on the command in a bash script left over from previous data. That flag indicates Phred+64 qualities, but these data are Phred+33. I'll know for sure once the alignments and sampe are complete. The other samples, however, did not raise an error, and they are from the same Illumina run.
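To illustrate why a wrong offset could produce exactly these symptoms, here is a small sketch. It assumes (not checked against the bwa source here) that -I converts qualities by subtracting 31, i.e. 64 minus 33, from every quality character. The function names are made up for the example. Under that assumption, any Phred+33 character below '@' lands in the control range, and ')' (code 41) becomes code 10, a newline, which would split the QUAL field over two lines.

```python
def shift_quality(qual, delta=31):
    """Shift each quality character down by delta (assumed effect of -I)."""
    return "".join(chr(max(ord(ch) - delta, 0)) for ch in qual)


def control_positions(shifted):
    """Return (position, code_point) for characters below '!' (code 33)."""
    return [(pos, ord(ch)) for pos, ch in enumerate(shifted) if ord(ch) < 33]


if __name__ == "__main__":
    # Quality string of the read 1 record quoted earlier in this thread.
    read1_qual = ":11A===BB,2CF>BFBGC@CFHEC3ACC+<2A21:*:?G@@GD<?CG?F#0?DGHF1-)=CG"
    shifted = shift_quality(read1_qual)
    print("newline introduced:", "\n" in shifted)
    print("first control chars:", control_positions(shifted)[:5])
```

If the assumption holds, it would also explain why only some reads break: a record is only split when its quality string happens to contain a character that maps onto a newline.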
It seems your data are 100 bp paired-end reads. According to the BWA FAQ, bwa mem is preferred over bwa aln:
"There are three algorithms, which one should I choose?
For 70bp or longer Illumina, 454, Ion Torrent and Sanger reads, assembly contigs and BAC sequences, BWA-MEM is usually the preferred algorithm. For short sequences, BWA-backtrack may be better. BWA-SW may have better sensitivity when alignment gaps are frequent."
Is SAM output the default for bwa mem?
Yes, SAM is the output from bwa mem.