I'm running BWA to align my reads to a big reference. I started using samtools view to obtain directly the BAM file, but when I tried to sort the BAM, it gave me the error:
[E::sam_hdr_sanitise] Malformed SAM header at line 2
So I tried to use Picard CleanSam to try and solve any possible problem with unmapped reads, but it gave me the next error:
[Tue Apr 14 19:04:47 CEST 2020] picard.sam.CleanSam done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2039414784
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -2083544369
at java.lang.String.checkBounds(String.java:381)
at java.lang.String.<init>(String.java:324)
at htsjdk.samtools.util.StringUtil.bytesToString(StringUtil.java:301)
at htsjdk.samtools.util.BinaryCodec.readString(BinaryCodec.java:478)
at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:701)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:396)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:208)
at picard.sam.CleanSam.doWork(CleanSam.java:78)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
So I tried to directly use Samtools sort instead of Samtools view to sort the SAM as it came from BWA, but to no avail, same error:
[E::sam_hdr_sanitise] Malformed SAM header at line 2
I'm getting this error in approximatedly half of the samples. The header of the BAM files looks like this:
@SQ SN:abyss_v4.2_1 LN:73
@SQ SN:abyss_v4.2_2 LN:86
@SQ SN:abyss_v4.2_3 LN:86
@SQ SN:abyss_v4.2_4 LN:133
@SQ SN:abyss_v4.2_5 LN:154
@SQ SN:abyss_v4.2_6 LN:213
@SQ SN:abyss_v4.2_7 LN:74
@SQ SN:abyss_v4.2_8 LN:71
@SQ SN:abyss_v4.2_9 LN:71
And the code used is:
bwa mem -t 12 -M REFERENCEPATH READFILEPATH | samtools sort -@12 -o BAMFILEPATH
Any ideas about what can cause this problem?
Cheers!
What is your samtools version and how much memory is available at the node you use? How many jobs run in parallel?
Hi there, I realised I was not using the latest Samtools, it was Samtools 1.09 (it was listed as 1.9, that's where my mistake comes from), latest samtools gives me the next error:
But as I asked in the reply below, are they really that long? Should I trim the names of the reference or just cut the redundant part of the headers in the BAM file?
The node has 128 GB, This process was specified to use 100GB.
Ok, then I suggest you upgrade samtools as many things have improved since that ancient version. Yes, there is a maximum length for the header, see https://github.com/samtools/samtools/issues/1105
It is suspicious though that the error only ocurs with some samples. If it was a header problem then it should happen with every file that uses the same reference, right? Maybe the problem goes away with the most recent version? There were a changes in samtools to better handle long reads, maybe they made changes for the header too, Just try it. 1.09 is ancient, so this can indeed be a problem. Just align a few reads from the problematic files to test it.