Picard Output is malformed
1
1
Entering edit mode
8.6 years ago
haiying.kong ▴ 360

I have WES data and processed it with BWA, Picard, and GATK. All other samples finished nicely, except one sample that keeps giving error message.

After running Picard, the file sizes for correct sample are:

-rwxrwxrwx  1 xxxx C050     7539504 Apr 25 12:34 T146714.readgroups.bai
-rwxrwxrwx  1 xxxx C050 12181279890 Apr 25 11:32 T146714.readgroups.bam
-rwxrwxrwx  1 xxxx C050        1309 Apr 25 09:23 T146714.metrics.txt
-rwxrwxrwx  1 xxxx C050 12150347476 Apr 25 09:23 T146714.dedupped.bam
-rwxrwxrwx  1 xxxx C050 68183740068 Apr 25 04:08 T146714.sorted.sam

the file sizes for the sample in trouble are:

-rwxrwxrwx  1 xxxx C050           8 Apr 27 19:47 T15520.readgroups.bai
-rwxrwxrwx  1 xxxx C050   422055702 Apr 27 19:47 T15520.readgroups.bam
-rwxrwxrwx  1 xxxx C050   420992459 Apr 27 19:45 T15520.dedupped.bam
-rwxrwxrwx  1 xxxx  C050 44875211052 Apr 27 19:43 T15520.sorted.sam

but Picard did not give any error message. Only when running GATK, I got error message:

<h5>ERROR MESSAGE: File /home//Picard/T15520.readgroups.bai is malformed: Premature end-of-file while reading BAM index file</h5>

/home/Picard/T15520.readgroups.bai. It's likely that this file is truncated or corrupt -- Please try re-indexing the corresponding BAM file

Apparently, some thing went wrong at the step of Picard.

Could any one please tell me how to fix this problem?

next-gen picard • 4.8k views
ADD COMMENT
1
Entering edit mode
8.6 years ago
John 13k

I think this is a known bug - or at least, people experience it a lot. Long story short, one of the programs in Picard doesn't always make indexes right (i cant remember if its MarkDuplicates, or something else), and you have to delete the .bai file and re-make it. Just delete the .bai, re-run your last command, and Picard will tell you want to do next :)

ADD COMMENT
0
Entering edit mode

I just remembered something, and I think it's only fair to the Picard developers to mention it - this may not strictly be bug in Picard.

I think it happens because the process which makes the .bai file (probably MarkDuplicates), starts by writing the header of the bai to disk, then generates the BAM you wanted, then at the end fills out the rest of the bai index. But it only does this if there isn't a bai already there.

If you Ctrl+C the Picard process, you end up with a half-made bai file. If you run MarkDuplicates again through to completion, because the bai was already there, Picard doesn't make a new one, and you end up in the situation you are in.

I base all of this on my own experience and without looking at the Picard code, so i might be wrong - but it would be interesting haiying if you can confirm that you, at some point, terminated a Picard process before it finished and then re-ran it...?

ADD REPLY
1
Entering edit mode

Dear John,

Thank you so much for looking into this problem. In the beginning when I started to run my pipeline, I did restart the program because I wanted to adjust my pipeline. But I wrote my pipeline in the way that every time I run it, it will clean up the folder for saving results in every step. Yesterday I left in hurry, and rerun the pipeline only for the failed sample. This does not clean up the folders, and I could not have time to go to each folder to delete results for the failed sample. The outputs from Picard are:

-rwxrwxrwx 1 xxxx C050 0 Apr 29 00:37 T15520.sorted.sam -rwxrwxrwx 1 xxxx C050 8 Apr 27 19:47 T15520.readgroups.bai -rwxrwxrwx 1 xxxx C050 422055702 Apr 27 19:47 T15520.readgroups.bam -rwxrwxrwx 1 xxxx C050 420992459 Apr 27 19:45 T15520.dedupped.bam

You are right, the .bai file is still the one from April 27th.

My Picard tools was downloaded Aug. last year. picard-tools-1.138. I will upgrade it and try again. To be safe, I will delete .bai file before starting my pipeline for the sample.

Thanks again.

ADD REPLY
0
Entering edit mode

You're welcome Haiying! And thank you so much for getting back to us -- I will post our experiences on the GATK forums, and hopefully they can use that to fix this user-error bug so it doesn't happen again for other people :)

ADD REPLY
0
Entering edit mode

I did check the outcome this morning, and it is still not working.

The sample that fails at Picard has output from Picard as:

-rwxrwxrwx 1 xxxx C050           0 Apr 29 21:33 T15520.readgroups.bai
-rwxrwxrwx 1 xxxx C050   422055146 Apr 29 21:33 T15520.readgroups.bam
-rwxrwxrwx 1 xxxx C050   420993638 Apr 29 21:31 T15520.dedupped.bam
-rwxrwxrwx 1 xxxx C050 44875211052 Apr 29 21:29 T15520.sorted.sam

The correct sample has output from Picard as:

-rwxrwxrwx 1 xxxx C050     7539504 Apr 25 12:34 T146714.readgroups.bai
-rwxrwxrwx 1 xxxx C050 12181279890 Apr 25 11:32 T146714.readgroups.bam
-rwxrwxrwx 1 xxxx C050        1309 Apr 25 09:23 T146714.metrics.txt
-rwxrwxrwx 1 xxxx C050 12150347476 Apr 25 09:23 T146714.dedupped.bam
-rwxrwxrwx 1 xxxx C050 68183740068 Apr 25 04:08 T146714.sorted.sam

Some things are not quite right. For the failed sample, dedupped.bam has much smaller file size than sorted.sam. Also, there is no metrics.txt file generated.

The commands I have used for Picard process are:

#########################################
# Picard:
java -Xmx4g -Djava.io.tmpdir=tmp -jar ${Picard}picard.jar SortSam     \
  INPUT=${BWA_dir}${sample}.sam OUTPUT=${Picard_dir}${sample}.sorted.sam SORT_ORDER=coordinate
java -Xmx4g -Djava.io.tmpdir=tmp -jar ${Picard}picard.jar MarkDuplicatesWithMateCigar     \
  INPUT=${Picard_dir}${sample}.sorted.sam OUTPUT=${Picard_dir}${sample}.dedupped.bam     \
  METRICS_FILE=${Picard_dir}${sample}.metrics.txt REMOVE_DUPLICATES=true ASSUME_SORTED=true
IFS='_' read -a RG <<< "$sample"
java -Xmx4g -Djava.io.tmpdir=tmp -jar ${Picard}picard.jar AddOrReplaceReadGroups     \
  I=${Picard_dir}${sample}.dedupped.bam O=${Picard_dir}${sample}.readgroups.bam     \
  RGLB=$sample RGPL=illumina RGPU=$RG[1] RGSM=$sample
java -Xmx4g -Djava.io.tmpdir=tmp -jar ${Picard}picard.jar BuildBamIndex     \
  INPUT=${Picard_dir}${sample}.readgroups.bam OUTPUT=${Picard_dir}${sample}.readgroups.bai

I used this same commands for all samples.

ADD REPLY
0
Entering edit mode

The log for the failed sample from dedupping process is:

I think something went wrong at this step, but i'm not sure what.

ADD REPLY
0
Entering edit mode

Thank you for the detailed reply haiying - although next time instead of pasting a log straight into the forum, please put it in a public paste website like gist - this will help us read it with the right formatting, you wont have to split it over multiple posts :)

So from that error it does indeed look like Picard made a mistake. This time i'll say it's Picard's fault, because the error that it's complaining about is purely an implementation issue, and not your fault. Here's the comment from Picard that explains what's going on:

Check that we are not incorrectly performing any duplicate marking, by having too few of the records. This can happen if the alignment start is increasing but 5' soft-clipping is increasing such that we miss reads with the same 5' unclipped alignment start. This is especially true for RNAseq.

So yeah, please re-run Picard with the extra option MINIMUM_DISTANCE=200 If you still get the error, increase the 200 to 1000 or something... but 200 should be fine.

ADD REPLY
1
Entering edit mode

It seems like working now :) Thank you so much :)

ADD REPLY
0
Entering edit mode

Dear John,

I have upgraded all my software tools, BWA, Picard Tools, and GATK. Now I am having the same problem again that .bai file is malformed. This happens to some of my samples. I will increase MINIMUM_DISTANCE to some thing larger than 200, but is there any negative consequence to set very large number for this particular parameter? Can I set any integer for this parameter?

Thank you very much.

ADD REPLY
0
Entering edit mode

In fact, in the error or log file, I found:

Exception in thread "main" picard.PicardException: Found a samRecordWithOrdinal with sufficiently large clipping that we may have missed including it in an early duplicate marking iteration. Please increase the minimum distance to at least 205bp to ensure it is considered (was 200).

But is there any down side for increasing this minimum distance?

ADD REPLY
0
Entering edit mode

Well there's certainly a downside for not increasing the minimum distance - which is you're still stuck :P

But no I don't think it will make a difference. I think it just means Picard will use a little more memory when looking for duplicates and there's no downside at all in terms of the final result... which is why previously i said that Picard is to blame and not you, because the implementation should just figure this stuff out for you.

ADD REPLY
0
Entering edit mode

Hi John,

Thank you very much for the explanation. I increased MINIMUM_DISTANCE to 550, and gave each job 50G RAM. Things are running fine, but it seems like taking very long to get Picard done. Do you think increasing the minimum distance slows down the process, because they have more to look at?

Thank you so much.

ADD REPLY
0
Entering edit mode

mrhmhm... I wouldn't have thought so. But I could be wrong. If it doesn't finish by tomorrow I would send an e-mail to the author's of Picard and get their opinion. You can just point them to this thread :)

ADD REPLY
0
Entering edit mode

I don't remember having this problem when I recently used MarkDuplicates (Picard v. 2.1.1). So if OP is not using the latest Picard then the easy solution would be to upgrade to the latest Picard.

ADD REPLY
0
Entering edit mode

Dear John,

You seem to be very knowledgeable. So I would like to ask you one more question.

On the website: Piping Markduplicates It is said that markduplicate tool in Picard does not support pipes. But in fact, I did use named pipes for all Picard tools. I compared the results from the results without any pipes, and they seem to be identical.

I am wondering how it can be possible to do duplicate identification in pipe. Pipe is getting only a portion of file, and how can we decide if a sequence is duplicate or not if we do not have full file? Or is it that the full file is piped in to a large buffer?

ADD REPLY
0
Entering edit mode

Hi Haiying :) Glad to hear things are going well for you now if the only problem is temporary files/piping!

If you managed to get MarkDuplicates to work on named pipes, then that is surprising. Perhaps Picard MarkDuplicates now supports pipes? In the link you posted in your comment, Pierre says MarkDuplicates does not support pipes, however that was written 3 years ago.

The problem with duplicates and pipes is that with paired-end data we need both pairs to mark it as a duplicate (or not). The first read in a pair could map to chr1:10000, and the last read in the pair map to chrY:999999. This means that, when reading from a pipe, we need to keep the first read in memory for the whole time we look for duplicates, because we cannot read the first read again. Actually it's worse than that, because it means we cannot write the first read to the output file until we know that it definitely is/isn't a duplicate. So ALL reads from chr1:10000 to chrY:999999 have to be stored in memory.

However, if we read data from a file, this is not a problem. The first read will say that the last read maps to chrY:999999, so we can jump straight to chrY:999999, then jump back to where we were before. Typically, a program like MarkDuplicates will do that sort of extra work when it looks like it's going to run out of memory. Since you have given Picard 50Gb of memory, perhaps when reading from your named pipe it never needed to do a "jump" like this, it could keep everything in memory the whole time (which is much faster than doing a jump) and so there where no problems.

In other words, I wouldn't be surprised if Picard can read from a pipe if you give it a lot of memory --- but it will depend on the data.

The real solution to this problem is re-writable BAM files, since then you don't need to mess around with pipes at all. You just modify in-place your original BAM file. I started layout out a rewritable BAM format last year, but the project lost interest :(

ADD REPLY
1
Entering edit mode

Rewritable BAM file can be very challenging, but can be very helpful. Until it is available, I have to live with temp files or pipes :)

ADD REPLY
0
Entering edit mode

Dear John,

Thank you very much for your quick reply.

The files in sam format are 25-35G, no more than 40G. Should this be safe if I give 55G? I found that regardless I use pipe or not, and regardless how much memory I assign to with PBS, when Picard is running, if I check with top, the memory is almost all used. I have total 250G on my machine, and the used memory is always very close to this. These are WESs, and for WGSs, I will not use pipes for Picard markduplicates.

Also, from GATK output, which files are worth of saving? I do preprocessing with BWA, Picard, and GATK. .bam and .bai files from GATK Base quality score recalibration for sure. What else should I keep?

Thank you so much.

ADD REPLY
0
Entering edit mode

I can't say anything for sure because I never used pipes :( I was lazy and just wrote things to disk. I also only have 64Gb RAM. With GATK i also did BQSR and indel-realignment, but many people say that these days you don't need to do indel-realignment due to the callers doing built-in local realignment. But I wasn't doing indel-realignment just for calling SNPs, but also for having better data for ChIP-Seq, etc.

Anyway, as for what i kept after doing all this work, I just kept the original unmapped BAMs (all the same data as the FASTQ), and the final output BAM which had been mapped with BWA, no duplicates, BQSR'd and indelRealigned, with the old quality scores included in the final "clean" BAM. I'm not sure keeping the old quality scores in the final clean.bam was a good idea though, since it is in the unmapped BAM, so i should have used the option to leave that out. All intermediary files were deleted. This can feel a bit upsetting when they took so long to generate, but it's OK :)

ADD REPLY
1
Entering edit mode

But by using pipes for BWA and Picard, I can save more than a third time, so I would prefer to use pipes :) I test ran on truncated fastq, only 40,000,000 lines paired data. And the results from no pipe and with pipe are exactly same.

Now I am running on full data, and will let you know if it works.

Thank you so much for your help.

ADD REPLY
0
Entering edit mode

Dear John,

I submitted 8 jobs before I left yesterday, 4 jobs running simultaneously. Only 3 of them completed without error. In addition, I had another job running on another machine. Here is comparison table for bam and bai files running with and without pipes. I should decompress bams and compare sam file size, but it could take much time.

           FileName         FileSize_WithoutPipe    FileSize_WithPipe
1 B100327.recal.bai              6796720           6796960
2 B100327.recal.bam          13332695788       13332687801
3 B102360.recal.bai              6802984           6803016
4 B102360.recal.bam          12228462715       12230671940
5 B102513.recal.bai              6751800           6751784
6 B102513.recal.bam          11917276543       11917266800
7  B83115.recal.bai              6970736           6970688
8  B83115.recal.bam          15267059755       15267058789

I think it is not very secure to give input to Picard mark duplicates as you pointed out. Even for these jobs ended without error, the file sizes are not matching very well. In my experience, .bam with pipe are slightly larger than .bam without pipe, I do not know why, even if they are identical when look at their sam formats.

For the failed 5 jobs, in their log file, it is always mentioned:

"Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fd3ac000000, 20854079488, 0) failed; error='Cannot allocate memory' (errno=12)
[fputs] Broken pipe
"

I assigned 55G to each job, but I think Picard always takes all memory available at that time. So when another job tries to take 55G, there is not enough memory, I think.

I decided to give input to picard markduplicates from .bam and .bai files, not with pipe.

I am also using Picard SortSam, and AddReadGroups. Can I use pipes when I give these tools input? I feel these tools do not have problem with pipes, because they do not need to keep full data in memory.

Thank you so much John.

Best.

ADD REPLY
0
Entering edit mode

Heheh, you checked the output differences with/without pipes - distrust of the tools makes you a good bioinformatician :) Unfortunately, it is bad luck that this time it actually creates different results - that shouldn't happen. Hm.

So what I would do is run something like this:

diff --side-by-side --suppress-common-lines <(samtools view with_pipe.bam) <(samtools view without_pipe.bam)

That will give you a list of rows that are different between the two BAM files, without having to write everything to disk as SAM. If it completes without showing anything, the BAM files are same in content, but different in encoding only.

ADD REPLY
0
Entering edit mode

Dear John,

Thanks for the suggestion for comparing 2 bams.

I am running BWA and Picard again. This time no pipe for input to Picard markduplicates. But I am using pipe from BWA output to Picard SortSam, also from Picard MarkDuplicates to Picard AddOrReplaceReadGroups. Would you think piping in this 2 occasions should have no issue with memories? To me it seems to be fine. SortSam and AddOrReplaceReadGroups should be able to process what entered in pipe and release results after processing. These 2 tools do not need to hold whole data for a sample in buffer. Am I right?

ADD REPLY

Login before adding your answer.

Traffic: 1972 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6