I repeatedly come accross "MAPQ should be 0 for unmapped read." issue. Yes, I can set stringency as lenient in Picard and thus avoid evaluating this as an error, but still - I use another program and it is here again.
Is anybody aware of some tool for fixing this problem for bam files generated using BWA?
I have found things like CleanSam.jar, FixMateInformation and others, but none helped. I do not want to supress error, I would like to fix it:)
I put a comment in the other answer here. The issue is that there are unmapped reads (that is, with the unmapped read flag set) that have non-zero mapping qualities.
You could fix it by removing all unmapped reads, but that's heavy-handed and no good if you want to use them later for other analyses. It will fix the error, though. The command to do that is:
samtools view -bF 4 raw.bam > filtered.bam
The most correct way to fix it would be a short script that goes through the SAM/BAM and sets the MAPQ to 0 if the unmapped flag is set. I'll leave that as an exercise...
just to clarify this point, I see on the samtools docs that "-F INT Skip alignments with bits present in INT [0]", so if we use "-F 4" this will be removing the unmapped reads? also, I've always considered MAPQ=0 as mapping errors due to multiple possible genome locations. if so, would it be acceptable to use the following command to cover all possibilities?
Yes, that would do what you want (exclude multi-mappers and unmapped reads). Excluding unmapped reads is sort of a strange thing, in that most tools should do the right thing when you give them unmapped reads. But filtering them can help if you have strict tools like Picard that will complain otherwise.
To me, filtering multi-mapping reads is for another purpose (but equally valid and necessary in some situations). That is, specific downstream analyses, not just avoiding Picard complaints.
sure, that's the first step we do before starting the whole "best practices" downstream analysis suggested by the GATK team. in our alignment BAMs we weren't getting the unmapped reads, so that's why we never came across this "-F 4" option, but since our final goal is to detect variants we thing that this initial step to remove anything not useful (multi-mapping and unmapped reads fit that description in our opinion) does help to slightly reduce the computer resources needed (at least it reduces the size of the BAM file).
Just ignore the warning. The actual fix would be to repair Picard, because that warning is simply wrong. Nothing in the SAM spec says that unmapped reads should have a MAPQ of 0. If a read maps unambigously to a nonsensical position, it should have non-vanishing MAPQ and the 'unmapped' flag set.
(On the other hand, I totally understand why Picard complains, it's perfectly reasonable. It still doesn't match the spec... therefore the SAM spec, especially the "best practices" section, is unreasonable. But that's a different question...)
+1 for letting me know that -h is not needed when binary format is on. thanks a lot for the correction. I'll write my coment directly on your question then.
I chose to fix this myself. Just like you, I did not find any tool that fixes all the Picard complaints.
So what I did is to write a wrapper around the bwa sampe call and I intercept all the pairs coming out and fix flags and tags on the fly.
The only way you can know whether your read is unmapped or not is to check the corresponding flag bit. Once you caught the info, updates parts of the SAM alignment that need to be updated. And moreover, if you are dealing with paired-end experiments, as you are holding the alignment of the pair at the very same time, you can even update the mate information as well in the mate alignment & vice-versa.
Depending on the number of checks you want to do and your piece of code, it can increase (double sometimes in my case) the computation time, but at the end of the process, you have a clean file.
I was in the same case with a bamset files mapped with bwa. Lot of "MAPQ should be 0 for unmapped read." warnings come during GATK process. I solve this problem by using picard CleanSam tool. As it is write on the CleanSam manual:
$ picard-tools CleanSam
ERROR: Option 'INPUT' is required.
USAGE: CleanSam [options]
Documentation: http://picard.sourceforge.net/command-line-overview.shtml#CleanSam
Read SAM and perform various fix-ups. Currently, the only fix-ups are 1: to soft-clip an alignment that hangs off the end of its reference sequence; and 2: to set MAPQ to 0 if a read is unmapped.
Version: 1.105()
I notice you have tried this solution but it not resolve your problem. This is curious, maybe a different version.
just to clarify this point, I see on the samtools docs that "-F INT Skip alignments with bits present in INT [0]", so if we use "-F 4" this will be removing the unmapped reads? also, I've always considered MAPQ=0 as mapping errors due to multiple possible genome locations. if so, would it be acceptable to use the following command to cover all possibilities?
Yes, that would do what you want (exclude multi-mappers and unmapped reads). Excluding unmapped reads is sort of a strange thing, in that most tools should do the right thing when you give them unmapped reads. But filtering them can help if you have strict tools like Picard that will complain otherwise.
To me, filtering multi-mapping reads is for another purpose (but equally valid and necessary in some situations). That is, specific downstream analyses, not just avoiding Picard complaints.
sure, that's the first step we do before starting the whole "best practices" downstream analysis suggested by the GATK team. in our alignment BAMs we weren't getting the unmapped reads, so that's why we never came across this "-F 4" option, but since our final goal is to detect variants we thing that this initial step to remove anything not useful (multi-mapping and unmapped reads fit that description in our opinion) does help to slightly reduce the computer resources needed (at least it reduces the size of the BAM file).