My bam has a lot of bad reads that cause it to fail the GATK. I would like to remove them. How can I programmatically remove the reads identified by ValidateSamFile as causing errors?
Assuming you are happy to discard the failed reads rather than correct them, you could set the MAX_OUTPUT option to a large value so that you get the complete list of failed records. If I'm not mistaken, the output gives each record's position in the file, like this (example from here):
ERROR: Record 1, Read name 20FU...
ERROR: Record 3, Read name 20FU...
ERROR: Record 6, Read name 20GA...
Then pass through the file again and discard the failing records. This may require writing a little script that parses the output of ValidateSamFile to get the record numbers to discard (1, 3, 6, ... in the example above) and then reads the bam file and writes it back out, excluding the records at those positions. (Maybe there is an off-the-shelf tool for all this...)
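A minimal sketch of such a script in Python, under a few assumptions: error lines contain "Record N, Read name" as in the example above, record numbers are counted from 1 in file order, and the third-party pysam library is available to read and write the bam. The function names are made up for illustration, and whether pysam can read every malformed record is untested.

```python
import re

# Matches ERROR lines such as "ERROR: Record 3, Read name 20FU...".
# WARNING lines and errors that are not tied to a record are ignored.
_ERROR_RECORD = re.compile(r"^ERROR\b.*?Record (\d+), Read name")


def failed_record_numbers(report_lines):
    """Return the set of record numbers reported on ERROR lines."""
    failed = set()
    for line in report_lines:
        match = _ERROR_RECORD.match(line.strip())
        if match:
            failed.add(int(match.group(1)))
    return failed


def drop_failed(records, failed, first_index=1):
    """Yield records whose position (counted from first_index) is not in failed."""
    for number, record in enumerate(records, start=first_index):
        if number not in failed:
            yield record


def filter_bam(in_path, out_path, failed):
    """Copy in_path to out_path, skipping the failed records (needs pysam)."""
    import pysam  # third-party; imported here so the helpers above work without it

    with pysam.AlignmentFile(in_path, "rb") as src, \
            pysam.AlignmentFile(out_path, "wb", template=src) as dst:
        # until_eof=True iterates in file order and includes unmapped reads
        for read in drop_failed(src.fetch(until_eof=True), failed):
            dst.write(read)
```

Usage would be along the lines of `filter_bam("input.bam", "clean.bam", failed_record_numbers(open("validate.txt")))`, where the file names are placeholders and validate.txt holds the saved ValidateSamFile output. It is worth running ValidateSamFile again on the result to check that the numbering assumption held.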
If you have paired-end reads, discarding records may leave some reads without a mate, which in turn means the bam file is still invalid. I'm not sure whether samtools fixmate can fix that.
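One way to deal with orphaned mates yourself is to keep only reads whose name occurs exactly twice. The sketch below is Python and assumes that every read is paired, that there are no secondary or supplementary alignments, and that the records fit in memory; the function name and the (name, payload) tuples are purely illustrative.

```python
from collections import Counter


def drop_orphans(records, name_of=lambda record: record[0]):
    """Return the records whose read name occurs exactly twice (one per mate)."""
    records = list(records)  # two passes are needed: count names, then filter
    counts = Counter(name_of(record) for record in records)
    return [record for record in records if counts[name_of(record)] == 2]
```

With pysam reads, `name_of=lambda read: read.query_name` would be the natural choice. Note that names occurring more than twice are dropped as well.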
But again, in practice it may be easier and better to recreate the bam files without broken records in the first place...
Using samjdk (http://lindenb.github.io/jvarkit/SamJdk.html), keep only the records that pass htsjdk's own validation:
java -jar samjdk.jar -e 'List<SAMValidationError> errors = record.isValid(false);return (errors==null || errors.isEmpty());' input.bam
Or you can ask GATK to be lenient with errors. I think the option is -S LENIENT.
Hi Pierre, this is great - can you give an example of what <SAMValidationError> is supposed to look like? And can this tool also remove the mate of a read that is failing?
Also, with regards to another question, could one use this tool to remove reads where the read ID occurs more than twice? I have some legacy bams with bad formatting I am trying to work with. Thanks!
SAMValidationError (as in List<SAMValidationError>) is not a placeholder but a concrete Java class: https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/SAMValidationError.java
Please ask this as a new question, after searching Biostars to check whether it has been asked before.