Hi!
I'm trying to build an efficient pipeline for processing amplicon sequencing data. The problem is that ValidateSamFile reports a number of errors in the BAM files after running BamClipper (the BAMs were free of errors before). Example output of ValidateSamFile (MODE=SUMMARY):
HISTOGRAM java.lang.String
Error Type Count
ERROR:INVALID_FLAG_SUPPLEMENTARY_ALIGNMENT 138
ERROR:INVALID_MAPPING_QUALITY 315
ERROR:MISMATCH_FLAG_MATE_UNMAPPED 217
ERROR:MISMATCH_MATE_ALIGNMENT_START 8775
ERROR:MISMATCH_MATE_CIGAR_STRING 2385125
WARNING:MISSING_TAG_NM 2387464
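To compare summaries like the one above before and after a cleanup step, the counts can be parsed into a dictionary. This is only a rough sketch: the function names `parse_validation_summary` and `diff_summaries` are made up for illustration, and it assumes the summary text has the `SEVERITY:TYPE count` layout shown above.

```python
import re

# Matches lines such as "ERROR:MISMATCH_MATE_CIGAR_STRING 2385125".
_SUMMARY_LINE = re.compile(r"^(ERROR|WARNING):(\S+)\s+(\d+)$")


def parse_validation_summary(text):
    """Return {(severity, error_type): count} from MODE=SUMMARY text."""
    counts = {}
    for line in text.splitlines():
        match = _SUMMARY_LINE.match(line.strip())
        if match:  # header lines and blank lines are skipped
            severity, error_type, count = match.groups()
            counts[(severity, error_type)] = int(count)
    return counts


def diff_summaries(before, after):
    """Return {key: (count_before, count_after)} for keys whose counts differ."""
    changed = {}
    for key in sorted(set(before) | set(after)):
        pair = (before.get(key, 0), after.get(key, 0))
        if pair[0] != pair[1]:
            changed[key] = pair
    return changed


# The summary posted above, copied verbatim.
SUMMARY = """HISTOGRAM java.lang.String
Error Type Count
ERROR:INVALID_FLAG_SUPPLEMENTARY_ALIGNMENT 138
ERROR:INVALID_MAPPING_QUALITY 315
ERROR:MISMATCH_FLAG_MATE_UNMAPPED 217
ERROR:MISMATCH_MATE_ALIGNMENT_START 8775
ERROR:MISMATCH_MATE_CIGAR_STRING 2385125
WARNING:MISSING_TAG_NM 2387464
"""

after_clipping = parse_validation_summary(SUMMARY)
```

Calling `diff_summaries({}, after_clipping)` lists everything that appeared after clipping, since the BAMs validated cleanly beforehand.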
I've read that MergeBamAlignment is a powerful tool for cleaning BAM files while preserving the original read information and base quality scores. So I decided to incorporate GATK tutorial #6484 into my analysis pipeline to get rid of the errors.
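For readers who have not used the tool, here is a minimal sketch of how a Picard MergeBamAlignment call could be assembled from Python. The helper name and all file names are hypothetical, and the options shown are a small subset from Picard's documentation, not necessarily the ones used in this pipeline. Picard expects the unmapped BAM to be queryname-sorted.

```python
def merge_bam_alignment_command(aligned_bam, unmapped_bam, reference_fasta,
                                output_bam, picard_jar="picard.jar"):
    """Build the argument list for a Picard MergeBamAlignment run.

    The paths are placeholders supplied by the caller; nothing is executed here.
    """
    return [
        "java", "-jar", picard_jar, "MergeBamAlignment",
        f"ALIGNED_BAM={aligned_bam}",           # e.g. the clipped, aligned BAM
        f"UNMAPPED_BAM={unmapped_bam}",         # original reads as an unmapped BAM
        f"REFERENCE_SEQUENCE={reference_fasta}",
        f"OUTPUT={output_bam}",
        "ADD_MATE_CIGAR=true",                  # write the MC tag on paired reads
        "CREATE_INDEX=true",
    ]


# Hypothetical file names, for illustration only.
command = merge_bam_alignment_command(
    "clipped.bam", "unmapped.bam", "reference.fasta", "merged.bam"
)
# To actually run it: subprocess.run(command, check=True)
```

Building the command as a list keeps each option visible and avoids shell quoting problems when paths contain spaces.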
I just want to ask the community's opinion about the following workflow:
I could have missed something. Any critical thoughts are welcome.
If I am reading the flow diagram right, why are you adding the unaligned BAM data back into the final BAM? Isn't that duplicating many reads (an aligned copy and an original copy of each)?
GATK claims that
I see. Have you compared the merged BAM with the aligned BAM to see what MergeBamAlignment did?
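One rough way to make that comparison, assuming both BAMs have been converted to SAM text (for example with `samtools view -h`), is to match primary records by read name and mate number and count which fields and tags differ. The function names below are made up for illustration, and secondary and supplementary records are skipped to keep the keys unique.

```python
from collections import Counter


def load_primary_records(sam_lines):
    """Index primary SAM records by (read name, mate number)."""
    records = {}
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:
            continue  # 0x100 secondary, 0x800 supplementary
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        records[(fields[0], mate)] = {
            "FLAG": flag,
            "POS": fields[3],
            "MAPQ": fields[4],
            "CIGAR": fields[5],
            "tags": {tag[:2] for tag in fields[11:]},
        }
    return records


def summarize_changes(aligned, merged):
    """Count field changes and tag additions/removals between two record sets."""
    changes = Counter()
    for key in aligned.keys() & merged.keys():
        before, after = aligned[key], merged[key]
        for name in ("FLAG", "POS", "MAPQ", "CIGAR"):
            if before[name] != after[name]:
                changes[name + " changed"] += 1
        for tag in after["tags"] - before["tags"]:
            changes["tag added: " + tag] += 1
        for tag in before["tags"] - after["tags"]:
            changes["tag removed: " + tag] += 1
    for key in aligned.keys() - merged.keys():
        changes["only in aligned"] += 1
    for key in merged.keys() - aligned.keys():
        changes["only in merged"] += 1
    return changes
```

With two SAM files on disk, `summarize_changes(load_primary_records(open(a)), load_primary_records(open(b)))` gives a quick tally of what the merge step altered; records counted as "only in merged" would be the reads carried over from the unaligned BAM.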