Hi, in my alignments by bwa mem I have found some spots where there is very high number of mismatches. I think these alignments are not correct and might stem from low quality reads or whatever.
My bam/sam files are generated by bwa mem and thus I decided to filter them based on the alignment quality and other parameters. First what I noticed is that the MAPQ score from sam spec https://samtools.github.io/hts-specs/SAMv1.pdf is set to 60 for all reads. Thus I assume that bwa mem does not use this parameter and thus I can not filter by MAPQ.
The parameters that bwa mem adds to the resulting bam/sam files are MD:Z:45C12T2A3T31T10C109T48 NM:i:7 AS:i:232 XS:i:19. I found their meaning on this page http://bio-bwa.sourceforge.net/bwa.shtml
- MD Mismatching positions/bases
- NM Edit distance
- AS Alignment score
- XS Suboptimal alignment score
I did not find any further documentation for these parameters. Do you know precisely how to understand each of them? Edit distance of indel is probably higher than edit distance of mismatch since if I set Edit distance too low I do not see any INDELs.
The best solution for my problem would be to control "Number of mismatches" in given read. According to http://bio-bwa.sourceforge.net/bwa.shtml this is in XM parameter. But this parameter is not present in my bam/sam files. Do you know how to add this parameter to bwa mem output or how to control output based on number of mismatches?
Thanks, Vojtech.
P.S: Filtering based on custom sam/bam parameters is not straigthforward. I have found this solution http://seqanswers.com/forums/showthread.php?t=24123 and corrected it to work partly but using awk and assuming fixed field number is not the best way to do these things. So I wonder if there is some kind of program that does filtering based on custom fields. Suggested pysam looks very low level for me, I would like some command line utility.
it's easy, using my tool https://github.com/lindenb/jvarkit/wiki/SamJS which has recently moved to picard FilterSamReads : https://broadinstitute.github.io/picard/command-line-overview.html
did you try to use the NM tag (edit distance ) ? see also How To Find Out The Mismatchs Of An Alignment Entry In The Sam File?
If this is really prevalent, I would trying GATK's IndelRealigner to see if it can fix things up a bit. While BWA tries to map each read on a read-by-read basis, with no consideration given to how many or well other reads mapped to that location, IndelRealigner tries to remap regions where many reads mis-match, indicative of an indel. Your data looks like a classic indel to me, since that is way too many mismatches for a modern sequencer.
Filtering the BAM should really be a last resort. Before that, you should see if your downstream tools even acknowledge these reads, or if they pass those tools built-in filters. Best to keep all your data in one place...
In the samtools spec I found MD tag description
I think its something I can use in mismatch filtering since if I parse this tag correctly, I can count number of mismatches quite well. My problem is twofold. First I did some softcliping of bam alignments not to contain primer sequences of amplicons our dept is using but I did not change MD tag. Thus I need to do that. Fortunately I can do that using command
samtools calmd -rb incorrectmd.bam ref.fa > correctmd.bam
-r is for adding another tag called BQ in which there is called Base Alignment Quality (BAQ) and described here http://samtools.sourceforge.net/mpileup.shtml