I would like use bwa mem to identify reads that align perfectly or almost perfectly with the reference genome over the entire read. I have tried:
bwa mem -B 40 -O 60 -E 10 H2007/genome.fa Left.fq Right.fq > stringent.sam.
This gives nearly perfect aligned regions, but the result includes many reads that align over only an internal contiguous region with their ends flapping around in the breeze. I would like set bwa mem to exclude these.
I have also played with the -T and -k flags and spent many minutes googling and reading, but no joy.
I would prefer not to have to resort to unix scripts to address this.
Thank you,
Jon
Have you tried increasing the penalty for end clipping (-L)?
Thanks for the tips! I increased the
-L
clipping penalty like so:I got rid of unmapped and supplemental alignments like so:
This helped but did not entirely eliminate partial alignments. Counting CIGARs = 101M (my reads are 101nt):
Default -L [5]: 70.1% full length alignments
For -L100: 77.4% full length alignments
It looks like I can keep raising
-L
to clear more partial alignments. The clipping penalty is subtracted from the score. Does anyone know where bwa's alignment score appears in the sam file? It does not seem to be theAS:i:xxx
field, which seems to be just the number of matches from the CIGAR field. Maybe I'll end up going the unix route, but the more manipulations the more chance for errors.Thanks again,
Jon
I -L is a good advise. Alternatively you can identify partially mapped reads in SAM by cigar. Partial hits start and/or end "S"/"H" (depending on your clipping settings). E.g. "10S75M15S" could be a perfect match of 75bp with 10 bp in front and 15bp at the end unaligned.
Hi Jon,
Since you are looking for global rather than local alignments, you might find that you have better control with a global aligner such as BBMap. In this case, you could achieve the result you want with a command something like this:
Hanging off the end of a scaffold is penalized by BBMap, not as much as mismatches, but about half as much - so those alignments won't show up. If you only want pairs where both are mapped you can add the "po" (pairedonly) flag. "outm" will only capture the mapped reads (keeping pairs together).
If you really only want perfect matches, then add the flag "perfectmode" instead of "minid=0.95 maxindel=1". "perfect" is defined by BBMap as not going off the end of a contig. "semiperfectmode" on the other hand allows going off the end but does not allow any mismatches.
Thanks all, I am currently testing
bwa mem ...-L 10...
(default is 5). The hard clipping filterbwa mem ...-H...
does not do anything at all. Just gives back the command prompt (bwa_0.7.10).A question about using the CIGAR field for filtering. Can CIGAR filters be applied directly from within samtools? I don't think so. I can filter the CIGAR field on
-v '\b[0-9]+S|H\b'
or'\b101M\b'
using unix (I happen to have 101nt reads), but I think this will disrupt pairs when one but not the other is fully aligned. For subsequent analysis steps (e.g. IDBA_ud )I would like to keep pairs together.