bwa mem perfectly align reads over their entire length
1
4
Entering edit mode
9.8 years ago
goldberg.jm ▴ 90

I would like use bwa mem to identify reads that align perfectly or almost perfectly with the reference genome over the entire read. I have tried:

bwa mem -B 40 -O 60 -E 10 H2007/genome.fa Left.fq Right.fq > stringent.sam.

This gives nearly perfect aligned regions, but the result includes many reads that align over only an internal contiguous region with their ends flapping around in the breeze. I would like set bwa mem to exclude these.

I have also played with the -T and -k flags and spent many minutes googling and reading, but no joy.

I would prefer not to have to resort to unix scripts to address this.

Thank you,

Jon

alignment • 13k views
ADD COMMENT
1
Entering edit mode

Have you tried increasing the penalty for end clipping (-L)?

-L INT[,INT] penalty for 5'- and 3'-end clipping [5,5]

ADD REPLY
2
Entering edit mode

Thanks for the tips! I increased the -L clipping penalty like so:

bwa mem -B 40 -O 60 -E 10 -L100 genome.fa Left.fq Right.fq > stringent.sam

I got rid of unmapped and supplemental alignments like so:

samtools view -b -f 2 -F 2316 stringent.bam > mapped_primary.bam

This helped but did not entirely eliminate partial alignments. Counting CIGARs = 101M (my reads are 101nt):

Default -L [5]: 70.1% full length alignments

For -L100: 77.4% full length alignments

It looks like I can keep raising -L to clear more partial alignments. The clipping penalty is subtracted from the score. Does anyone know where bwa's alignment score appears in the sam file? It does not seem to be the AS:i:xxx field, which seems to be just the number of matches from the CIGAR field. Maybe I'll end up going the unix route, but the more manipulations the more chance for errors.

Thanks again,
Jon

ADD REPLY
1
Entering edit mode

I -L is a good advise. Alternatively you can identify partially mapped reads in SAM by cigar. Partial hits start and/or end "S"/"H" (depending on your clipping settings). E.g. "10S75M15S" could be a perfect match of 75bp with 10 bp in front and 15bp at the end unaligned.

ADD REPLY
1
Entering edit mode

Hi Jon,

Since you are looking for global rather than local alignments, you might find that you have better control with a global aligner such as BBMap. In this case, you could achieve the result you want with a command something like this:

bbmap.sh ref=genome.fa in1=Left.fq in2=Right.fq minid=0.95 maxindel=1 outm=mapped.sam

Hanging off the end of a scaffold is penalized by BBMap, not as much as mismatches, but about half as much - so those alignments won't show up. If you only want pairs where both are mapped you can add the "po" (pairedonly) flag. "outm" will only capture the mapped reads (keeping pairs together).

If you really only want perfect matches, then add the flag "perfectmode" instead of "minid=0.95 maxindel=1". "perfect" is defined by BBMap as not going off the end of a contig. "semiperfectmode" on the other hand allows going off the end but does not allow any mismatches.

ADD REPLY
0
Entering edit mode

Thanks all, I am currently testing bwa mem ...-L 10... (default is 5). The hard clipping filter bwa mem ...-H... does not do anything at all. Just gives back the command prompt (bwa_0.7.10).

A question about using the CIGAR field for filtering. Can CIGAR filters be applied directly from within samtools? I don't think so. I can filter the CIGAR field on -v '\b[0-9]+S|H\b' or '\b101M\b' using unix (I happen to have 101nt reads), but I think this will disrupt pairs when one but not the other is fully aligned. For subsequent analysis steps (e.g. IDBA_ud )I would like to keep pairs together.

ADD REPLY
2
Entering edit mode
9.8 years ago
piet ★ 1.9k

the result includes many reads that align over only an internal contiguous region

It may be easier to remove those partial aligned reads from the SAM file instead of avoiding them altogether. Reads aligned only partially will have 'H' or 'S' in their CIGAR string (Hard or Soft clipped). It should not take much effort to filter them out.

ADD COMMENT

Login before adding your answer.

Traffic: 2480 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6