BBMap Best Possible Mapping Output and Ambiguous Meaning
1
0
Entering edit mode
22 months ago

Hi folks,

I am trying to map reads to a set of metagenomes where I know that certain genes are duplicated but I cannot remove some of the duplication even with dedupe.sh. In order to improve detection and accurately represent my mapping, I want to find all sites where each read maps at the highest mapping quality. In other words, if the read mapped to a location A and B at MAPQ=30 and mapped to location C at MAPQ=25, I'd want to only report A and B mapping.

From my understanding, the BBMap ambiguous=all parameter would output A, B, and C mapping (as long as they are all in the clearing zone). Would ambiguous=best be a better fit for my purpose?

Please let me know if there is a better option for this!

Thanks!

BBMap Bushnell Brian GenoMax • 1.1k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode
22 months ago
GenoMax 145k

I think of a couple of options. If you can live with A or B mapping then use ambig=random. That would select one site randomly from all the "best" sites. But if you need to keep both A and B then it may be best to try and filter your BAM using reformat.sh afterwards

minmapq=-1              If non-negative, toss reads with mapq under this.

By default bbmap does not show secondary alignments. Not sure if you need to consider those.

but I cannot remove some of the duplication even with dedupe.sh

You could use clumpify.sh to keep only unique reads (and keep their counts in the fastq header) before aligning them.

ADD COMMENT
0
Entering edit mode

Thanks for the feedback! Ideally, I'd like to keep both A and B. Given that each read would have a different distribution of MAPQ values, would picking a single MAPQ cutoff for all reads work?

The duplication is in the contigs and I already used dedupe.sh with default parameters but there are still duplicated and contained contigs in the final assembly. Is there any way I can resolve these duplications?

Thanks!

ADD REPLY
0
Entering edit mode

would picking a single MAPQ cutoff for all reads work?

Perhaps. You will need to experiment. Probably not be the perfect solution that you are looking for.

You can try using clumpify.sh on the contigs. It can look for containments so that may be able to address that part.

containment=f       Allow containments (where one sequence is shorter).

I assume you have also tried CD-HIT?

ADD REPLY

Login before adding your answer.

Traffic: 2731 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6