Hi all,
I was wondering if there is a tool that would allow merging BAM files from totally different references. (in case you wonder how could this possibly be useful: if you have a large reference that takes a long time to build, splitting it into e.g. 16 smaller chunks speeds things up dramatically).
The way I see it, 3 things need to be adjusted: 1) header, which is trivial; 2) flags and mapping qualities (single mapper can become a multi-mapper); 3) additional tags that track multi-mapping, like NH, HI, etc. The latter two are fairly hard to do effectively, so if a know tool exists, it would be nice to give it a try.
Any suggestions would be appreciated, as always.
-- Alex
Is the same dataset aligned to two references? If not, there would be no way to do this based on alignments in the files. Even if the same dataset is aligned to two references it would be a stretch to fiddle with the alignments to do the above.
May be best to bite the bullet and re-do the alignments with merged reference.
It's the exact same dataset, yes. The problem is, this is a part of a pipeline, and the reference is variable (e.g. 2-3k bacterial genomes). Indexing and mapping separately would have been wonderful, but the merging part seems complicated.
Ouch so this is 2-3K separate BAM files? Since these genomes will share short sequences the multi-mapping situation will get out of hand quickly if you want to merge all these files.
I was thinking ~ 16 chunks of few hundred genomes - this usually works nice and smooth. So, only 16 BAM files.
If you split the reads into chunks, and aligned to the same reference genome, that is fine.
But if my understanding is correct, you aligned the same reads to multiple reference genomes. How are you going to deal with the same read could have multiple primary alignment to multiple different genomes? Are you going to pick one based on mapping quality, and randomly break ties, and manually modify MAPQ?
I need to do read counting with featureCounts after, so, from what I realise, all I need is the correct NH tag.
Since you mention
featureCounts
it is not going to count multi-mapping reads by default. What kind of counting are you planning to do? Are you planning to count the multimappers (with-M
switch)? Which aligner are you using and does it mark multi-mappers in hundreds of locations (bbmap.sh
withambig=all
explicitly does that but may need to be tested with hundreds of genomes).Yep, I'm using
hisat2
with heavily modified presets, and-M -O --fraction
forfeatureCounts
. It's reporting alignments in many (up to 2k in my case) locations, and I even had to re-compilefeatureCounts
to stop it from rounding the counts because anything less than 1/100th was regarded as 0. All of that works OK, just need it to faster.