Dear all,
we are trying to combine two major eukaryotes' genomes into a single bowtie2 index, in order to align reads (from hetero-binucleated cells) onto a single reference. We need to reference because, being the species not so distantly related, we want to eliminate the bias of reads from one nucleus aligning on both genomes, while in fact the best alignment for most of them is only one. One solution is to post-process the two BAMs checking for the highest Q for each read (discarding or reweighting the ones still ambiguously mapped). But the cleanest and less biased solution would be to align all of them at once versus the joined genomes.
Sadly, the size of the two genomes in our case hits the bowtie2-build size limit
Error: Reference sequence has more than 2^32-1 characters! Please divide the
reference into batches or chunks of about 3.6 billion characters or less each
and index each independently.
We can build two separate indeces, but then we would like to align the reads at the same time on both of them using Tophat2. Is it possible to do so? Do you think there can be another solution for our problem. E.g. run two separate tophat2, then join the BAMs in a way that keeps only the best alignment(s)?
Thanks a lot for your solutions/opinions :-)
Can you elaborate little on what you are looking for and why you need the second reference genome.
Done. Essentially, we want to assign the correct read to the correct genome of origin in the most possible unbiased way (same run alignment).
As DHaeseleer has answered below if you can index everything together that would be great. Otherwise, comparing the read alignment to different genomes is also a feasible option. As DHaeseleer said that MAPQ is not a good option to compare the alignment, I would compare number of mismatches when the same read was aligned to different genomes.
It is how it will be done:
BWA will give you BAM files sorted by query name. This order will be same as fastq file in default settings and will be same for all the BAMs file (no matter which genome used) . You can write a code that can read either one (single end library) or two rows (paired end) from both the SAM file and compare number of mismatches and then assign them to one genome depending on the mismatches.