Question

Running Tophat2 On Two Joined Genomes

1

Entering edit mode

11.6 years ago

Federico Giorgi ▴ 740

Dear all,

we are trying to combine two major eukaryotes' genomes into a single bowtie2 index, in order to align reads (from hetero-binucleated cells) onto a single reference. We need to reference because, being the species not so distantly related, we want to eliminate the bias of reads from one nucleus aligning on both genomes, while in fact the best alignment for most of them is only one. One solution is to post-process the two BAMs checking for the highest Q for each read (discarding or reweighting the ones still ambiguously mapped). But the cleanest and less biased solution would be to align all of them at once versus the joined genomes.

Sadly, the size of the two genomes in our case hits the bowtie2-build size limit

Error: Reference sequence has more than 2^32-1 characters!  Please divide the
reference into batches or chunks of about 3.6 billion characters or less each
and index each independently.

We can build two separate indeces, but then we would like to align the reads at the same time on both of them using Tophat2. Is it possible to do so? Do you think there can be another solution for our problem. E.g. run two separate tophat2, then join the BAMs in a way that keeps only the best alignment(s)?

Thanks a lot for your solutions/opinions :-)

tophat2 rnaseq bowtie bowtie2 • 2.4k views

ADD COMMENT • link updated 11.6 years ago by DHaeseleer ▴ 10 • written 11.6 years ago by Federico Giorgi ▴ 740

0

Entering edit mode

Can you elaborate little on what you are looking for and why you need the second reference genome.

ADD REPLY • link 11.6 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Done. Essentially, we want to assign the correct read to the correct genome of origin in the most possible unbiased way (same run alignment).

ADD REPLY • link 11.6 years ago by Federico Giorgi ▴ 740

1

Entering edit mode

As DHaeseleer has answered below if you can index everything together that would be great. Otherwise, comparing the read alignment to different genomes is also a feasible option. As DHaeseleer said that MAPQ is not a good option to compare the alignment, I would compare number of mismatches when the same read was aligned to different genomes.

It is how it will be done:

BWA will give you BAM files sorted by query name. This order will be same as fastq file in default settings and will be same for all the BAMs file (no matter which genome used) . You can write a code that can read either one (single end library) or two rows (paired end) from both the SAM file and compare number of mismatches and then assign them to one genome depending on the mismatches.

ADD REPLY • link 11.6 years ago by Ashutosh Pandey 12k

score 1 · Answer 1 · 2013-04-19

The idea of merging the two genomes is theoretically the best one, as it will remove at once all alignment biases. Joining the BAMs afterwards (keeping the highest Q per read/or per pair) is potentially dangerous, as the Q scores can be calculated differently given different reference features. The only solution I can think of is to use BWA on spliced versions of the genes from the two genomes (i.e. cDNA) as BWA supports high size genomes since version 0.6

P