Since I read your question for the first time, the feeling was bothering me that the approach you want to take is invalid. I have been thinking about a solution since then and now see the need to supply an answer. However, that might not be the answer you would like to have, some critical points also due to missing information in your question.
In one sentence: you will make up (fake) read alignments, indirectly claiming that reads map in a position of genome B, while they in fact don't align there and could only be aligned to genome A.
First, I have to question your statement:
Genome A and genome B are too
different to align the short reads
from A straight to genome B, but close
enough that I can align most of the
thousands of assembled contigs from
genome A to genome B
How can you possibly claim that with only trying bwa? And what is the real distance between those genomes, what is the sequence identity, and what is the degree of genome rearrangements/syntheny?
One of the core problems of your approach is using bwa to align the reads. bwa has low sensitivity, especially when it comes to more different sequences. So, try a more sensitive aligner first to map the reads directly, eg. bfast, blat, mosaic, novoalign, lastz, or try LAST (thanx to darked's comment).
Why is the mapping approach invalid? There are two approaches I could think of:
-
globally align whole contigs of genome A against Genome B (pretty much treating it like a large short read), yielding one mapping coordinate, ambiguity asside, for the contig of A in genome B.
-
locally align the contigs such that partial contigs of A, with minimum sequence identity cut-off, within a contig can be aligned to a region in B, yielding possibly a set of genomic coordinates in B per contig.
Both approaches have in common that they don't deal with ambiguities, and that is a big deal.
Then, I claim that approach 1 in invalid and will create bogus mapped coordinates of reads, assigning them to regions where definitely do not align.
Only one example of what could go wrong: Imagine the following situation, a contig of genome A, consisting of three regions R1, R2, R3 of high sequence conservation (think of that for example as three highly conserved genes). All these regions are present in genome B as well (R1', R2', R3'), but due to a rearrangement, R2' is replaced by R4' and relocated to a different position. Now, following your relatively low sequence-identity threshold R1,R2,R3 -> R1',R4',R3'. Now for genome A, you got a lot of reads for R2, which now will be assigned to R1',R4',R3' but they will be assigned to region R4', where they definitely do not belong to.
Based on the above, the second approach would be much better, that would be best accomplished by a bidirectional best blast hit between the genomes (with eg. identity cut-off 90%), and then mapping those regions. Would also need to treat ambiguities somehow, and again bwa is not the tool of choice.
However, if stretches of high sequence similarity existed, then the reads would possibly also align there directly.
Conclusion: Before thinking about how to technically perform an analysis, we have to think more about the validity of the approach. Lacking some important pieces of information, the above couldn't be the final word there. I consider mapping read alignments between distant genomes problematic. The risk is to build up the impression (or tricking yourself into believing) that there exist alignments where in fact they couldn't be aligned, and that ambiguities go unnoticed. Only alignments that can be found and proven are valid alignments, if your reads don't align, yes, then they don't align.
Try more sensitive tools to align reads directly.
Why not re-align the reads to the new reference?
What do you use for genome A vs genome B alignment? What is the format of the output? Not sure about your genome, but often one can get MAF file where multiple regions of A map to multiple regions of B.
what do you mean by 'projection of coordinates' to the new contigs and in what way is that different than mapping the reads to the new contigs?
this would be easier if there were clear rules about dealing with ambiguities - multiple hits and converging hits
Genome A and genome B are too different to re-align the short reads from A straight to genome B.
@Jeremy Leipzig: Supposing one has already dealt with most of them using "bwa aln" to map the short reads and "bwa bwasw" to map the contigs, I would be happy to just copy the whole lot on the other coordinates.
@avilella, if you can write out the specifics of how you want this method to work then I can probably implement it for you (most likely in R) and get this question answered within 6 days.
I've now added an example dataset using lastz for mouse contigs against human chromosomes. From the lastz soft sam I created a bam. In parallel, I created the bam pileup of the mouse reads into the mouse contigs.