After gathering some comments about this in the previous post, I would like to know where people think this feature could best fit into:
Having a BAM file 'short-A' from short reads against the thousands of assembled contigs of genome A, one wants to project the entries in BAM file 'short-A' onto the chromosomes in genome B. Genome A and genome B are too divergent to directly align the short reads from A straight to genome B, but one can align most of the thousands of assembled contigs from genome A to genome B, producing a 'contigs-A' file against genome B. This method would place each of the millions of short reads in my 'short-A' BAM file in the coordinates for genome B using the 'contigs-A' alignments as the guide, by projecting each position in the cigar lines of the coordinate system of one alignment set to the other.
For what I've googled around, I think this could go into:
- Samtools
- Bamtools
- any other option?
It would look something like:
coordinateprojection -a readsA-contigsA.bam -b contigsA-chrsB.bam -o readsA-chrsB.bam
Just as an example dataset, here is the mouse readsets used for contig assembly piled up against the mouse contigs -- here just the whole gene body of mouse PAX2, PAX5 and PAX8. Also, the mouse assembled contigs aligned to human using lastz with chaining:
ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/readsMousecontigsMouse.bam ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/readsMousecontigsMouse.bam.bai
ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/contigsMousechrsHuman.bam ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/contigsMousechrsHuman.bam.bai
And this would be run into something like:
coordinateprojection -a readsMousecontigsMouse.bam -b contigsMousechrsHuman.bam -o readsMousechrsHuman.bam
Again, this is not bed-like information but actual nucleotide-by-nucleotide alignment coordinates, so it would involve the projection of each of the cigarlines position by position, including insertions and deletions.
Looking forward to suggestions.
My idea right now is to add this feature as a patch against samtools, since it would be easy to make use of the ftp/http fetching capabilities in it. Given that db providers like Ensembl/UCSC could easily produce speciesAspeciesB.bam ftp dumps from their pre-computed genomic alignments, this would make the projection method very simple and straightforward to the users.
Surely this is something you should write yourself? It's rather niche.
can you add a link to the previous discussion?
@Aaron Statham: thanks for the comment. My idea right now is to add this feature as a patch against samtools, since it would be useful to make use of the ftp/http fetching abilities in it. Given that db providers like Ensembl/UCSC could easily produce speciesAspeciesB.bam ftp dumps from their pre-computed genomic alignments, this would make the projection method very simple and straightforward to the users.
@Aaron Statham, thanks for the comment. My idea right now is to add this feature as a patch against samtools, since it would be easy to make use of the ftp/http fetching capabilities in it. Given that db providers like Ensembl/UCSC could easily produce speciesAspeciesB.bam ftp dumps from their pre-computed genomic alignments, this would make the projection method very simple and straightforward to the users.
I think the most difficult point in this scenario is the treatment of gaps and mismatches in the alignments between the genomes. How should e.g. an alignment mouse to human like 100M20I/D100M (>90% identity) be treated? If you used the tool to project coverage from mouse aligned reads to human chromosomes, which coverage should be assigned to the gaps? Is it possible to come up with a sensible definition?