Coments On This Feature: Alignment Coordinate Projection From 1.Bam To 2.Bam To Create 3.Bam?
2
5
Entering edit mode
13.4 years ago

After gathering some comments about this in the previous post, I would like to know where people think this feature could best fit into:

Having a BAM file 'short-A' from short reads against the thousands of assembled contigs of genome A, one wants to project the entries in BAM file 'short-A' onto the chromosomes in genome B. Genome A and genome B are too divergent to directly align the short reads from A straight to genome B, but one can align most of the thousands of assembled contigs from genome A to genome B, producing a 'contigs-A' file against genome B. This method would place each of the millions of short reads in my 'short-A' BAM file in the coordinates for genome B using the 'contigs-A' alignments as the guide, by projecting each position in the cigar lines of the coordinate system of one alignment set to the other.

For what I've googled around, I think this could go into:

  • Samtools
  • Bamtools
  • any other option?

It would look something like:

coordinateprojection -a readsA-contigsA.bam -b contigsA-chrsB.bam -o readsA-chrsB.bam

Just as an example dataset, here is the mouse readsets used for contig assembly piled up against the mouse contigs -- here just the whole gene body of mouse PAX2, PAX5 and PAX8. Also, the mouse assembled contigs aligned to human using lastz with chaining:

ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/readsMousecontigsMouse.bam ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/readsMousecontigsMouse.bam.bai

ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/contigsMousechrsHuman.bam ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/contigsMousechrsHuman.bam.bai

And this would be run into something like:

coordinateprojection -a readsMousecontigsMouse.bam -b contigsMousechrsHuman.bam -o readsMousechrsHuman.bam

Again, this is not bed-like information but actual nucleotide-by-nucleotide alignment coordinates, so it would involve the projection of each of the cigarlines position by position, including insertions and deletions.

Looking forward to suggestions.

My idea right now is to add this feature as a patch against samtools, since it would be easy to make use of the ftp/http fetching capabilities in it. Given that db providers like Ensembl/UCSC could easily produce speciesAspeciesB.bam ftp dumps from their pre-computed genomic alignments, this would make the projection method very simple and straightforward to the users.

bam samtools liftover • 4.0k views
ADD COMMENT
2
Entering edit mode

Surely this is something you should write yourself? It's rather niche.

ADD REPLY
0
Entering edit mode

can you add a link to the previous discussion?

ADD REPLY
0
Entering edit mode

@Aaron Statham: thanks for the comment. My idea right now is to add this feature as a patch against samtools, since it would be useful to make use of the ftp/http fetching abilities in it. Given that db providers like Ensembl/UCSC could easily produce speciesAspeciesB.bam ftp dumps from their pre-computed genomic alignments, this would make the projection method very simple and straightforward to the users.

ADD REPLY
0
Entering edit mode

@Aaron Statham, thanks for the comment. My idea right now is to add this feature as a patch against samtools, since it would be easy to make use of the ftp/http fetching capabilities in it. Given that db providers like Ensembl/UCSC could easily produce speciesAspeciesB.bam ftp dumps from their pre-computed genomic alignments, this would make the projection method very simple and straightforward to the users.

ADD REPLY
0
Entering edit mode

I think the most difficult point in this scenario is the treatment of gaps and mismatches in the alignments between the genomes. How should e.g. an alignment mouse to human like 100M20I/D100M (>90% identity) be treated? If you used the tool to project coverage from mouse aligned reads to human chromosomes, which coverage should be assigned to the gaps? Is it possible to come up with a sensible definition?

ADD REPLY
6
Entering edit mode
13.4 years ago

Perhaps I am being naive, but it seems that this is exactly the sort of thing that the UCSC liftover tool is meant to handle. Liftover is not restricted to different genome builds of the same species. It can be used to project mouse alignments to human; the projections are limited to the species for which the UCSC folks have already generated "chain" files.

So, a basic recipe for your problem using a mix of UCSC liftover and BEDTools might be:

  1. Convert your 'short-A' BAM file to BED format (bamToBed).
  2. Use the liftover tool to project these BED features to genomeB.
  3. Convert the project BED features to BAM (bedToBam) format.
  4. Sort and index the projected BAM.
ADD COMMENT
2
Entering edit mode

ah right, that level of detail would certainly be lost.

ADD REPLY
2
Entering edit mode

I suspect this is a case where you will need to write your own tool...

ADD REPLY
0
Entering edit mode

I don't think converting to bed would allow me to project the cigarlines position by position, including insertions and deletions.

ADD REPLY
0
Entering edit mode

I've edited my question, I think it may be clearer now. Thx!

ADD REPLY
0
Entering edit mode

I kind of feel this should be in samtools or bamtools or one of these, so I want to make sure it makes sense as a feature overall.

ADD REPLY
0
Entering edit mode

@aaronQuinlan, I've rewritten my question as a request for comments.

ADD REPLY
4
Entering edit mode
13.4 years ago

This would probably be best as a stand-alone tool for now. Evolve it to become stable and applicable. You can later make a Github pull request offering project-coordinates to be embedded into a toolset that will be appropriate at the time.

Bamtools and Samtools are both MIT licensed so you may want to start with that license too so that you can merge code any time with little or no discussion.

ADD COMMENT

Login before adding your answer.

Traffic: 2796 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6