Question

Change Coordinates -A 1.Bam -B 2.Bam -O 3.Bam?

4

Entering edit mode

13.4 years ago

2184687-1231-83- ★ 5.1k

I've got a BAM file 'short-A' of my short reads against the thousands of assembled contigs of my genome A, and I now would like to project the entries in BAM file 'short-A' onto the chromosomes in genome B. Genome A and genome B are too different to align the short reads from A straight to genome B, but close enough that I can align most of the thousands of assembled contigs from genome A to genome B, producing a 'contigs-A' file against genome B. So I now want to place each of the millions of short reads in my 'short-A' BAM file in the coordinates for genome B using the 'contigs-A' alignments as the guide.

Is there a standard way of doing that?

I don't know if it makes sense to have this in samtools or somewhere else, but I imagine it could be something like:

coordinateprojection -a readsAcontigsA.bam -b contigsAchrsB.bam -o readsAchrsB.bam

Just as an example dataset, here is the mouse readsets used for assembly piled up against the mouse contigs (here just the whole gene body of mouse PAX2, PAX5 and PAX8), and the mouse assembled contigs aligned to human using lastz:

ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/readsMousecontigsMouse.bam ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/readsMousecontigsMouse.bam.bai

ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/contigsMousechrsHuman.bam ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/coordprojection/contigsMousechrsHuman.bam.bai

And I would like to have a tool that does something like:

coordinateprojection -a readsMousecontigsMouse.bam -b contigsMousechrsHuman.bam -o readsMousechrsHuman.bam

Cheers

bam liftover samtools • 5.8k views

ADD COMMENT • link updated 13.4 years ago by Darked89 4.7k • written 13.4 years ago by 2184687-1231-83- ★ 5.1k

3

Entering edit mode

Why not re-align the reads to the new reference?

ADD REPLY • link 13.4 years ago by Aleksandr Levchuk 3.2k

3

Entering edit mode

What do you use for genome A vs genome B alignment? What is the format of the output? Not sure about your genome, but often one can get MAF file where multiple regions of A map to multiple regions of B.

ADD REPLY • link 13.4 years ago by Darked89 4.7k

3

Entering edit mode

what do you mean by 'projection of coordinates' to the new contigs and in what way is that different than mapping the reads to the new contigs?

ADD REPLY • link 13.4 years ago by Istvan Albert 101k

1

Entering edit mode

this would be easier if there were clear rules about dealing with ambiguities - multiple hits and converging hits

ADD REPLY • link 13.4 years ago by Jeremy Leipzig 22k

0

Entering edit mode

Genome A and genome B are too different to re-align the short reads from A straight to genome B.

ADD REPLY • link 13.4 years ago by 2184687-1231-83- ★ 5.1k

0

Entering edit mode

@Jeremy Leipzig: Supposing one has already dealt with most of them using "bwa aln" to map the short reads and "bwa bwasw" to map the contigs, I would be happy to just copy the whole lot on the other coordinates.

ADD REPLY • link 13.4 years ago by 2184687-1231-83- ★ 5.1k

0

Entering edit mode

@avilella, if you can write out the specifics of how you want this method to work then I can probably implement it for you (most likely in R) and get this question answered within 6 days.

ADD REPLY • link 13.4 years ago by Aleksandr Levchuk 3.2k

0

Entering edit mode

I've now added an example dataset using lastz for mouse contigs against human chromosomes. From the lastz soft sam I created a bam. In parallel, I created the bam pileup of the mouse reads into the mouse contigs.

ADD REPLY • link 13.4 years ago by 2184687-1231-83- ★ 5.1k

score 4 · Answer 1 · 2011-07-11

UCSC has a tool called liftOver that may be of help. You'll have to create a .chain file that maps regions from one genome to the other. Here is an explanation of how to do this. Briefly, you use BLAT to find the regions of similarity that are used to make the alignments.

I'm not sure that this will do better than re-mapping the reads, but it seems to be what you are requesting.

After you have the chain file, then you can use liftOver on your BAM coordinates to map them to the new genome.

score 2 · Answer 2 · 2011-07-14

if I understood correctly your point, since remapping reads against genome B is not an option, then brentp's solution is the one I would follow: if you need to project positions from one genome to other the best available tool nowadays is UCSC's LiftOver. if this solution is not what you are looking for then I would go for a combined approach of what most of us think you should do (remapping) and what it seems you want to do (project coordinates).

if you already have your genome A assembled contigs, I would split the BAM file into several BAM files containing only the reads of each contig (1 BAM file per contig). then I would project each contig start and end position into genomeB's reference, generating several sections of the entire reference (1 fasta reference of genomeB per genomeA aligned contig). finally I would pair each contig files (genomeA BAM file and genomeB fasta reference) and remap each genomeA contig reads against the associated genomeB projected reference.

in summary, if genomeA and genomeB are as similar as you expect, then forcing all reads from a genomeA contig to map its corresponding genomeB reference would retain a similar coverage, and you would end up with several genomeB BAM files which you could eventually merge into a single final genomeB BAM file containing all original reads now mapped as desired.

score 2 · Answer 3 · 2011-07-19

Since I read your question for the first time, the feeling was bothering me that the approach you want to take is invalid. I have been thinking about a solution since then and now see the need to supply an answer. However, that might not be the answer you would like to have, some critical points also due to missing information in your question.

In one sentence: you will make up (fake) read alignments, indirectly claiming that reads map in a position of genome B, while they in fact don't align there and could only be aligned to genome A.

First, I have to question your statement:

Genome A and genome B are too different to align the short reads from A straight to genome B, but close enough that I can align most of the thousands of assembled contigs from genome A to genome B

How can you possibly claim that with only trying bwa? And what is the real distance between those genomes, what is the sequence identity, and what is the degree of genome rearrangements/syntheny?

One of the core problems of your approach is using bwa to align the reads. bwa has low sensitivity, especially when it comes to more different sequences. So, try a more sensitive aligner first to map the reads directly, eg. bfast, blat, mosaic, novoalign, lastz, or try LAST (thanx to darked's comment).

Why is the mapping approach invalid? There are two approaches I could think of:

globally align whole contigs of genome A against Genome B (pretty much treating it like a large short read), yielding one mapping coordinate, ambiguity asside, for the contig of A in genome B.
locally align the contigs such that partial contigs of A, with minimum sequence identity cut-off, within a contig can be aligned to a region in B, yielding possibly a set of genomic coordinates in B per contig.

Both approaches have in common that they don't deal with ambiguities, and that is a big deal.

Then, I claim that approach 1 in invalid and will create bogus mapped coordinates of reads, assigning them to regions where definitely do not align.

Only one example of what could go wrong: Imagine the following situation, a contig of genome A, consisting of three regions R1, R2, R3 of high sequence conservation (think of that for example as three highly conserved genes). All these regions are present in genome B as well (R1', R2', R3'), but due to a rearrangement, R2' is replaced by R4' and relocated to a different position. Now, following your relatively low sequence-identity threshold R1,R2,R3 -> R1',R4',R3'. Now for genome A, you got a lot of reads for R2, which now will be assigned to R1',R4',R3' but they will be assigned to region R4', where they definitely do not belong to.

Based on the above, the second approach would be much better, that would be best accomplished by a bidirectional best blast hit between the genomes (with eg. identity cut-off 90%), and then mapping those regions. Would also need to treat ambiguities somehow, and again bwa is not the tool of choice. However, if stretches of high sequence similarity existed, then the reads would possibly also align there directly.

Conclusion: Before thinking about how to technically perform an analysis, we have to think more about the validity of the approach. Lacking some important pieces of information, the above couldn't be the final word there. I consider mapping read alignments between distant genomes problematic. The risk is to build up the impression (or tricking yourself into believing) that there exist alignments where in fact they couldn't be aligned, and that ambiguities go unnoticed. Only alignments that can be found and proven are valid alignments, if your reads don't align, yes, then they don't align. Try more sensitive tools to align reads directly.

Ram · Answer 4 · 2011-07-19

1

Entering edit mode

13.4 years ago

Darked89 4.7k

You did not provided info about the reads you are mapping (Illumina paired?), their lengths and the genomes (no exact species needed, but it may help if one knows if these are i.e. compact Eukaryote, complex mammals, or some even more exotic combination).

For inter-mammalian short read mapping you may look at LAST last.cbrc.jp): http://last.cbrc.jp/doc/last-tutorial.txt (see Example 8: Align aardvark fastq reads to the human genome). That way you may (or not...) sensibly align reads from A to B.

ADD COMMENT • link updated 5.2 years ago by Ram 44k • written 13.4 years ago by Darked89 4.7k

0

Entering edit mode

I've now added an example where I first assembled short reads from species A in contigs in species A, cross-align contigs in species A to species B using LASTZ, and also pile up all the reads from species A onto the assembled contigs in species A. I would now want to do the coordinate projection from one bam to another as described in the example.

ADD REPLY • link 13.4 years ago by 2184687-1231-83- ★ 5.1k