What Is The Best Method For Aligning Two Genome Assemblies?
13.8 years ago

I would like to align the contigs from the recent [1] assembly of NA12878 to the latest human genome reference sequence (hg19). I have considered using BWA-SW, BLAT and LASTZ. I would greatly prefer to use the SAM/BAM format because it will facilitate my downstream analysis. However, BWA-SW prefers query sequences in the 1-2Mb range, while this assembly has contigs in the tens of megabases. LASTZ, on the other hand, is not well-suited for aligning to many chromosomes at once. BLAT is difficult because the PSL to BAM conversion is imperfect.

Has anyone done this?

If you were to do this, what tool would you use or how would you go about it?

[1] Gnerre et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci USA (2011) vol. 108 (4) pp. 1513-8

alignment assembly genome sam • 9.2k views
13.8 years ago
lh3 33k

Probably you want to try this:

http://www.citeulike.org/group/10570/article/8403903

I would probably split long contigs into 1 Mbp chunks and use BWA-SW (I actually wanted to do this but have not had time). By the way, do they really get contigs of tens of Mbp? How long are the scaffolds/supercontigs?
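A minimal sketch of the chunking step, assuming plain FASTA input. The helper names (`read_fasta`, `chunk_contig`) and the `_chunk<offset>` naming scheme are made up for illustration and are not part of BWA or any other tool. Keeping the 0-based start offset in each chunk name should make it possible to translate alignment coordinates back to the original contig afterwards.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            # Keep only the first word of the header as the record name.
            name, parts = line[1:].split()[0], []
        else:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def chunk_contig(name: str, seq: str,
                 chunk_size: int = 1_000_000) -> Iterator[Tuple[str, str]]:
    """Yield (chunk_name, chunk_seq) pieces of at most chunk_size bases.

    The chunk name carries the 0-based start offset within the contig
    (hypothetical naming convention, chosen for this sketch only).
    """
    for start in range(0, len(seq), chunk_size):
        yield f"{name}_chunk{start}", seq[start:start + chunk_size]


def write_chunks(fasta_lines: Iterable[str], out,
                 chunk_size: int = 1_000_000) -> None:
    """Write every contig in fasta_lines to out as chunked FASTA records."""
    for name, seq in read_fasta(fasta_lines):
        for chunk_name, chunk_seq in chunk_contig(name, seq, chunk_size):
            out.write(f">{chunk_name}\n{chunk_seq}\n")
```

The chunked FASTA can then be given to the aligner as the query file. Note that fixed-size chunks can cut through a real alignment, so hits that end exactly at a chunk boundary probably need to be stitched together downstream.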

EDIT:

Perhaps also try this:

http://www.cs.utoronto.ca/~brudno/721.full.pdf

Just read the NA12878 paper. The contig N50 is 24kb. I would certainly map contigs rather than supercontigs.
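For anyone unsure what the N50 figure measures, here is a small sketch under the usual definition: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of all bases. The function name `n50` is just illustrative.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First length at which the cumulative sum covers half the bases.
        if running >= half:
            return length
    return 0
```

Running it separately on contig lengths and on scaffold lengths gives two different numbers for the same assembly, which is why it matters which of the two is being quoted.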

EDIT2:

Aaron, have you tried Mugsy (the one described by the link above)? From my reading of the paper just now, it may need tens of CPU days to align two human assemblies. For a 1000 Genomes request, I have mapped the NA12878 contigs using BWA-SW.


Ah, thanks Heng. A colleague recently mentioned Salzberg's new aligner, but I had forgotten all about it. Yes, there are 80 contigs > 10Mb and 357 > 1Mb.


11.5Mb is the N50 of the scaffolds. The contig N50 is only 24kb. BWA-SW will not align through the gaps between contigs, so aligning contigs is preferred. Nonetheless, a whole-genome aligner may be a better choice. I do not know.
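If only scaffolds are at hand, one way to get back to contigs is to cut each scaffold at its runs of N. This is a rough sketch assuming gaps are encoded as N characters, which is the common convention but not something I have checked for this particular assembly. The function name, the `min_gap` parameter and the `_contig<index>` naming are all hypothetical.

```python
import re
from typing import Iterator, Tuple


def split_at_gaps(name: str, seq: str,
                  min_gap: int = 1) -> Iterator[Tuple[str, int, str]]:
    """Yield (contig_name, start, contig_seq) for each gap-free stretch.

    A gap is a run of at least min_gap N/n characters. start is the
    0-based offset of the contig within the scaffold.
    """
    gap_pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    start = 0
    index = 0
    for gap in gap_pattern.finditer(seq):
        # Emit the sequence before this gap, skipping empty pieces
        # (for example when the scaffold begins with Ns).
        if gap.start() > start:
            yield f"{name}_contig{index}", start, seq[start:gap.start()]
            index += 1
        start = gap.end()
    # Emit whatever follows the last gap.
    if start < len(seq):
        yield f"{name}_contig{index}", start, seq[start:]
```

Raising `min_gap` keeps isolated ambiguous bases inside a contig instead of splitting on every single N.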


Yes, you're right. Sorry for the confused nomenclature.


@lh3: I've just tried a simple human-mouse example: I made mouse PAX2, PAX5 and PAX8 contigs from a simulated 300x Illumina sequencing run assembled with ABySS, and then tried to align the mouse contigs to human using bwa bwasw. The result is not good, even with high -Z values: "samtools view ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/bwasw/mouse.pax5.x300.contigs.fa.fasta.human.bwasw.100000.bam"


If you have RNA-seq contigs, gmap and blat may be a better choice. I was mostly talking about mapping genomic sequences.


These are the whole PAX genomic regions, 60-100 kb each, for example http://www.ensembl.org/Mus_musculus/Location/View?g=ENSMUSG00000004231;r=19:44831882-44910520
