I would like to align the contigs from the recent [1] assembly of NA12878 to the latest human genome reference sequence (hg19). I have considered using BWA-SW, BLAT and LASTZ. I would greatly prefer to use the SAM/BAM format because it will facilitate my downstream analysis. However, BWA-SW prefers query sequences in the 1-2Mb range, while this assembly has contigs in the tens of megabases. LASTZ, on the other hand, is not well-suited for aligning to many chromosomes at once. BLAT is difficult because the PSL to BAM conversion is imperfect.
Has anyone done this?
If you were to do this, what tool would you use or how would you go about it?
[1] Gnerre et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci USA (2011) vol. 108 (4) pp. 1513-8
Ah, thanks Heng. A colleague recently mentioned Salzberg's new aligner, but I had forgotten all about it. Yes, there are 80 contigs > 10Mb and 357 > 1Mb.
11.5Mb is the N50 of the scaffolds. The contig N50 is only 24kb. BWA-SW will not align through the gaps between contigs, so aligning contigs is preferred. Nonetheless, a whole-genome aligner may be a better choice. I am not sure.
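To make the scaffold/contig distinction concrete, here is a minimal Python sketch that cuts a scaffold into contigs at runs of N and computes N50 from a list of lengths. The function names, the `min_gap` parameter and the contig naming scheme are all made up for illustration; they do not come from BWA-SW or from the assembly itself, and a real pipeline would read the sequences from FASTA.

```python
import re

def split_scaffold(name, seq, min_gap=1):
    """Yield (contig_name, start_offset, contig_seq) for a scaffold.

    The scaffold is cut at every run of at least `min_gap` N characters.
    `min_gap` and the naming scheme are illustrative assumptions.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    start = 0
    index = 0
    for match in gap.finditer(seq):
        if match.start() > start:
            index += 1
            yield (f"{name}_contig{index}", start, seq[start:match.start()])
        start = match.end()
    # Emit whatever follows the last gap (or the whole scaffold if no gap).
    if start < len(seq):
        index += 1
        yield (f"{name}_contig{index}", start, seq[start:])

def n50(lengths):
    """Return the length L such that sequences >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

if __name__ == "__main__":
    contigs = list(split_scaffold("scaffold1", "ACGTNNNNGGCCNAT"))
    for contig_name, offset, contig_seq in contigs:
        print(contig_name, offset, contig_seq)
    print("contig N50:", n50([len(c[2]) for c in contigs]))
```

Keeping the start offset of each contig means alignments of the pieces can later be placed back on scaffold coordinates.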
Yes, you're right. Sorry for the confused nomenclature.
@lh3: I've just tried a simple human-mouse example: I built mouse PAX2, PAX5 and PAX8 contigs from a simulated 300x Illumina sequencing run assembled with ABySS, and then tried to align the mouse contigs to human using bwa bwasw. The alignments are not good, even with high -Z values: "samtools view ftp://ftp.ebi.ac.uk/pub/databases/ensembl/avilella/t/bwasw/mouse.pax5.x300.contigs.fa.fasta.human.bwasw.100000.bam"
If you have RNA-seq contigs, gmap and blat may be a better choice. I was mostly talking about mapping genomic sequences.
These are the whole PAX genomic regions, 60-100kb each, for example http://www.ensembl.org/Mus_musculus/Location/View?g=ENSMUSG00000004231;r=19:44831882-44910520