Hi, I'd appreciate your help with the following questions:
1. Why should we map our reads to a reference genome?
2. How does de novo assembly software work without a reference genome? Does it use other sources as references, such as ESTs or GSSs, or does it work only from the overlapping parts of the reads?
Hi Aboozar. Welcome to Biostars! I would like to point out that, without much more context, it is very unlikely that the forum users will be able to provide you with a sensible answer. Please take some time to tell us what kind of data you have, what the outline of your experiment is, and exactly what you are trying to accomplish. You will find that you get much more useful answers that way. These answers will in turn help others who may have similar questions. Cheers
Yes, it's always easier to solve a jigsaw puzzle by looking at the box. When you don't have the box you need to compare the pieces themselves. This becomes onerous with so many pieces of different lengths and shapes, so most modern assemblers chop the pieces up into even smaller squares. This might seem ridiculous, but it allows the puzzle to be solved using an index rather than comparing a billion pieces with each other.
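To make the "index" idea concrete, here is a minimal sketch in Python. It is not how any particular assembler is implemented; the function names and the toy reads are made up for illustration. Each read is chopped into k-mers, and a dictionary records which reads contain each k-mer, so reads that share a k-mer can be found by lookup instead of by comparing every read against every other read.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_kmer_index(reads, k):
    """Map each k-mer to the set of read indices that contain it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(read_id)
    return index


def reads_sharing_a_kmer(index):
    """Return pairs of read indices that share at least one k-mer."""
    pairs = set()
    for read_ids in index.values():
        ids = sorted(read_ids)
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs


# Toy reads (hypothetical data): the first two share the 4-mer "GTAC".
reads = ["ACGTAC", "GTACGG", "TTTTTT"]
index = build_kmer_index(reads, k=4)
print(reads_sharing_a_kmer(index))  # {(0, 1)}
```

The choice of k here is arbitrary; real assemblers pick it based on read length, error rate, and coverage.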
If I'm fairly certain that my current sequencing data has a high similarity to an already published reference genome, it's a lot faster to align to a reference genome than it is to try for a de novo assembly. De novo assembly will also require lots more RAM (I think upwards of 64 GB for assembly vs 3-5 GB for mapping) and perhaps deeper coverage depending on your technology.
However, you do impose a prior expectation of what you think the entire genome looks like, and that expectation is never entirely accurate (along the lines of what Jeremy said in his answer). While >99% of any human genome (for instance) may be identical to hg19, the fraction that doesn't match could have interesting features such as indels or structural rearrangements that you may miss. Short-read aligners can only tell you what resembles your reference genome; reads from the parts that inevitably differ too much will simply be left unmapped.
One approach to this problem that I've heard of, but don't have much experience with, is to first do a short-read alignment to get the >99% of reads that match hg19 out of the way, then attempt an assembly on the remaining reads to find the structural features represented by the high-quality unmapped reads. It looks like this might be called comparative genome assembly. There are also programs like BreakSeq out there that will specifically look for structural rearrangements.
These approaches are computationally cheaper than assembly and take advantage of a reference sequence, but still acknowledge that any genome has unique structural features.
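As a rough sketch of the "keep the high-quality unmapped reads" step, the Python below filters SAM-format lines. It relies on two things from the SAM format itself: the FLAG field (column 2) has bit 0x4 set for unmapped reads, and SEQ and QUAL are columns 10 and 11. Everything else is my own assumption for illustration: the function names, the `min_mean_quality` threshold of 20, and Phred+33 quality encoding. In practice you would more likely use an existing tool for this step.

```python
def mean_phred(qual, offset=33):
    """Mean base quality of a QUAL string, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def high_quality_unmapped(sam_lines, min_mean_quality=20):
    """Yield (name, seq, qual) for unmapped reads that pass a quality cutoff.

    min_mean_quality is a hypothetical threshold chosen for this sketch.
    """
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if not flag & 0x4:                # bit 0x4 set means "unmapped"
            continue
        if qual == "*" or not qual:       # qualities not stored; skip here
            continue
        if mean_phred(qual) >= min_mean_quality:
            yield name, seq, qual


# Toy records (hypothetical data): one mapped, two unmapped.
sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tTTAA\t!!!!",
]
for name, seq, qual in high_quality_unmapped(sam):
    print(name, seq)  # r2 GGCC
```

The reads that come out of a filter like this would then be handed to an assembler.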
Mapping is done because genomes such as human and mouse are almost complete and well annotated, so most sequencing-based questions can be answered by aligning the reads against the respective genome. Computationally, it is much easier and more efficient to align reads than to do a de novo assembly. An aligner is also much easier to parallelize than an assembler, and assemblers require an order of magnitude more memory than aligners.
De novo assembly is good if you do not have a reference genome to start with (e.g. some exotic fish). The assembler builds a graph from the overlapping parts of the reads (k-mers), then finds 'long paths' within this graph and reports them as contigs. Assembling a mammalian genome is quite expensive and requires > 30x coverage. Even then there are problems with repeats and mis-assemblies, which can be difficult to correct. Ideally you would need long reads, or at least paired-end reads with a long insert size, to get good assembly results.
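The graph idea can be sketched in a few lines of Python. This is a toy de Bruijn graph, which is one common way of building such a graph from k-mers, not the implementation of any specific assembler. The function names and reads are made up, and the sketch assumes error-free reads from a single strand and ignores isolated cycles. Nodes are (k-1)-mers, each k-mer in a read adds an edge, and every non-branching path is reported as a contig.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def contigs(graph):
    """Spell out every maximal non-branching path in the graph."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1

    def is_simple(node):
        # Exactly one way in and one way out: safe to walk straight through.
        return indegree[node] == 1 and len(graph.get(node, ())) == 1

    result = []
    for node in list(graph):
        if is_simple(node):
            continue                      # paths start only at branch/end nodes
        for nxt in graph[node]:
            contig = node + nxt[-1]
            while is_simple(nxt):
                (nxt,) = graph[nxt]       # the single successor
                contig += nxt[-1]
            result.append(contig)
    return sorted(result)


# Toy reads (hypothetical data) that overlap along "ATGGCGTGCA".
reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
print(contigs(de_bruijn_graph(reads, k=4)))  # ['ATGGCGTGCA']
```

With a repeated sequence in the reads, some nodes would have more than one way in or out, and the walk would stop there and report several shorter contigs. That is the repeat problem mentioned above, and it is why long reads or paired-end reads help.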
It would be better to split the two questions into two separate questions.