I can't find a clear definition that differentiates sequence assembly vs scaffolding.
From my understanding,
Assembly = joining reads into contigs
Scaffolding = joining contigs into scaffolds (using e.g. paired-end reads)
Does that sound right? It seems that assembly must be followed by scaffolding, but definitions of assembly don't even mention scaffolding. Can you assemble a whole genome with just "assembly"?
Assembly is not exactly "joining reads into contigs", but "creating contigs from reads", which is more general. Joining implies the reads are intact (which is sometimes true) but most modern assemblies break them into kmers first and don't actually join any reads.
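To illustrate the "break them into kmers" part, here is a minimal sketch of how reads can be decomposed into kmers and turned into de Bruijn graph edges. The function names and the choice of k are mine for illustration, not taken from any particular assembler, and real assemblers also handle reverse complements, sequencing errors and coverage, which this ignores.

```python
def kmers(read, k):
    """Return every overlapping substring of length k in the read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def debruijn_edges(reads, k):
    """Collect de Bruijn edges: each kmer links its (k-1)-prefix to its (k-1)-suffix.

    Assumption: reads are plain strings; strands and errors are ignored.
    """
    edges = set()
    for read in reads:
        for kmer in kmers(read, k):
            edges.add((kmer[:-1], kmer[1:]))
    return edges


if __name__ == "__main__":
    print(kmers("ACGTA", 3))
    print(sorted(debruijn_edges(["ACGT", "CGTA"], 3)))
```

Note that the graph is built from the kmers only; the original reads are never joined end to end, which is why "creating contigs from reads" is the more accurate description.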
Contigs are sequences of overlapping (contiguous) reads. Paired-end (or mate-pair) reads can be used to estimate the size of the gap between two contigs. When you know the gap, you can make a scaffold, which is just the two contigs with Ns representing the gap in between.
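As a rough sketch of what "two contigs with Ns in between" means in practice (the function name and the `min_gap` fallback are assumptions of mine, not a convention of any specific tool):

```python
def make_scaffold(contig_a, contig_b, gap_estimate, min_gap=1):
    """Join two contigs with a run of Ns whose length is the estimated gap.

    Assumption: if the estimate is below min_gap (e.g. zero or negative,
    which can happen with noisy insert sizes), min_gap Ns are used so the
    join stays visible in the sequence.
    """
    gap = max(int(gap_estimate), min_gap)
    return contig_a + "N" * gap + contig_b


if __name__ == "__main__":
    print(make_scaffold("ACGT", "TTGA", 3))
```

The Ns carry no base information; they only record that the two contigs are adjacent and roughly how far apart they are.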
EDIT: Can you assemble a whole genome with just "assembly"?
Yes. Scaffolding won't give you more information about the actual bases anyway; it just tries to tell you how your contigs are ordered.
written 8.7 years ago by novice, updated 6.2 years ago by Ram
That's not an easy question to answer. Sometimes, people assume it is the length according to the kit. For example, if you have some site-specific enzyme that's supposed to cut every 10kbp on average... then maybe you have a 10kbp library! Or, maybe not.
When possible, it's best to use mapping. If you generate contigs, then keep only the nice long contigs (>20kbp or so) and map to them, and you will get a good insert size distribution. The longer the contigs are with respect to your expected insert size, the less bias you will get, so the ">20kbp" thing actually varies. If you are scaffolding with short-insert reads of 200-400bp insert, then retaining all contigs over 1kbp would be fine.
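The mapping approach above can be sketched roughly as follows. This assumes you have already mapped the pairs and extracted, for each pair, the contig and leftmost coordinate of the left read and the contig and rightmost coordinate of the right read; the tuple layout, the function name and the 20 kbp default are my own illustrative choices.

```python
from statistics import median


def insert_size_estimate(pairs, contig_lengths, min_contig_len=20000):
    """Estimate the median insert size from pairs mapped within long contigs.

    pairs: iterable of (contig_left, left_start, contig_right, right_end),
           a hypothetical layout extracted from your alignments.
    contig_lengths: dict of contig name -> length in bp.
    Pairs spanning two contigs, or on contigs shorter than min_contig_len,
    are skipped because they bias the distribution.
    """
    sizes = []
    for contig_left, left_start, contig_right, right_end in pairs:
        if contig_left != contig_right:
            continue
        if contig_lengths[contig_left] < min_contig_len:
            continue
        sizes.append(right_end - left_start)
    return median(sizes) if sizes else None


if __name__ == "__main__":
    lengths = {"c1": 25000, "c2": 5000, "c3": 30000}
    pairs = [
        ("c1", 100, "c1", 400),
        ("c1", 1000, "c1", 1320),
        ("c2", 10, "c2", 300),   # contig too short, skipped
        ("c1", 50, "c3", 90),    # spans two contigs, skipped
    ]
    print(insert_size_estimate(pairs, lengths))
```

For short-insert libraries you would lower `min_contig_len` (e.g. to 1000), as described above.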
But, what if all your contigs are shorter than your expected insert size? Then... who knows. Try mapping to a related species with a reference, perhaps. BBMerge has a kmer-based mode for merging nonoverlapping read pairs via assembly, which can be used for inferring insert sizes. It's more forgiving than assembly because it ignores some classes of branches. But, I've never tried it with really long inserts (>4kbp) and would not expect it to work all that well.
It's not difficult, but it is data-dependent. What kind of library are you trying to use for scaffolding, and what is the length distribution of your contigs?
In my case, I have contigs. I don't know their exact lengths, but their average length is maybe about 300kbp (or more). I want to scaffold these contigs. How should I do it in this case? Thanks.
A read library is a set of reads processed together (in the laboratory). To describe a library, you need to state:
What kind of input genetic material was used, what platform was your data sequenced on, how long are the reads, what is the expected insert size, how were they fragmented, what chemistry was used, etc. If you are not sure, then ask whoever sequenced the DNA; you have to know this before processing the data.
Ok... so, pick a program that does scaffolding, like SSPACE. Map your reads to your contigs to get the insert size distribution, or whatever it requires as an input. Then run the program according to its instructions (I've never used it, personally).
Yes, thanks.
But I want to write my own scaffolder. That is why I want to know how scaffolding is actually done, because the papers are not clear about it.
That's why I need help.
Thank you
Basically... a scaffolding program constructs a graph in which contigs are the nodes and read pairs are the edges; two contigs A and B are joined by an edge if one read maps to A and the other read maps to B. The processing determines which edges are real, and which are spurious. Once that is known, it is simple to condense the nodes into linear scaffolds.
This discussion ignores issues like sequencing errors and repeated sequences which make scaffolding difficult.
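A very simplified sketch of that graph idea, for someone writing their own scaffolder. Everything here is an assumption for illustration: the input is a list of (contig of read 1, contig of read 2) hits, an edge is treated as "real" if at least `min_support` pairs back it, and contigs with more than two neighbours are treated as ambiguous and cut loose. Read orientation and gap sizes are ignored, which a real scaffolder cannot do.

```python
from collections import Counter


def build_links(pair_hits, min_support=2):
    """Count read pairs linking two different contigs; keep well-supported links."""
    counts = Counter()
    for a, b in pair_hits:
        if a != b:
            counts[tuple(sorted((a, b)))] += 1
    return {edge for edge, n in counts.items() if n >= min_support}


def linear_scaffolds(contigs, links):
    """Condense the contig graph into linear paths (contig order only).

    Assumption: every contig named in links also appears in contigs.
    """
    neighbours = {c: set() for c in contigs}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Contigs with more than two neighbours are ambiguous (e.g. repeats): cut their links.
    ambiguous = {c for c in contigs if len(neighbours[c]) > 2}
    for c in ambiguous:
        for other in neighbours[c]:
            neighbours[other].discard(c)
        neighbours[c] = set()
    scaffolds, seen = [], set()
    # Walk from path ends first (degree 0 or 1), then from anything left over (cycles).
    starts = [c for c in contigs if len(neighbours[c]) <= 1] + list(contigs)
    for start in starts:
        path, current = [], start
        while current is not None and current not in seen:
            seen.add(current)
            path.append(current)
            unvisited = [n for n in sorted(neighbours[current]) if n not in seen]
            current = unvisited[0] if unvisited else None
        if path:
            scaffolds.append(path)
    return scaffolds


if __name__ == "__main__":
    hits = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("C", "D")]
    links = build_links(hits, min_support=2)
    print(linear_scaffolds(["A", "B", "C", "D"], links))
```

In this toy input the single C-D pair is treated as spurious, so D stays on its own. Deciding which edges are real is the hard part in practice, and a simple support threshold like this one is only a starting point.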
When you say "a scaffolding program constructs a graph in which contigs are the nodes and read pairs are the edges; two contigs A and B are joined by an edge if one read maps to A and the other read maps to B. The processing determines which edges are real, and which are spurious":
Do we use only paired-end reads here?
Sorry, I'm fuzzy on the scaffolding process.
written 8.7 years ago by midox, updated 6.2 years ago by Ram