2) The error rate is ultimately Illumina's error rate. The synthetic long reads are just assemblies of many Illumina short reads generated from the same long-ish molecule, so the error rate should typically be much lower than that of a single Illumina read... barring misassemblies.
4) It's not really relevant for de novo assembly because it still can't (in general) resolve repeats longer than the read length. For assembly it's no better than shotgun sequencing (according to people at my lab who experimented with it), but a lot more expensive.
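A toy sketch of the repeat problem mentioned above (purely illustrative, with made-up sequences): two genomes that differ only in how unique segments are arranged between copies of a repeat yield exactly the same set of reads once the reads are shorter than the repeat, so no assembler can tell the two arrangements apart from those reads alone.

```python
# Toy illustration (not real data): reads shorter than a repeat cannot
# distinguish different arrangements of the segments between its copies.

def reads(genome, read_len):
    """All error-free reads tiled at every position (idealised shotgun data)."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}

# Unique flanking segments (11 bp each) and one repeat (21 bp).
A, B, C, D = "ACGTACGTGGA", "TTGACCATGCA", "GGCATTACAGT", "CCTAGGATACC"
R = "AAATTTCCCGGGAAATTTCCC"

genome_1 = A + R + B + R + C + R + D
genome_2 = A + R + C + R + B + R + D   # B and C swapped between repeat copies

SHORT_READ = 15   # shorter than the repeat
LONG_READ = 50    # longer than the repeat plus its flanks

print(reads(genome_1, SHORT_READ) == reads(genome_2, SHORT_READ))  # True: indistinguishable
print(reads(genome_1, LONG_READ) == reads(genome_2, LONG_READ))    # False: resolvable
```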
We have found it to be useful for resolving genic regions, and it can resolve repetitive regions as long as they are not tandem repeats. But we have also found that it appears to be significantly biased for specific genomic regions. We never really saw obvious problems with misassemblies.
I'm interested to see how 10x Genomics data pans out, which uses a similar strategy (batches of assemblies) but scaled up using emulsion PCR. They just released their own assembler, Supernova.
Good point - it should theoretically be able to resolve repeats outside of the "long read", just not inside of it. So, for example, it should be better at assembling ribosomal sequences, which are often present in many copies, but are not tandem repeats.
Individual reads are not 10 kb long; no current Illumina sequencer can produce reads that long. That is the starting length of the DNA that goes into these libraries.
Libraries can be created using as low as 500 ng starting DNA (info from the PDF application note on the page you linked above).
We have seen 'reads' (i.e. assembled fragments) up to 10 kb and sometimes longer, but they typically average more like 6-8 kb if you have a decent library prep.
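If you want to check this on your own data, here is a minimal sketch (the FASTA filename is hypothetical) that computes the count, mean length, and N50 of the assembled synthetic long reads:

```python
# Minimal sketch: length statistics of assembled synthetic long reads
# from a FASTA file (filename is a placeholder, adjust to your data).

def fasta_lengths(path):
    """Yield the sequence length of each record in a FASTA file."""
    length = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                if length:
                    yield length
                length = 0
            else:
                length += len(line.strip())
    if length:
        yield length

lengths = sorted(fasta_lengths("synthetic_long_reads.fasta"), reverse=True)
total = sum(lengths)

# N50: length of the read at which the cumulative sum reaches half the total.
running, n50 = 0, 0
for l in lengths:
    running += l
    if running >= total / 2:
        n50 = l
        break

print(f"reads: {len(lengths)}")
print(f"mean length: {total / len(lengths):.0f} bp")
print(f"N50: {n50} bp")
```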
So Illumina performs an assembly first and gives us the long "read".
What is the advantage of their method? Do they guarantee long fragments? I mean, we could perform that assembly ourselves.
The advantage is that the long read assembly happens on a smaller scale. If you are doing regular Illumina sequencing, each read can come from anywhere in the genome. With the synthetic long reads, each well only has a few ~10kb fragments. Therefore, each read from that well should assemble into those fragments. You are assembling a small part of a genome as opposed to a whole genome. You could extract the individual short reads and assemble them yourself. In fact, that is what you get by default. You then have to use BaseSpace to do the long read assembly that will generate those synthetic long reads.
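To make the "smaller scale" point concrete, here is a minimal sketch of the idea (not Illumina's actual pipeline): bin the short reads by the well they came from and assemble each tiny bin on its own. How the well identity is encoded in the read names is an assumption here, and the filenames are hypothetical; real data and the BaseSpace app will differ.

```python
# Minimal sketch of per-well binning before local assembly.
# Assumption: the well barcode is the last '#'-separated field of the read
# name, and reads are in a gzipped FASTQ called run_R1.fastq.gz (both made up).

import gzip
from collections import defaultdict

def fastq_records(path):
    """Yield (header, sequence, plus, quality) tuples from a gzipped FASTQ."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                return
            seq = handle.readline().rstrip()
            plus = handle.readline().rstrip()
            qual = handle.readline().rstrip()
            yield header, seq, plus, qual

def well_of(header):
    """Hypothetical: take the well barcode from the last '#' field of the name."""
    return header.split("#")[-1]

# Bin reads by well; each bin only contains reads from a few ~10 kb fragments,
# so each bin is a small, local assembly problem rather than a whole genome.
bins = defaultdict(list)
for record in fastq_records("run_R1.fastq.gz"):
    bins[well_of(record[0])].append(record)

for well, records in bins.items():
    with open(f"well_{well}_R1.fastq", "w") as out:
        for header, seq, plus, qual in records:
            out.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
    # Each per-well FASTQ can then be assembled independently, e.g.:
    #   spades.py -s well_<well>_R1.fastq -o asm_<well>
```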
Another advantage is that you don't need additional instrumentation if you are an Illumina facility. Since it uses Illumina technology, it's also cheaper than PacBio or Nanopore (a major concern if your genome is over 100 Mb).
"Libraries can be created using as low as 500 ng starting DNA (info from the PDF application note on the page you linked above)."
That is true, but the DNA has to be of very good quality. My group has done this a few times and we had to restart with new DNA every time because it turned out that it was not good enough despite fulfilling the official requirements.
It's also an extremely laborious process compared to other library preps.
Thanks for your answer.