Hi all,
I would like to know more about long-read sequencing as done by Nanopore or PacBio. Getting the basic biochemistry behind the idea and the current problems (higher error rate, e.g. due to homopolymers) was not a big issue, but processing and making actual use of the data (in a bioinformatic sense) is not so easy, let's say. Do any of you have experience with these sequencing technologies? What sequencing depth with long reads (5 kb and more), for instance, would you suggest to detect SNVs, indels and CNVs for whole exome or genome? Can we trust the new methods Nanopore claims will reduce sequencing error to close to 1% (hidden Markov fields)? What about methylation? I would really appreciate some feedback or literature suggestions (apart from the papers Nanopore presents).
Thanks guys and have a nice day,
Chris
Two preprints we recently published, using PromethION data for structural variation and tandem repeat variation: Structural variants identified by Oxford Nanopore PromethION sequencing of the human genome and Accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION.
Thanks for the fast reply. The second publication I had read before, and this is what keeps me wondering. On the Nanopore webpage, they claim that a PromethION flow cell can theoretically yield a maximum of 350 Gb, but in practice 150 Gb is reached. In publications using this device, something between 70 Gb and 90 Gb per flow cell is reported. Given the distribution of the base-calling accuracy and the cost per flow cell (smallest bundle: $2,000 per cell), this result is not too impressive. For our project, we were considering using the new PromethION flow cell, making use of the four channels. We calculated with 120 Gb per flow cell, which would leave us with 30 Gb per channel. Since only the coding region is of interest to us, we were thinking of 200X whole exome and the remaining sequencing capacity for RNA (per channel!). With short reads, as Illumina provides, 400X is needed for whole exome to detect SNVs or CNVs. What is your opinion on that?
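For what it's worth, the depth figures above follow from a simple yield-over-target-size calculation. Here is a minimal sketch; the ~45 Mb exome and ~3.1 Gb genome target sizes are my own assumed round numbers, not figures from this thread, and the result is an optimistic upper bound since it assumes every sequenced base maps on target:

```python
def mean_coverage(yield_bases: float, target_bases: float) -> float:
    """Optimistic mean depth, assuming every sequenced base maps on target."""
    return yield_bases / target_bases

# 120 Gb flow cell split over 4 channels -> 30 Gb per channel
per_channel = 120e9 / 4

# Assumed target sizes: ~45 Mb exome, ~3.1 Gb human genome
print(round(mean_coverage(per_channel, 45e6)))   # exome depth upper bound
print(round(mean_coverage(per_channel, 3.1e9)))  # whole-genome depth
```

By this rough math, 30 Gb per channel would be far more than enough for 200X on an exome-sized target, but only single-digit depth genome-wide; real on-target yield with long reads would of course be much lower than this ceiling.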
As far as I know, the maximum yield reached internally at ONT is slightly above 200 Gb. In external labs, such as ours, yields are more modest and rarely go over 100 Gb. We are not yet close to the theoretical yield, but the improvements over the past months suggest we will get there eventually. It all comes down to doing a decent library prep on fresh, high-quality DNA.
The current PromethION flow cell design has one physical chamber. Yes, there is a design with four inlets and physical separation, but that's not what is being produced now. I've been told they would reintroduce it, but no timeline was given.
That just doesn't seem like the right fit for a nanopore platform. Why (and how?) would you do target enrichment? Coding exons are ~200 nucleotides, so you would only be sequencing small fragments, and that's not where the strength of the technology lies. PCR-based exome sequencing preps (as for Illumina) are not the best you can do here. If you are going to sequence short reads, then do it on a short-read platform.
While it is technically possible to detect SNVs/SNPs on the PromethION, it's not the most appropriate application. The real strength is in previously hidden structural variants, including CNVs, inversions, tandem repeat expansions and so on; in other words, the things you cannot (easily, accurately, or at all) detect with Illumina. For that purpose, 10x coverage is sufficient for high precision, but recall increases up to 40x coverage and plateaus above that. I have no idea where you got the 400x requirement.
Further reactions:
As mentioned in another comment: SNP calling is not the main application of a nanopore platform, and neither is exome sequencing.
That's a bit unclear to me. Can you elaborate? It seems you are referring to hidden Markov models, but that's an approach ONT hasn't used for basecalling in a while. The current basecaller is a recurrent neural network, though I've also heard they're moving on to yet another model 'in the near future'.
Citation needed. This could be a consensus accuracy, or an estimate of future performance using 1D^2 sequencing. It is for sure not the current error rate, which is around 12% for human data.
Yes, you can, provided that you sequence native DNA (i.e. without amplification steps). See for example this recent publication: Using long-read sequencing to detect imprinted DNA methylation
Sorry, I am a bit caught up in some of our recent samples, where we used such deep exome sequencing because of a very rare combination of genetic disorders, and this is the only exome seq I have worked with so far (I am actually an RNA guy). 100X seems reasonable for homo-/heterozygous SNVs (Meynert et al., 2013).
We recently had a talk at our institute given by Nanopore (this is how my interest in Nanopore arose), claiming that their goal by the end of the year is to reduce the error rate drastically (yes, in conjunction with 1D^2 sequencing, so it might be a consensus accuracy); close to 1% was their target. Nevertheless, you are right: without actually proving it and making the results available to everyone, that statement is very misleading. Consider it retracted!
You are right, the latest basecaller is based on neural networks. My information came from this paper by Franka J. Rang et al., but HMMs are outdated!
Thanks for all the effort you put into sharing your experience with me. Having this discussion beforehand could have saved me some time!
Not going to dig into it again, but from memory I believe that paper says 13x coverage is enough for 95% sensitivity for heterozygous SNPs?
Long story short: using long reads can give you thousands of structural variants which you cannot find with Illumina. That's the main application (for human DNA sequencing) right now.