5.6 years ago
BioinformaticsLad
I'm aware that nanopore has a problem accurately sequencing simple repeats (e.g. atatatata). But what about tetra-, penta-, and hexa-nucleotide repeats? At what repeat unit length is nanopore (sufficiently) accurate?
I also thought only homopolymers were a problem in nanopore reads. But while doing structural variant calling, I found a huge number of calls in simple repeat regions, which made me suspicious, so I dug a bit deeper. Then I also read this in the PacBio whitepaper on human structural variations
Edit: I've never heard anything about repeats being a problem in any ONT talks I've heard or papers I've read, so this is all news to me.
Homopolymers are definitely a known issue (the current signal doesn't change while the same base sits in the pore, so the basecaller can't easily figure out how many nucleotides went through). I wouldn't really trust a PacBio whitepaper on ONT performance :-)
That said, it is also well known that repeats are highly polymorphic. Some of the calls in those repeat regions might well be real.
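To make the homopolymer point concrete, here is a minimal Python sketch of run-length (homopolymer) compression. The two reads are made up for illustration; they differ only in how many T's were called in one run, which is the kind of error described above. The function names are hypothetical and not taken from any basecaller or aligner.

```python
from itertools import groupby
from typing import List, Tuple


def run_lengths(seq: str) -> List[Tuple[str, int]]:
    """Return (base, run length) pairs for each homopolymer run in seq."""
    return [(base, sum(1 for _ in run)) for base, run in groupby(seq)]


def homopolymer_compress(seq: str) -> str:
    """Collapse every homopolymer run to a single base."""
    return "".join(base for base, _ in groupby(seq))


# Hypothetical reads: same locus, but the T run was called as 6 vs 4 bases.
read_a = "ACGTTTTTTGA"
read_b = "ACGTTTTGA"

print(run_lengths(read_a))
print(run_lengths(read_b))
# After compression the run-length disagreement disappears.
print(homopolymer_compress(read_a) == homopolymer_compress(read_b))
```

If two reads only disagree after you look at run lengths, the difference is likely a homopolymer length error rather than a real variant in the surrounding sequence.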
So there's clearly a conflict of interest when PacBio says ONT sucks! But I doubt they would outright lie (??)
Yes, repeats being highly polymorphic is a valid point and I did think about that, but the problem with my alignments is that they are all of different lengths. If they come from a clonal population, we should expect them to be of (somewhat) equal length?
See example: https://imgur.com/a/qHMR130
Edit: This is the NA12878 dataset from Miten Jain paper.
That looks roughly how I would expect it. Note that part of the noisiness is also because of the aligner shifting the deletion a bit more to the left or to the right.
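The aligner-shifting effect can be illustrated with a small Python sketch. It uses a made-up reference with a perfect (AT)x5 repeat between unique flanks, and lists every position at which a 2 bp deletion produces exactly the same read sequence. The helper names and the toy sequence are assumptions for illustration only.

```python
from typing import List


def delete_at(seq: str, start: int, length: int) -> str:
    """Return seq with `length` bases removed starting at 0-based `start`."""
    return seq[:start] + seq[start + length:]


def equivalent_deletion_starts(seq: str, start: int, length: int) -> List[int]:
    """List every start position whose deletion yields the same sequence."""
    target = delete_at(seq, start, length)
    return [
        pos
        for pos in range(len(seq) - length + 1)
        if delete_at(seq, pos, length) == target
    ]


# Hypothetical reference: unique flanks around a perfect (AT)x5 repeat.
reference = "GGC" + "AT" * 5 + "CCG"

# A 2 bp deletion can be placed anywhere inside the repeat with the same result.
positions = equivalent_deletion_starts(reference, start=3, length=2)
print(positions)
```

Every listed position is an equally valid alignment of the same read, so the deletion can appear at different coordinates in different reads even when the basecalls are identical.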
I think things are slightly better using more recent base callers, i.e. Guppy Flipflop (suspect your screenshot is from an older guppy). If you give me the coordinates and the genome build of that repeat I can take a look in my data and show you a screenshot.
For what it's worth, in the context of SV calling I would remove everything that is smaller than 30 bp. Small indels aren't exactly what I'd use long read sequencing for.
Yeah, this was basecalled using Albacore from way back in 2018. The shifting left or right would make sense if they were perfect repeats, but there are some slight variations in the repeat motifs, so if basecalling were not a problem, the breakpoints should align almost exactly like so: https://imgur.com/a/9mRt94E
I think it means tatataaataaa may be mistaken for tataataataa. These reads are long enough to be anchored on both sides to non-repeat regions, so this isn't an issue of the aligner not knowing where to place the reads either.
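A minimal Python sketch of that size filter, assuming the SV caller writes an `SVLEN` entry in the VCF INFO column (many do, but not all). The function names and the example records are hypothetical. Records without a usable `SVLEN` are kept rather than guessed at.

```python
from typing import Iterable, Iterator, Optional


def sv_length(info: str) -> Optional[int]:
    """Return the largest absolute SVLEN in a VCF INFO string, or None."""
    for field in info.split(";"):
        key, _, value = field.partition("=")
        if key == "SVLEN":
            # SVLEN is negative for deletions and may be comma-separated.
            values = [v for v in value.split(",") if v.lstrip("-").isdigit()]
            if values:
                return max(abs(int(v)) for v in values)
    return None


def filter_small_svs(vcf_lines: Iterable[str], min_len: int = 30) -> Iterator[str]:
    """Yield header lines and records whose SVLEN is at least min_len."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        info = line.rstrip("\n").split("\t")[7]
        length = sv_length(info)
        # Records without SVLEN (e.g. breakends) pass through unchanged.
        if length is None or length >= min_len:
            yield line


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "chr5\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-12",
    "chr5\t200\tsv2\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-250",
    "chr5\t300\tsv3\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=45",
]
kept = list(filter_small_svs(example))
print("\n".join(kept))
```

The 30 bp default mirrors the threshold suggested above; it is a parameter so it can be adjusted per analysis.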
I'd truly appreciate it if you could! hg38 chr5:52,029,030-52,029,135
This is what it looks like in our NA19240 PromethION data (guppy flipflop): https://imgur.com/ck60LZS (the data is available at https://www.ebi.ac.uk/ena/data/view/SAMEA5418551 )
That does look better. So it seems repeats shouldn't be a problem with the new basecaller.
Thanks Wouter, you truly are the patron saint of Nanopore! I'm a big fan of NanoPlot by the way.