Dear community,
As the title says: I'm confused.
There are a couple of pipelines for processing Nanopore amplicon data (e.g. 16S), for example ONT-AmpSeq: https://febs.onlinelibrary.wiley.com/doi/10.1002/2211-5463.13868
These pipelines often include a polishing step (e.g. using Racon or Medaka) after clustering, and this is what confuses me.
This is what happens, in my interpretation (EDIT: I misinterpreted the consensus building. It is not the consensus that gets polished, but an actual read, e.g. the centroid):
- The reads are clustered, based on a similarity threshold (e.g. 97%), and a consensus sequence is generated OR a vsearch centroid sequence is selected.
- The original reads, i.e. the ones used as input to generate the consensus sequence, are then mapped back onto that consensus sequence.
- A tool such as racon is then used to polish the consensus sequence, based on the mapped original reads. This results in a polished consensus sequence: the final OTU. (See the sketch right after this list.)
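To make those steps concrete, here is a minimal sketch of how I picture them, wrapped in Python via subprocess. The file names and the 97% threshold are placeholders I made up, not ONT-AmpSeq's actual commands or defaults:

```python
# Minimal sketch of the cluster -> map -> polish steps described above.
# All file names and parameter values are hypothetical placeholders.
import subprocess

reads = "filtered_reads.fasta"  # hypothetical quality-filtered amplicon reads

# 1) Cluster at ~97% identity; vsearch writes one centroid read per cluster.
subprocess.run(
    ["vsearch", "--cluster_fast", reads, "--id", "0.97",
     "--centroids", "centroids.fasta", "--uc", "clusters.uc"],
    check=True,
)

# 2) Map the original reads back onto the centroids (Nanopore preset).
with open("aln.sam", "w") as sam:
    subprocess.run(
        ["minimap2", "-ax", "map-ont", "centroids.fasta", reads],
        stdout=sam, check=True,
    )

# 3) Polish each centroid with the reads mapping to it; racon expects
#    reads, alignments, and target sequences, in that order.
with open("polished_otus.fasta", "w") as out:
    subprocess.run(
        ["racon", reads, "aln.sam", "centroids.fasta"],
        stdout=out, check=True,
    )
```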
For me, it feels illogical to polish a clustered consensus sequence based on the very reads that yielded that consensus in the first place. How can you polish something with essentially the same data? It feels especially odd in amplicon sequencing, because some species have very similar regions in the commonly used rRNA genes.
In genome assembly, creating a draft assembly with Nanopore data and then polishing that assembly with Illumina data makes a lot of sense to me. But I believe polishing in genome assemblies is also done with Nanopore data alone.
I'm not trying to denigrate this pipeline or any other; I just want to understand the concept and whether it is really correct to do this.
Help me understand: Are you assuming that the reads belonging to the same cluster are identical at the nucleotide level?
Edit: I understand what you mean, and indeed, polishing feels unnecessary given how the consensus sequence is generated prior to the polishing step.
A polishing step like that makes more sense in a pipeline like NanoCLUST, which doesn't use vsearch to pick a consensus sequence from the cluster.
Thanks for the input. No, what I mean is that the reads are clustered at a similarity level, e.g. 97%. Then a consensus sequence is generated from that cluster. Sometimes this consensus is then polished again with the same input reads. In my head that feels illogical (or I am misinterpreting the pipeline).
I just now realized that the ONT-AmpSeq pipeline picks the centroid read of a cluster (i.e. the seed) and then uses the original reads to polish that centroid read (I edited my post a bit). But do you think that this makes sense? In the end, the centroid read is just one read of that cluster, and has nothing to do with the Average Nucleotide Identity etc.
What you mention about NanoCLUST makes sense, I think. Basically, they take the read that most resembles all of the reads, and then polish that read based on the other reads in that cluster?
The centroid is not a consensus but a real sequence, with sequencing errors, that can serve as the reference for the polishing step.
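That is also why the polishing is not circular: if sequencing errors are random and independent across reads, a simple column-wise majority vote over the cluster corrects the errors in any single read, including the centroid. A minimal sketch of that intuition (pure Python on fake, pre-aligned, substitution-only reads; real polishers like racon work on actual alignments and are far more sophisticated):

```python
# Toy illustration: per-read errors are random, so a column-wise majority
# vote across the cluster corrects errors in any single read, including
# the centroid. All sequences here are simulated toy data.
import random

random.seed(42)
true_seq = "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT"  # 40 bp toy template
error_rate = 0.1  # toy per-base substitution rate

def noisy_read(seq, rate):
    """Copy of seq with random substitution errors (no indels, for simplicity)."""
    bases = "ACGT"
    return "".join(
        random.choice(bases.replace(b, "")) if random.random() < rate else b
        for b in seq
    )

# Simulate a cluster of reads from the same template (already "aligned":
# same length, no indels -- a big simplification over real Nanopore data).
cluster = [noisy_read(true_seq, error_rate) for _ in range(30)]
centroid = cluster[0]  # just one read of the cluster, sequencing errors included

# "Polish" by majority vote in each column, using all reads of the cluster.
polished = ""
for i in range(len(true_seq)):
    column = [r[i] for r in cluster]
    polished += max("ACGT", key=column.count)

print("errors in centroid:", sum(a != b for a, b in zip(centroid, true_seq)))
print("errors after vote: ", sum(a != b for a, b in zip(polished, true_seq)))
```

With 30 reads and a 10% per-base error rate, the centroid typically carries a few errors while the vote recovers the template; that improvement is what the polishing step buys you.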
To me, the strategy used by ONT-AmpSeq doesn't sound so different from the one described in NanoCLUST, because the read with the highest average intra-cluster ANI is going to be the centroid, or something very near to the centroid.

Thanks Andres, this makes sense now! I now realize I misinterpreted the consensus building step.
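For future readers, a quick toy check of that "highest average intra-cluster ANI is roughly the centroid" intuition: picking the read most similar, on average, to the rest of the cluster (the NanoCLUST-style choice) selects an ordinary, error-containing read near the cluster centre. The noisy() and avg_identity() helpers are mine, and difflib's ratio() is only a crude stand-in for real alignment-based ANI, so treat this as an illustration, not a benchmark:

```python
# Toy check: which read has the highest average similarity to all other
# reads in its (simulated) cluster? That read is an ordinary, error-
# containing read near the cluster centre -- roughly a vsearch centroid.
import difflib
import random

random.seed(1)
bases = "ACGT"
template = "".join(random.choice(bases) for _ in range(60))

def noisy(seq, rate=0.1):
    """Toy read simulator: substitution errors only, no indels."""
    return "".join(
        random.choice(bases.replace(b, "")) if random.random() < rate else b
        for b in seq
    )

cluster = [noisy(template) for _ in range(20)]

def avg_identity(read, cluster):
    """Mean pairwise similarity of one read against the rest of the cluster."""
    others = [o for o in cluster if o is not read]
    return sum(difflib.SequenceMatcher(None, read, o).ratio() for o in others) / len(others)

# The "NanoCLUST-style" pick: the read most similar, on average, to all others.
best = max(cluster, key=lambda r: avg_identity(r, cluster))
print("picked read index:", cluster.index(best))
print("its similarity to the template:",
      difflib.SequenceMatcher(None, best, template).ratio())
```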