I'm confused: polishing Nanopore consensus sequences (amplicons) with the original reads
4 weeks ago
rDNA ▴ 20

Dear community,

As the title says: I'm confused.

There are a couple of pipelines to process Nanopore amplicon data (e.g. 16S). For example ONT-AmpSeq: https://febs.onlinelibrary.wiley.com/doi/10.1002/2211-5463.13868

In these pipelines, often a polishing step (e.g. using Racon/Medaka) is included after clustering, and this is what confuses me.

EDIT: I misinterpreted the consensus-building step. What gets polished is not a consensus but an actual read (e.g. the centroid).

This is what happens, in my (original) interpretation:

  1. The reads are clustered - based on a similarity threshold, e.g. 97% - and a consensus sequence is generated OR a vsearch centroid sequence is selected.
  2. The original reads - that were used as input to generate the consensus sequence - are then mapped to the consensus sequence
  3. A tool such as racon is then used to polish the consensus sequence, based on the original mapped reads
  4. This results in a polished consensus sequence: the final OTU.
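As a rough illustration of step 1, here is a minimal Python sketch of greedy centroid-based clustering at a similarity threshold. It is a toy in the spirit of what vsearch does, not its actual implementation: the `identity` and `greedy_cluster` functions are made up for this example, and `identity` assumes equal-length, already-aligned reads so that it can simply count matching positions.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions (toy measure: equal-length reads only)."""
    if len(a) != len(b):
        raise ValueError("toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold=0.97):
    """Assign each read to the first centroid it matches at >= threshold.

    A read that matches no existing centroid starts a new cluster and
    becomes that cluster's centroid. Returns a list of (centroid, members).
    """
    clusters = []
    for read in reads:
        for centroid, members in clusters:
            if identity(read, centroid) >= threshold:
                members.append(read)
                break
        else:
            # No centroid was close enough: this read seeds a new cluster.
            clusters.append((read, [read]))
    return clusters


if __name__ == "__main__":
    a = "ACGT" * 25                        # 100 bases
    b = "T" + a[1:50] + "A" + a[51:]       # 2 mismatches vs a (98% identity)
    c = "T" * 10 + a[10:]                  # 8 mismatches vs a (92% identity)
    for centroid, members in greedy_cluster([a, b, c]):
        print(len(members), centroid[:12] + "...")
```

Note that the centroid here is simply the read that happened to seed the cluster, so it carries its own sequencing errors. That is the sequence the later steps map the reads back to and polish.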

To me, it feels illogical to polish a clustered consensus sequence based on the same reads that yielded that consensus sequence. How can you polish something with basically the same data? It feels especially odd in amplicon sequencing, because some species have very similar regions in the commonly used rRNA genes.

In genome assemblies, creating a draft assembly with Nanopore data, and then polishing that assembly with Illumina data, makes a lot of sense to me. But I think polishing in genome assemblies also occurs with only Nanopore data.

I'm not trying to denigrate the pipeline mentioned above or any others; I just want to understand this concept and whether it is really correct to do this.

polishing amplicons pipeline racon nanopore • 431 views
Help me understand: Are you assuming that the reads belonging to the same cluster are identical at the nucleotide level?

Edit: I understand what you mean, and indeed, polishing feels unnecessary because of how the consensus sequence prior to the polishing step is generated.

A polishing step like that makes more sense in a pipeline like NanoCLUST, which doesn't use vsearch to pick a representative sequence from the cluster:

"The next step builds a consensus sequence from the reads belonging to each cluster. For that, the pairwise Average Nucleotide Identity (ANI) between reads in the same cluster is calculated using FastANI. Then, the read with the highest average intra-cluster ANI is chosen and 100 other reads from the same cluster are selected for polishing the sequence. The polishing stage includes one round each in the Canu read-correction module, in Racon, and in Medaka."
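The "read with the highest average intra-cluster ANI" idea can be sketched in a few lines of Python. This is only an illustration, not NanoCLUST's code: `difflib.SequenceMatcher.ratio()` stands in for FastANI as the similarity measure, and the function names `similarity` and `most_representative` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; a real pipeline would use ANI."""
    # autojunk=False avoids difflib's heuristic for sequences of 200+ items,
    # which would otherwise ignore very frequent characters (i.e. all bases).
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def most_representative(reads):
    """Return (index, score) of the read with the highest mean similarity
    to all other reads in the cluster."""
    if len(reads) < 2:
        raise ValueError("need at least two reads to compare")
    best_index, best_score = 0, -1.0
    for i, read in enumerate(reads):
        scores = [similarity(read, other)
                  for j, other in enumerate(reads) if j != i]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_index, best_score = i, mean_score
    return best_index, best_score


if __name__ == "__main__":
    cluster = [
        "ATAGCCTTAGACGTCAATCG",  # one substitution near the start
        "ATGGCCTTAGACGTCAATCG",  # closest to both of the other reads
        "ATGGCCTTAGACGTCCATCG",  # one substitution near the end
    ]
    index, score = most_representative(cluster)
    print(index, round(score, 3))
```

The selected read is still a single raw read with its own errors, which is why it is then polished with other reads from the same cluster.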

Thanks for the input. No, what I mean is that the reads are clustered at a given level of similarity, e.g. 97%. Then a consensus sequence is generated from that cluster. Then sometimes this consensus is polished again with the same input reads. In my head that feels illogical (or I am misinterpreting the pipeline).

I just now realized that the ONT-AmpSeq pipeline picks the centroid read of a cluster (i.e. the seed) and then uses the original reads to polish that centroid read. (edited my post a bit). But would you think that this makes sense? In the end the centroid read is just a read of that cluster, and has nothing to do with the Average Nucleotide Identity etc.

What you mention about NanoCLUST makes sense, I think. Basically, they take the read that most resembles all the other reads in the cluster and then polish that read based on other reads from the same cluster?

"But would you think that this makes sense? In the end the centroid read is just a read of that cluster, and has nothing to do with the Average Nucleotide Identity etc."

The centroid is not a consensus but a real sequence, with sequencing errors, that can serve as reference for the polishing step.
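To make that concrete, here is a toy Python sketch of why a single error-containing read can be improved using the other reads of its cluster. It is deliberately simplified and is not how Racon or Medaka work internally: it assumes substitution errors only and equal-length reads, so no alignment is needed, and it just takes a per-position majority vote. The function name `polish` is made up for this example.

```python
from collections import Counter


def polish(draft: str, reads: list[str]) -> str:
    """Per-position majority vote over the reads, using the draft as backbone.

    The draft base is only overwritten when another base is strictly
    better supported by the reads at that position.
    """
    polished = []
    for i, draft_base in enumerate(draft):
        counts = Counter(read[i] for read in reads)
        best_base, best_count = counts.most_common(1)[0]
        if counts[draft_base] == best_count:
            best_base = draft_base  # tie or already the majority: keep draft
        polished.append(best_base)
    return "".join(polished)


if __name__ == "__main__":
    truth = "ACGTACGTAC"
    # Every read carries exactly one substitution, each at a different position.
    reads = [
        "TCGTACGTAC",
        "ACGAACGTAC",
        "ACGTACCTAC",
        "ACGTACGTAG",
        "ACGTTCGTAC",
    ]
    draft = reads[0]  # the "centroid": a real read, error included
    result = polish(draft, reads)
    print(result == truth)         # True
    print(truth in reads)          # False: no single read is error-free
```

Because the errors fall at different positions in different reads, each column is still dominated by the correct base, so the polished sequence can be more accurate than any individual read, including the one chosen as the reference.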

To me the strategy used by ONT-AmpSeq doesn't sound so different from the one described in NanoCLUST because the read with the highest average intra-cluster ANI is going to be the centroid or something very near to the centroid.

Thanks Andres, this makes sense now! I now realize I misinterpreted the consensus-building step.
