Hi,
With the CD-HIT command cd-hit-est it is possible to form sequence clusters. Per cluster, one "representative" sequence is generated, as stated on the CD-HIT website:
... and produces a set of 'non-redundant' (nr) representative sequences as output.
Is such an nr representative sequence the same as a consensus sequence in CD-HIT? I want to use cd-hit-est to cluster Nanopore amplicon sequence data.
The NCBI website "https://www.ncbi.nlm.nih.gov/mesh?Db=mesh&Cmd=DetailsSearch&Term=%22Consensus+Sequence%22%5BMeSH+Terms%5D" calls a consensus sequence a representative sequence. However, I'd like to know if CD-HIT also defines a representative sequence as a consensus sequence.
Any ideas? Thank you.
To the best of my knowledge, a representative sequence is not the same as a consensus (in the world of CD-HIT at least). A representative sequence is an actual sequence from that cluster, meaning that all the other sequences within that cluster are within some edit distance of the representative. I forget how CD-HIT chooses its representatives (it might just be the longest or the first sequence in order, etc.).
As CD-HIT sorts and then processes the sequences from longest to shortest, it is both: each cluster's representative sequence is the longest sequence in that cluster and also the first to enter it.
@Joe and @h.mon. Thank you for the valuable information. So if I understand it correctly: CD-HIT-EST first sorts all sequences by length. Then the first (and thus longest) sequence in that sorted list becomes the representative sequence of cluster 1. Then the 2nd sequence is evaluated. If this 2nd sequence is within the specified distance of that representative (the identity threshold, parameter "-c"), it is assigned to cluster 1. If this 2nd sequence differs by more than the specified distance, it becomes the representative sequence of cluster 2? And so on for the third, fourth, ..., n-th sequence?
That would be my assumption, yep :)
In the fast mode - which is the default - yes, that is precisely what is being done. In accurate mode, a sequence is compared to all representative sequences, and is added to the most similar one. From the wiki:
You may check the wiki if you have further questions; it has a lot of information.
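To make the difference between the two modes concrete, here is a minimal Python sketch of the greedy procedure described in this thread. It is only an illustration, not CD-HIT's actual code: the function names `identity` and `greedy_cluster` are made up for this example, and the similarity measure is a toy stand-in built on Python's `difflib` (matched characters divided by the length of the shorter sequence), whereas CD-HIT uses its own word filtering and alignment.

```python
from difflib import SequenceMatcher


def identity(rep, seq):
    """Toy identity: matched characters / length of the shorter sequence.

    Assumption: a stand-in for CD-HIT's alignment-based identity, not the real thing.
    """
    matcher = SequenceMatcher(None, rep, seq, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(rep), len(seq))


def greedy_cluster(seqs, threshold=0.9, accurate=False):
    """Greedy incremental clustering; returns a list of (representative, members)."""
    # Longest first; sorted() is stable, so ties keep their input order.
    ordered = sorted(seqs, key=len, reverse=True)
    clusters = []
    for seq in ordered:
        hits = []
        for idx, (rep, _members) in enumerate(clusters):
            score = identity(rep, seq)
            if score >= threshold:
                hits.append((score, idx))
                if not accurate:
                    break  # fast mode: join the first representative that passes
        if hits:
            # Accurate mode: most similar representative (earliest cluster on ties).
            _score, idx = max(hits, key=lambda h: (h[0], -h[1]))
            clusters[idx][1].append(seq)
        else:
            # No representative is close enough: seq starts a new cluster.
            clusters.append((seq, [seq]))
    return clusters


if __name__ == "__main__":
    reads = ["AAAAAAAACC", "AAAAGGGGG", "AAAAGGGC"]  # made-up toy sequences
    for mode in (False, True):
        label = "accurate" if mode else "fast"
        for rep, members in greedy_cluster(reads, threshold=0.6, accurate=mode):
            print(label, rep, members)
```

With these toy reads, the shortest sequence passes the threshold against both representatives, so the fast variant assigns it to the first cluster while the accurate variant assigns it to the second, more similar one. In both variants every representative is a real input sequence, never a computed consensus.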