Greetings!
AFAIK, CD-HIT requires sorting by length before performing the clustering step. Am I right? If yes, please read on. If not, then there is no question :)
I used CD-HIT clustering to remove 100% identical sequences, retaining just one representative of each, and then proceeded with an all-by-all BLASTp. The resulting file is large (~170 GB).
But I just remembered I did not sort by length initially before the CD-HIT 100% nr step!
With that as context, I have a few questions:
1. A while back, I remember sorting my input by length before the CD-HIT step. But now I can't remember whether it was a Perl script, some other executable inside CD-HIT, or something from elsewhere. Can someone help?
2. What happens to the validity of my results if my clustering (at 100% identity) was performed without sorting by length?
I do not mind having a few additional sequences that should not have been there for the BLASTp step, BUT it would be a problem if sequences were removed that should have been retained in the results file (used for BLASTp).
3. Does anyone have advice based on theory or practice (apart from repeating it all afresh, ha)? Thanks!
Happy New Year 2019! :)
AFAIK, cd-hit will sort the input itself, so no need to do it yourself prior to running cd-hit.
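To illustrate why an internal length sort makes the input order irrelevant, here is a minimal Python sketch of greedy, longest-first clustering. It is not CD-HIT's actual code: it assumes "100% identity" can be approximated by exact substring containment, and the function name `cluster_identical`, the tie-breaking rule, and the toy sequences are all made up for the example.

```python
def cluster_identical(records):
    """Greedy longest-first clustering of (name, sequence) pairs.

    Simplification (assumption): a sequence joins a cluster when it is an
    exact substring of that cluster's representative. CD-HIT itself uses
    word filtering and alignment, which this sketch does not reproduce.
    """
    # Sort internally by decreasing length. Ties are broken by sequence and
    # then name (a hypothetical rule) so the outcome ignores input order.
    ordered = sorted(records, key=lambda r: (-len(r[1]), r[1], r[0]))

    representatives = []  # (name, sequence) of each cluster representative
    clusters = {}         # representative name -> list of member names

    for name, seq in ordered:
        for rep_name, rep_seq in representatives:
            if seq in rep_seq:
                # Fully contained in a longer (or equal) representative.
                clusters[rep_name].append(name)
                break
        else:
            # No representative contains it, so it starts a new cluster.
            representatives.append((name, seq))
            clusters[name] = [name]

    return representatives, clusters


# Toy input: s2 and s3 are identical, s1 is contained in both.
records = [
    ("s1", "MKV"),
    ("s2", "MKVLAAGG"),
    ("s3", "MKVLAAGG"),
    ("s4", "TTPQ"),
]

reps_forward, clusters_forward = cluster_identical(records)
reps_reversed, clusters_reversed = cluster_identical(list(reversed(records)))

rep_names = [name for name, _seq in reps_forward]
print(rep_names)
print(reps_forward == reps_reversed)
```

Because the sort happens inside the function, shuffling the input gives the same representatives, which is the behaviour described above for cd-hit.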
Unless there are other reasons to cluster them first, I would not bother doing so myself and would proceed straight to running BLAST. In that context, I also advise requesting the tabular output (unless you are already doing so), which saves quite a lot of space in the output file.
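In BLAST+ the tabular format is requested with `-outfmt 6`, which by default writes 12 tab-separated columns per hit. As a rough sketch of how such a file could be read one line at a time (useful when the output runs to many GB), something like the following Python might do. The function name `parse_blast6` and the sample line are made up for illustration, and the column list assumes the default `-outfmt 6` layout with no custom column specifiers.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast6(handle):
    """Yield one dict per hit from a tab-separated BLAST result handle.

    Rows are streamed, so the whole file never has to fit in memory.
    """
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        # Skip blank lines and comment lines (e.g. from -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(BLAST6_COLUMNS, row))
        # Convert the numeric fields most often used for filtering.
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        yield hit


# Made-up example line standing in for a real results file.
sample = "seqA\tseqB\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-120\t410\n"
hits = list(parse_blast6(io.StringIO(sample)))
print(hits[0]["qseqid"], hits[0]["sseqid"], hits[0]["pident"])
```

For a real file, the `io.StringIO(sample)` handle would be replaced by `open(...)` on the results file, and the hits filtered inside the loop rather than collected into a list.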
Thank you! It explains why I could not remember the sorting step clearly, or find any utility that performs it explicitly. HNY! Cheers!
lieven.sterck: Apologies for hijacking this thread for a minute. You have been promoted to moderator on Biostars. Please join the Biostars Slack channel as described here: Inviting NEW Biostars moderators to join Biostars slack channel
CD-HIT requires no manual sorting.
Thanks for confirming. Cheers and HNY!