CD-HIT results without sorting by length
5.9 years ago
Anand Rao ▴ 640

Greetings!

AFAIK, CD-HIT requires sorting by length before performing the clustering step. Am I right? If yes, please read on. If not, then there is no question :)

I used CD-HIT clustering to collapse 100% identical sequences down to one representative each, and then proceeded with an all-by-all BLASTp. The final BLASTp output file is large, about 170 GB.

But I just remembered I did not sort by length initially before the CD-HIT 100% nr step!

With that as context, I have a few questions:

1. A while back, I remember sorting my input by length before the CD-HIT step. But now I can't remember whether it was a Perl script, some other executable shipped with CD-HIT, or something from elsewhere. Can someone help?

2. What happens to the validity of my results if my clustering (at 100% identity) was performed without sorting by length?

I do not mind having a few additional sequences that should not have been there for the BLASTp step, BUT it would be a problem if sequences were removed that should have been retained in the results file (used for BLASTp).

3. Does anyone have advice based on theory or practice (apart from repeating it afresh, ha)? Thanks!

Happy New Year 2019! :)

clustering CD-HIT sort length • 2.3k views

AFAIK, cd-hit will sort the input itself, so no need to do it yourself prior to running cd-hit.
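To make the reasoning concrete, here is a toy sketch of greedy incremental clustering in Python. It is not CD-HIT's actual implementation: it assumes that "100% identity" can be approximated by exact substring containment, and the function name `greedy_cluster` is made up for illustration. It sorts the sequences longest-first, so a shorter sequence can only ever be absorbed by a longer (or equal-length) representative.

```python
def greedy_cluster(sequences):
    """Toy greedy clustering: longest-first, exact containment only.

    Assumption: 'identical' means the sequence occurs verbatim inside
    a representative. Real CD-HIT uses alignment-based identity.
    """
    # Sort longest first; sorted() is stable, so ties keep input order.
    ordered = sorted(sequences, key=len, reverse=True)
    clusters = {}  # representative -> list of members (including itself)
    for seq in ordered:
        for rep in clusters:
            if seq in rep:  # exact substring match against a representative
                clusters[rep].append(seq)
                break
        else:
            # No representative contains this sequence: start a new cluster.
            clusters[seq] = [seq]
    return clusters


# Made-up peptide strings, purely for illustration.
example = ["MKV", "MKVLA", "MKVLA", "GGG"]
result = greedy_cluster(example)
representatives = list(result)
```

With this input, `MKV` and the duplicate `MKVLA` fold into the `MKVLA` cluster, and `GGG` stays on its own. In this toy model, skipping the sort could only leave extra representatives behind (a short sequence seen first is never contained in a longer one that follows); it could not remove a sequence that has no identical match.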

Unless there are other reasons to cluster them first, I would not go to that effort myself and would proceed straight to running BLAST. In that context, I also advise requesting tabular output (unless you are already doing so), which saves quite a lot of space in the output file.
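As a rough illustration of why tabular output is convenient to work with, the sketch below parses BLAST+ tabular lines (`-outfmt 6`) in Python. It assumes the default 12-column layout; the helper name `parse_tabular_line` and the sample hit line are invented for illustration and are not taken from the run discussed here.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_tabular_line(line):
    """Turn one tab-separated hit line into a dict with typed values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    hit = dict(zip(COLUMNS, fields))
    for name in FLOAT_COLUMNS:
        hit[name] = float(hit[name])
    for name in INT_COLUMNS:
        hit[name] = int(hit[name])
    return hit


# Hypothetical hit line, made up for this example.
sample = "seqA\tseqB\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t380"
hit = parse_tabular_line(sample)
# In an all-by-all search, every sequence also hits itself; flag those.
is_self_hit = hit["qseqid"] == hit["sseqid"]
```

For a very large results file, iterating over the file one line at a time and calling the parser on each line avoids loading everything into memory.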

Thank you! It explains why I could not remember the sorting step clearly, or find any utility that performs it explicitly. HNY! Cheers!

lieven.sterck: Apologies for hijacking this thread for a minute. You have been promoted to moderator on Biostars. Please join the Biostars Slack channel as described here: Inviting NEW Biostars moderators to join Biostars slack channel

CD-HIT requires no manual sorting.

Thanks for confirming. Cheers and HNY!
