Question

CD-HIT results without sorting by length

0

5.9 years ago

Anand Rao ▴ 640

Greetings!

AFAIK, CD-HIT requires sorting by length before performing the clustering step. Am I right? If yes, please read on. If not, then there is no question :)

I used CD-HIT based clustering to remove 100% identical sequences, retaining just one representative of each, and then proceeded with an all-by-all BLASTp. This final file is large, ~170 GB.

But I just remembered I did not sort by length initially before the CD-HIT 100% nr step!

With that as context, I have a few questions:

1. A while back, I remember sorting my input by length before the CD-HIT step per se. But now I can't seem to remember whether it was a Perl script, some other executable inside CD-HIT, or something from elsewhere. Can someone help?

2. What happens to the validity of my results if my clustering (at 100% identity) was performed without sorting by length?

I do not mind having a few additional sequences that should not have been there for the BLASTp step, BUT it would be a problem if sequences were removed that should have been retained in the results file (used for BLASTp).

3. Does anyone have advice based on theory or practice (apart from repeating it afresh, ha)? Thanks!

Happy New Year 2019! :)

clustering CD-HIT sort length • 2.3k views

ADD COMMENT • link 5.9 years ago by Anand Rao ▴ 640

1

AFAIK, cd-hit will sort the input itself, so no need to do it yourself prior to running cd-hit.
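For anyone who still wants a length-sorted FASTA file for other reasons, here is a minimal Python sketch of what such a pre-sort amounts to. It is not CD-HIT's own code, and the function names (`read_fasta`, `sort_by_length`, `write_fasta`) are made up for illustration. It reads everything into memory, so it assumes the input fits in RAM.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def sort_by_length(records):
    """Return records ordered longest first (ties keep input order)."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)


def write_fasta(records, handle, width=60):
    """Write records as FASTA, wrapping sequence lines at `width`."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


if __name__ == "__main__":
    # Toy input; real use would pass open file handles instead.
    demo = StringIO(">a\nMK\n>b\nMKTAY\nIAK\n>c\nMKT\n")
    out = StringIO()
    write_fasta(sort_by_length(read_fasta(demo)), out)
    print(out.getvalue(), end="")
```

For real files, replace the `StringIO` objects with `open("input.fasta")` and `open("sorted.fasta", "w")`; both file names are placeholders.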

Unless there are other reasons to cluster them first, I would not make the effort myself and would proceed straight to running the BLAST. In that context, I also advise you to request the tabular output (unless you are already doing so), which saves quite some space in the output file.
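As a rough illustration of the tabular-output suggestion, the sketch below builds a BLAST+ `blastp` command line with `-outfmt 6` (tab-separated hits). The helper name `blastp_tabular_command` and all file and database names are placeholders, and the command is only assembled here, not executed.

```python
def blastp_tabular_command(query, db, out, threads=1):
    """Build an argv list for blastp with tabular output (-outfmt 6).

    All paths are placeholders supplied by the caller; nothing is run here.
    """
    return [
        "blastp",
        "-query", query,          # FASTA file with the query proteins
        "-db", db,                # name of a protein BLAST database
        "-outfmt", "6",           # tabular, one hit per line
        "-out", out,              # where the hit table is written
        "-num_threads", str(threads),
    ]


if __name__ == "__main__":
    cmd = blastp_tabular_command("nr100.fasta", "nr100_db", "all_vs_all.tsv", threads=8)
    print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once BLAST+ is installed and the database has been built.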

ADD REPLY • link 5.9 years ago by lieven.sterck 15k

0

Thank you! It explains why I could not remember the sorting step clearly, or find any utility that performs it explicitly. HNY! Cheers!

ADD REPLY • link 5.9 years ago by Anand Rao ▴ 640

0

lieven.sterck: Apologies for hijacking this thread for a minute. You have been promoted to moderator on Biostars. Please join the Biostars Slack channel as described here: Inviting NEW Biostars moderators to join Biostars slack channel

ADD REPLY • link 5.9 years ago by GenoMax 147k

1

CD-HIT requires no manual sorting.

ADD REPLY • link 5.9 years ago by Joe 21k

1

Thanks for confirming. Cheers and HNY!

ADD REPLY • link 5.9 years ago by Anand Rao ▴ 640