How can we tell whether paralogous sequences were removed by running CD-HIT? How can we identify the paralogous sequences from the output text file that lists the clusters?
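CD-HIT is a sequence clustering tool: it simply clusters the sequences based on the sequence identity threshold specified with -c. If the paralogous sequences fall within that threshold, they are clustered together, with the longest sequence chosen as the representative of the cluster. In the .clstr output file, the representative of each cluster is the line ending in "*", and the other members are listed with their percent identity to it. CD-HIT itself does not label anything as a paralog; it only reports which sequences are similar enough to share a cluster.
The CD-HIT GitHub page provides a number of scripts to parse the standard clustering output.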
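If you would rather inspect the clusters yourself, here is a minimal Python sketch that reads the standard .clstr layout (a ">Cluster N" header followed by one line per member, with "*" marking the representative). It is not one of the official CD-HIT scripts; the function names parse_clstr and multi_member_clusters and the sample IDs protA, protB, protC are made up for illustration. It assumes the sequence IDs in the .clstr file were not truncated in a way that makes them ambiguous.

```python
import re

# Matches a member line of a .clstr file, for example:
#   0<TAB>350aa, >protA... *
#   1<TAB>342aa, >protB... at 92.40%
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s+(\*|at\s+.+)$")


def parse_clstr(lines):
    """Return {cluster_name: {"representative": id, "members": [ids]}}."""
    clusters = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">Cluster"):
            # New cluster header, e.g. ">Cluster 0"
            current = line[1:]
            clusters[current] = {"representative": None, "members": []}
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            continue  # skip anything that does not look like a member line
        _length, seq_id, status = match.groups()
        clusters[current]["members"].append(seq_id)
        if status == "*":
            clusters[current]["representative"] = seq_id
    return clusters


def multi_member_clusters(clusters):
    """Keep only clusters with more than one sequence."""
    return {name: c for name, c in clusters.items() if len(c["members"]) > 1}


# Hypothetical sample in the .clstr layout, only to show usage.
SAMPLE = (
    ">Cluster 0\n"
    "0\t350aa, >protA... *\n"
    "1\t342aa, >protB... at 92.40%\n"
    ">Cluster 1\n"
    "0\t210aa, >protC... *\n"
)

clusters = parse_clstr(SAMPLE.splitlines())
for name, cluster in multi_member_clusters(clusters).items():
    removed = [m for m in cluster["members"] if m != cluster["representative"]]
    print(name, "kept:", cluster["representative"], "collapsed:", removed)
```

For a real run, replace SAMPLE with the lines of your own file, e.g. parse_clstr(open("output.clstr")), where the file name is whatever you passed to -o plus ".clstr". Clusters with more than one member are the ones where sequences were collapsed into a representative; whether those members are really paralogs is something you still have to judge, since sharing a cluster only means they passed the identity threshold.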
Alright, I had given the -c argument specifying the sequence identity threshold. Can you tell me what the next step is? My next step is to run blastp against the human genome to get the non-homologous sequences. How do I relate the CD-HIT output to the blastp step?
Could you clarify what you are trying to do? If it is unrelated to the CD-HIT question posted above, please create a new post explaining the aim.