I have clustered sequences using cd-hit-est and now I want to pull out the parent (representative) sequence of each cluster. Any suggestions?
In the .clstr file, the representative sequence of each cluster is the line that ends with the * symbol, so you can use that to pull it out. I think this script I wrote will pull out the representative sequences: https://github.com/josephhughes/TCRclust/blob/master/sort_cdhit.pl
using:
sort-cdhit.pl -i INFILE.fa -o OUTFILE_rep.fa -clstr INFILE.clstr -rep
Make sure you use the option -d 0 when you run cd-hit, so that the complete identifier is written to the .clstr output file.
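If you would rather not depend on the Perl script, the same idea can be sketched in a few lines of Python. This is only a rough illustration, not what sort_cdhit.pl does internally: the function names `representative_ids` and `filter_fasta` and the toy data are made up for the example, and it assumes the usual .clstr layout where each member line looks like `0<TAB>150nt, >seqA... *`. It also assumes cd-hit was run with -d 0, so the IDs in the .clstr file match the first word of the FASTA headers.

```python
import re

# Assumed .clstr member line layout: "0\t150nt, >seqA... *"
# Only the representative of each cluster ends in "*".
REP_LINE = re.compile(r">(.+?)\.\.\. \*$")


def representative_ids(clstr_lines):
    """Return the sequence IDs found on lines ending in '*'."""
    ids = []
    for line in clstr_lines:
        match = REP_LINE.search(line.rstrip())
        if match:
            ids.append(match.group(1))
    return ids


def filter_fasta(fasta_lines, keep_ids):
    """Keep only FASTA records whose ID (first word of the header) is in keep_ids."""
    keep = set(keep_ids)
    kept = []
    writing = False
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            writing = bool(header) and header[0] in keep
        if writing:
            kept.append(line.rstrip("\n"))
    return kept


# Toy data, invented for illustration only.
clstr = """>Cluster 0
0\t150nt, >seqA... *
1\t148nt, >seqB... at +/98.65%
>Cluster 1
0\t90nt, >seqC... at -/97.78%
1\t120nt, >seqD... *
""".splitlines()

fasta = [">seqA desc", "ACGT", ">seqB", "ACGA", ">seqC", "TTTT", ">seqD", "GGGG", "CCCC"]

reps = representative_ids(clstr)
print(reps)
print("\n".join(filter_fasta(fasta, reps)))
```

For real files you would pass open file handles instead of the toy lists, e.g. `representative_ids(open("INFILE.clstr"))`.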
Yes: grep. Read the user guide; it mentions a pattern you can use to isolate the representative sequences.
There's also the included clstr2txt script, which converts the output into a more parser-friendly format.
@Ram, there is no such pattern mentioned there... sorry if I am missing it.