cd-hit has given me an output that looks just like the example below.
>Cluster 0
0 496aa, >SRR5892231.2396932... *
>Cluster 1
0 496aa, >SRR5892231.3763255... *
1 390aa, >SRR5892231.1558909... at 91.03%
>Cluster 2
0 496aa, >SRR5892231.1710795... *
>Cluster 3
0 496aa, >SRR5892231.2083014... *
1 464aa, >SRR5892231.14158... at 91.59%
2 423aa, >SRR5892231.1116524... at 94.56%
3 314aa, >SRR5892231.1717279... at 95.86%
4 268aa, >SRR5892231.2309241... at 99.63%
5 371aa, >SRR5892231.480233... at 99.46%
>Cluster 4
0 496aa, >SRR5892231.3954388... *
1 319aa, >SRR5892231.1752373... at 99.69%
>Cluster 5
0 496aa, >SRR5892231.14746... *
>Cluster 6
0 496aa, >SRR5892231.2340653... *
1 407aa, >SRR5892231.2608197... at 100.00%
2 340aa, >SRR5892231.1216749... at 100.00%
3 345aa, >SRR5892231.3205930... at 92.46%
Each line that starts with a > gives the cluster label. Within each cluster, the line that ends with ... * gives the representative sequence for that cluster.
How do I create a CSV file containing each cluster label and its representative sequence?
(note: a cluster that appeared once in this file can re-appear somewhere in the middle of the file even though it is not shown in this case)
Thank you in advance for the help.
You can try the scripts provided by the CD-HIT developers:
https://github.com/weizhongli/cdhit/blob/master/clstr2txt.pl
and
https://github.com/weizhongli/cdhit/blob/master/clstr_select_rep.pl
Thanks for this solution. It works if the representative sequence is on the line immediately after the line starting with >Cluster. I noticed that the representative can instead be the third, fifth, etc. line of a cluster, so it is better to capture the line ending with *. I wrote my own solution in Python (posted here) that does this.
If you can modify your solution using awk, that would be great.
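For reference, here is a minimal Python sketch of the "capture the line ending with *" idea described above. It is not the script linked in this thread. The function names (representatives, write_csv), the CSV column names, and the file name in the usage note are my own assumptions. Results are stored in a dict keyed by cluster label, so a label that re-appears later in the file is still handled. The ID written out is whatever precedes the "..." in the .clstr file, so it is truncated if CD-HIT truncated it.

```python
import csv
import re

# Matches a member line that ends with "... *", e.g.
# "0 496aa, >SRR5892231.2396932... *", and captures the ID before the "...".
REP_RE = re.compile(r">(.+?)\.\.\.\s+\*\s*$")


def representatives(lines):
    """Map each cluster label to the ID on its '*' line, wherever that line sits."""
    reps = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]  # drop the leading ">"
        elif current is not None:
            match = REP_RE.search(line)
            if match:
                reps[current] = match.group(1)
    return reps


def write_csv(reps, path):
    """Write one 'cluster,representative' row per cluster (hypothetical column names)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["cluster", "representative"])
        writer.writerows(reps.items())


# Small in-memory sample; the member order in Cluster 1 is rearranged on purpose
# so that the representative is not the first line of the cluster.
sample = """>Cluster 0
0 496aa, >SRR5892231.2396932... *
>Cluster 1
1 390aa, >SRR5892231.1558909... at 91.03%
0 496aa, >SRR5892231.3763255... *
""".splitlines()

print(representatives(sample))
```

For a real file, something like `with open("clusters.clstr") as fh: write_csv(representatives(fh), "reps.csv")` should work, where both file names are placeholders.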