I am using CD-HIT-2d to analyse CDSs shared across several genomic regions in 10 genomes. I have obtained the clustering output, which tells me how many species contain each homologous CDS.
For downstream analysis, I need to identify the genomes which contain the CDSs to determine their degree of conservation among the 10 genomes. Currently I am manually analysing these datasets by grouping them according to the number of genome hits (example: CDS shared by 10 genomes, 9 genomes, 8 genomes and so on).
Is there an awk or bash text-manipulation script that could count the number of hits between two cluster headers, then group and export the data accordingly? I have a large number of datasets, so such a script would shorten the analysis time immensely.
Result sample is as shown below:
Thank you very much in advance for any suggestions and help.
Eulnay
Many useful scripts ship with cd-hit, including clstr2txt.pl, which transforms the cluster file into a format that is far easier to parse. I didn't really understand what you want to do, but some of the other scripts that ship with cd-hit might be even better suited to the task. Note that the documentation of these scripts is very poor, and you will generally have to look at the code to see what they do.
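If none of the bundled scripts fits, the raw .clstr file can also be parsed directly. Below is a minimal sketch in Python (not awk or bash) that counts how many distinct genomes appear between consecutive ">Cluster" headers and groups the clusters by that count. It assumes the usual .clstr layout, where each member line holds the sequence name between ">" and "...". It also assumes that the genome can be read from the sequence name as the text before the first "|"; that naming scheme is a guess on my part, since no sample was posted, so genome_of() will need adapting to your real headers. All function names here are made up for illustration.

```python
import re
import sys
from collections import defaultdict

# Matches the sequence name in a member line such as:
#   "0\t300aa, >genomeA|cds_001... *"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def genome_of(seq_name):
    # ASSUMPTION: the genome ID is the text before the first "|".
    # Change this to match how your FASTA headers encode the genome.
    return seq_name.split("|", 1)[0]


def parse_clstr(lines):
    """Return {cluster name: [sequence names]} from .clstr lines."""
    clusters = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">Cluster"):
            current = line[1:].strip()
            clusters[current] = []
        elif current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def group_by_genome_count(clusters):
    """Return {number of genomes: {cluster name: sorted genome IDs}}."""
    groups = defaultdict(dict)
    for name, members in clusters.items():
        genomes = sorted({genome_of(m) for m in members})
        groups[len(genomes)][name] = genomes
    return dict(groups)


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        grouped = group_by_genome_count(parse_clstr(handle))
    # Print clusters from most to least widely shared.
    for count in sorted(grouped, reverse=True):
        print(f"# CDS clusters shared by {count} genome(s)")
        for name, genomes in grouped[count].items():
            print(f"{name}\t{','.join(genomes)}")
```

Saved under a name of your choice and run with the .clstr file as its only argument, this prints one tab-separated line per cluster under a heading for each genome count, which can be redirected to a file. Counting distinct genomes rather than member lines means that paralogs from the same genome are not counted twice.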