I have a set of proteoform sequences in FASTA format that originate from many proteins of the same organism. I want to remove duplicates at 100% identity while merging the headers of the duplicate sequences into the record that is kept. What is an efficient way to do this without spending hours parsing cd-hit .clstr files?
One tool that appears to do something similar is the mothur cluster.fragments() function, but it has not been implemented for protein sequences.
For example, I clustered the FASTA file with cd-hit like so:
cd-hit -i peptchains.fasta -o peptchains_derep.fasta -c 1.00 -G 0 -aL 1 -AL 0 -aS 1 -AS 0 -n 5 -S 0 -M 600000 -T 0 -d 100
Parsing this output and merging the headers from the .clstr file is not straightforward from here.
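To make the desired result concrete: with 100% identity required over the full length of both sequences, the clusters should, as far as I can tell, be groups of exactly identical sequences. Under that assumption the .clstr file could be skipped altogether by grouping records on the sequence string. Below is a minimal Python sketch of that idea, not a tested pipeline. The function names and the ";" header separator are my own placeholders, and the comparison ignores letter case.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_duplicates(records, sep=";"):
    """Group records by exact sequence and join their headers with sep.

    Groups keep the order in which each sequence was first seen
    (dicts preserve insertion order in Python 3.7+).
    """
    groups = {}
    for header, seq in records:
        groups.setdefault(seq.upper(), []).append(header)
    return [(sep.join(headers), seq) for seq, headers in groups.items()]


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs as FASTA, wrapping sequence lines."""
    for header, seq in records:
        handle.write(">" + header + "\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

Usage would be roughly: open peptchains.fasta, pass the file handle to read_fasta, pass the result to merge_duplicates, and write it out with write_fasta. This only reproduces the cd-hit result if the clusters really are exact duplicates. Also, a separator such as "|" would collide with UniProt-style headers, which is why ";" is used here.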