Hi all,
I need to split a protein sequence dataset into training and testing sets such that no sequence in the testing set has a sequence identity above a given threshold to any sequence in the training set. This lets me check that the deep learning function classification model I have written can "transfer" the function to distinct members with the same ground-truth function.
So far I have hierarchically clustered the sequences with cd-hit down to 30% sequence identity, then put the representative sequences into the training set and any cluster member below the MTTSI threshold into the testing set. Is what I am doing correct, and is there any other way to split the sequences into training and testing sets under a defined sequence identity threshold?
Thanks
This could be of interest:
https://www.biorxiv.org/content/10.1101/2024.07.12.603234v1.abstract
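
On the clustering approach in the question: a common alternative to splitting within clusters is to assign whole clusters to either the training or the testing set, so that no cluster straddles the split. Below is a minimal sketch of that idea, assuming the clusters come from a CD-HIT `.clstr` file. The function names `parse_clstr` and `split_by_cluster` and the parameters `test_fraction` and `seed` are made up for illustration and are not part of cd-hit. Note that CD-HIT clusters greedily against representatives, so two non-representative members of different clusters can still exceed the threshold; it is probably worth verifying the final split with an all-vs-all search (for example BLAST or MMseqs2) before trusting it.

```python
import random
import re
from collections import defaultdict

# Member lines in a .clstr file look like: "0\t120aa, >seqA... *"
# The sequence id sits between ">" and the trailing "...".
_MEMBER = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines):
    """Map cluster number -> list of sequence ids from CD-HIT .clstr lines."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            current = int(line.split()[1])
        else:
            match = _MEMBER.search(line)
            if match and current is not None:
                clusters[current].append(match.group(1))
    return dict(clusters)


def split_by_cluster(clusters, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test (hypothetical helper).

    Clusters are shuffled reproducibly, then added to the test set until
    it holds roughly `test_fraction` of all sequences; the rest go to train.
    """
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    n_total = sum(len(members) for members in clusters.values())
    train, test = [], []
    for cid in cluster_ids:
        target = test if len(test) < test_fraction * n_total else train
        target.extend(clusters[cid])
    return train, test
```

Usage would be something like `train, test = split_by_cluster(parse_clstr(open("clusters.clstr")))`, where `clusters.clstr` is a placeholder file name. Because the test set is filled cluster by cluster, its final size can overshoot `test_fraction` when clusters are large.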