How to split a protein sequence dataset into training and testing sets based on a sequence identity cutoff?
3 months ago
Nilavrah • 0

Hi all,

I need to split a protein sequence dataset into training and testing sets such that no sequence in the testing set has a sequence identity above a given threshold with any sequence in the training set. The goal is to check that the deep learning function classification model I have written can "transfer" the function to distinct members with the same ground-truth function.
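
To make the constraint concrete, here is a minimal Python sketch of the check I have in mind. The function names are hypothetical, and `ungapped_identity` is only a toy placeholder (it compares positions without aligning); in practice the identity would come from a real aligner.

```python
from itertools import product


def ungapped_identity(a, b):
    """Toy placeholder: matching positions divided by the longer length.

    This is NOT alignment-based identity; it only stands in for whatever
    identity measure is actually used.
    """
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def max_train_test_identity(train, test, identity=ungapped_identity):
    """Highest identity between any training and any testing sequence."""
    return max((identity(s, t) for s, t in product(train, test)), default=0.0)


def split_is_valid(train, test, threshold, identity=ungapped_identity):
    """True if no train/test pair exceeds the identity threshold."""
    return max_train_test_identity(train, test, identity) <= threshold
```

The check is quadratic in the number of sequences, so for a large dataset it would probably need a fast search tool instead of all-vs-all comparison.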

So far, I have hierarchically clustered the sequences with cd-hit down to 30% sequence identity. I then put the representative sequences in the training set, and any cluster member with a sequence identity below the MTTSI threshold goes into the testing set. Is this approach correct, and is there any other way to split the sequences into training and testing sets at a defined sequence identity threshold?
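
One alternative I am considering is to assign whole clusters, rather than individual members, to either set. Below is a minimal sketch under some assumptions: `seq_to_cluster` is a hypothetical dictionary (sequence ID to cluster ID, both strings) that I would build by parsing the clustering output myself, and `split_by_cluster` is my own helper, not part of any tool.

```python
import random
from collections import defaultdict


def split_by_cluster(seq_to_cluster, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test so no cluster spans both.

    seq_to_cluster: dict mapping sequence ID -> cluster ID (assumed input).
    Returns (train_ids, test_ids).
    """
    # Group sequence IDs by their cluster.
    clusters = defaultdict(list)
    for seq_id, cluster_id in seq_to_cluster.items():
        clusters[cluster_id].append(seq_id)

    # Shuffle cluster order reproducibly.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    # Fill the test set cluster by cluster until it reaches the target size;
    # every remaining cluster goes to the training set.
    target = test_fraction * len(seq_to_cluster)
    train, test = [], []
    for cluster_id in cluster_ids:
        if len(test) < target:
            test.extend(clusters[cluster_id])
        else:
            train.extend(clusters[cluster_id])
    return train, test
```

As far as I understand, greedy clustering only bounds each member's identity to its own representative, so two members of different clusters could still exceed the threshold. I assume the resulting split would still need a train-versus-test search to confirm that no pair is above the cutoff.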

Thanks

clustering protein • 358 views