How to split protein sequence data into training and test sets based on a sequence identity cutoff?

4 months ago

Nilavrah • 0

Hi all,

I need to split a protein sequence dataset into training and test sets such that no sequence in the test set has a sequence identity above a certain threshold with any sequence in the training set. This is to make sure that the deep learning function classification model I have written can "transfer" a function to distinct members that share the same ground-truth function.

So far I have hierarchically clustered the sequences with cd-hit down to 30% sequence identity, put the representative sequences into the training set, and put any cluster member that falls below the MTTSI threshold into the test set. Is this approach correct, and is there any other way to split the sequences into training and test sets under a defined sequence identity threshold?
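For comparison, here is a minimal sketch of a different, commonly used variant: instead of splitting within clusters, whole clusters are assigned to either the training or the test set. It is not the procedure described above. It assumes the clustering is available in CD-HIT's `.clstr` text format; the function names `parse_clstr` and `split_by_cluster`, the sequence IDs, and the `test_fraction` parameter are hypothetical and chosen only for illustration.

```python
import random
import re


def parse_clstr(lines):
    """Map cluster number -> list of sequence IDs from CD-HIT .clstr lines."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 0"
            current = int(line.split()[1])
            clusters[current] = []
        else:
            # Member line, e.g. "0  120aa, >seqA... *"
            match = re.search(r">(.+?)\.\.\.", line)
            if match and current is not None:
                clusters[current].append(match.group(1))
    return clusters


def split_by_cluster(clusters, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test; no cluster is ever split."""
    rng = random.Random(seed)
    cluster_ids = sorted(clusters)
    rng.shuffle(cluster_ids)
    total = sum(len(members) for members in clusters.values())
    target = test_fraction * total
    train, test = [], []
    for cid in cluster_ids:
        # Fill the test set until it reaches roughly the requested size.
        if len(test) < target:
            test.extend(clusters[cid])
        else:
            train.extend(clusters[cid])
    return train, test


# Made-up example input in .clstr layout (IDs and values are illustrative).
example = (
    ">Cluster 0\n"
    "0\t120aa, >seqA... *\n"
    "1\t118aa, >seqB... at 45.00%\n"
    ">Cluster 1\n"
    "0\t200aa, >seqC... *\n"
    ">Cluster 2\n"
    "0\t150aa, >seqD... *\n"
    "1\t149aa, >seqE... at 62.50%\n"
).splitlines()

clusters = parse_clstr(example)
train, test = split_by_cluster(clusters, test_fraction=0.4, seed=1)
print("train:", train)
print("test:", test)
```

Because CD-HIT clusters greedily and compares members to the cluster representative, two sequences in different clusters can still occasionally exceed the threshold, so an all-vs-all identity check between the final training and test sets is a worthwhile sanity check on any split produced this way.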

Thanks

clustering protein • 395 views

ADD COMMENT • link updated 4 months ago by Mensur Dlakic ★ 28k • written 4 months ago by Nilavrah • 0

This could be of interest:

https://www.biorxiv.org/content/10.1101/2024.07.12.603234v1.abstract

ADD REPLY • link 4 months ago by Mensur Dlakic ★ 28k
