5.5 years ago
rafi.zon
Hi there,
As the post title states, I am trying to find a way to construct a 5- or 10-fold cross-validation dataset from all of the human proteins currently available in Swiss-Prot (20,421 proteins).
Ideally, each fold should contain the proteins that are most similar to one another in terms of sequence identity.
What would be a good way to divide the proteins into the respective cross-validation sets based on similarity?
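
To make the goal concrete, here is a minimal sketch of what I have in mind, assuming the proteins have already been grouped by sequence identity with a clustering tool (CD-HIT or MMseqs2, for example). The function name `assign_folds`, the `protein_to_cluster` mapping, and the toy IDs are all hypothetical. The greedy balancing is just one possible way to spread clusters over folds, not an established method.

```python
from collections import defaultdict

def assign_folds(protein_to_cluster, n_folds=5):
    """Put whole clusters into folds so similar proteins stay together."""
    # Group protein IDs by their (precomputed, assumed) cluster label.
    clusters = defaultdict(list)
    for protein, cluster in protein_to_cluster.items():
        clusters[cluster].append(protein)

    folds = [[] for _ in range(n_folds)]
    # Largest clusters first; each one goes into the currently smallest fold,
    # which keeps the fold sizes roughly balanced.
    for members in sorted(clusters.values(), key=len, reverse=True):
        smallest = min(folds, key=len)
        smallest.extend(members)
    return folds

# Toy input with hypothetical protein IDs and cluster labels.
example = {"P1": "c1", "P2": "c1", "P3": "c2",
           "P4": "c3", "P5": "c3", "P6": "c4"}
for i, fold in enumerate(assign_folds(example, n_folds=2)):
    print(i, fold)
```

Because a cluster is never split, proteins that share a cluster always land in the same fold. The fold sizes are only approximately equal when cluster sizes vary a lot.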