Question

Creating a 5- or 10-fold cross-validation split based on sequence similarity of proteins

6.2 years ago

rafi.zon ▴ 10

Hi there,

As the post title states, I am trying to find an approach for constructing a 5- or 10-fold cross-validation dataset from all of the human proteins currently available in Swiss-Prot (20,421 proteins).
Ideally, each fold should contain the proteins that are most similar to one another in terms of sequence identity.
How can I divide the proteins into the respective cross-validation sets based on similarity?

proteins cross-validation machine-learning • 1.3k views

updated 5.1 years ago by trent • written 6.2 years ago by rafi.zon

score 0 · Answer 1 · 2020-07-03

5.1 years ago

trent • 0

I wrote up how I did this with CD-HIT; hopefully it's helpful: https://www.trenthauck.com/cd-hit-cross-validation
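
In case it helps to see the general idea without following the link, here is a minimal sketch of one way to turn CD-HIT clusters into folds. It is not necessarily what the linked write-up does. It assumes you have already clustered the sequences with CD-HIT at an identity threshold of your choosing and have the resulting `.clstr` file. The function names `parse_clstr` and `assign_folds`, the greedy size-balancing rule, and the sequence IDs in the example are all made up for illustration.

```python
import re
from collections import defaultdict


def parse_clstr(text):
    """Map each sequence ID to its cluster number from CD-HIT .clstr text."""
    cluster_of = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line such as ">Cluster 12"
            current = int(line.split()[1])
        else:
            # Member line such as "0  350aa, >protA... *"
            match = re.search(r">(.+?)\.\.\.", line)
            if match and current is not None:
                cluster_of[match.group(1)] = current
    return cluster_of


def assign_folds(cluster_of, n_folds=5):
    """Place whole clusters into folds, largest first, always into the
    currently smallest fold (a simple greedy heuristic, an assumption here)."""
    members = defaultdict(list)
    for seq_id, cluster in cluster_of.items():
        members[cluster].append(seq_id)

    fold_sizes = [0] * n_folds
    fold_of = {}
    # Sort by descending cluster size; ties broken by cluster number.
    for cluster in sorted(members, key=lambda c: (-len(members[c]), c)):
        target = fold_sizes.index(min(fold_sizes))
        for seq_id in members[cluster]:
            fold_of[seq_id] = target
        fold_sizes[target] += len(members[cluster])
    return fold_of


# Toy .clstr content with placeholder IDs (not real Swiss-Prot entries).
example = """>Cluster 0
0\t350aa, >protA... *
1\t348aa, >protB... at 92.50%
>Cluster 1
0\t120aa, >protC... *
>Cluster 2
0\t500aa, >protD... *
1\t498aa, >protE... at 95.00%
2\t470aa, >protF... at 90.10%
"""

clusters = parse_clstr(example)
folds = assign_folds(clusters, n_folds=2)
for seq_id in sorted(folds):
    print(seq_id, folds[seq_id])
```

Because clusters are never split, sequences that CD-HIT grouped together always end up in the same fold, which is the property asked for in the question. One thing to watch for: by default CD-HIT truncates sequence names in the `.clstr` file, so long IDs may need the `-d 0` option to come through intact. If you already use scikit-learn, passing the cluster numbers as `groups` to `GroupKFold` is another way to get the same "keep clusters together" behaviour.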
