Creating a 5- or 10-fold cross-validation split based on protein sequence similarity
5.5 years ago
rafi.zon ▴ 10

Hi there,

As the title says, I am looking for a way to build a 5- or 10-fold cross-validation dataset from all human proteins currently available in Swiss-Prot (20,421 proteins).
Ideally, proteins with high sequence identity to one another should end up in the same fold.
How can I divide the proteins into cross-validation folds based on sequence similarity?

proteins cross-validation machine-learning
4.4 years ago
trent • 0

I wrote up how I did this with CD-HIT here; hopefully it's helpful: https://www.trenthauck.com/cd-hit-cross-validation
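
For readers who want the general idea without following the link, here is a minimal sketch in Python. It is not necessarily what the linked post does. It assumes you have already clustered the sequences with CD-HIT at an identity threshold of your choosing and have the resulting `.clstr` text, with member lines of the form `0	350aa, >ID... *`. The function names `parse_clstr` and `assign_folds` and the toy IDs `P1` to `P5` are made up for illustration. Whole clusters are placed into folds greedily, largest first, so similar proteins stay together and fold sizes stay roughly balanced.

```python
from collections import defaultdict

def parse_clstr(text):
    """Map each sequence ID to its cluster number (assumed CD-HIT .clstr layout)."""
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 12"
            current = int(line.split()[1])
        else:
            # Member line, e.g. "0\t350aa, >P1... *": take the text between ">" and "..."
            seq_id = line.split(">", 1)[1].split("...", 1)[0]
            clusters[seq_id] = current
    return clusters

def assign_folds(clusters, k=5):
    """Greedily place whole clusters into k folds, largest cluster first."""
    members = defaultdict(list)
    for seq_id, cluster_id in clusters.items():
        members[cluster_id].append(seq_id)
    folds = [[] for _ in range(k)]
    # Sort by descending size, then by cluster number for a deterministic order.
    for cluster_id in sorted(members, key=lambda c: (-len(members[c]), c)):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[cluster_id])
    return folds

# Toy input (hypothetical IDs); real input would be the contents of the .clstr file.
example = (
    ">Cluster 0\n"
    "0\t350aa, >P1... *\n"
    "1\t348aa, >P2... at 95.00%\n"
    ">Cluster 1\n"
    "0\t120aa, >P3... *\n"
    ">Cluster 2\n"
    "0\t200aa, >P4... *\n"
    "1\t198aa, >P5... at 91.00%\n"
)

if __name__ == "__main__":
    for i, fold in enumerate(assign_folds(parse_clstr(example), k=2)):
        print(i, fold)
```

The greedy step does not guarantee equal fold sizes, particularly when a few clusters are very large. It is also worth checking that the sequence IDs in your `.clstr` file are not truncated before you map them back to the original FASTA records.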
