I am trying to build a machine learning training set of bacterial protein sequences, split into those that are involved in biofilm formation and those that are not. I collected the positive sequences from the Gene Ontology (GO) website, but I am not sure which sequences to include as negatives in my training set.
Can anyone point me to resources for protein sequences that are known not to be capable of forming biofilms?
What do you mean by proteins that form biofilms? Are you trying to find out what the major protein components of a biofilm are? Because bacteria, not proteins, form biofilms.
Essentially, yes, I am trying to detect proteins that are indispensable in forming biofilms. So, I need a negative set of protein sequences which definitely don't have that function.
I think you need to be very careful how you define 'indispensable'. dnaA, for example, is obviously not a biofilm-producing gene, but if you lacked the gene, you wouldn't get a biofilm, because the organism would be non-viable (as it's a required housekeeping gene).
If it's sufficient that they don't have a primary functional role, then you could use standard, so-called housekeeping genes as negatives. These are easy to find in the literature as they're commonly used for negative controls in RT-PCR experiments, e.g. dnaA, gyrB, rpoA, etc.
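If it helps, here is a minimal sketch of how the labelled set might be assembled once you have both groups as FASTA text. It assumes the GO-derived positives and the housekeeping negatives have been exported separately; the function names `read_fasta` and `build_training_set` are hypothetical, and the sequences in the usage part are toy placeholders, not real proteins. Sequences that turn up in both groups are dropped rather than given a label.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_training_set(positive_handle, negative_handle):
    """Return (sequence, label) pairs: 1 = biofilm-associated, 0 = negative.

    Sequences present in both inputs are ambiguous and are excluded.
    """
    positives = {seq for _, seq in read_fasta(positive_handle)}
    negatives = {seq for _, seq in read_fasta(negative_handle)}
    ambiguous = positives & negatives
    rows = [(seq, 1) for seq in sorted(positives - ambiguous)]
    rows += [(seq, 0) for seq in sorted(negatives - ambiguous)]
    return rows


# Toy placeholder records (NOT real protein sequences), for illustration only.
toy_positives = StringIO(">pos1 placeholder\nMKV\nLLA\n>pos2 placeholder\nMSHARED\n")
toy_negatives = StringIO(">neg1 placeholder\nMTTQ\n>neg2 placeholder\nMSHARED\n")

for sequence, label in build_training_set(toy_positives, toy_negatives):
    print(label, sequence)
```

With real data you would pass open file handles instead of the `StringIO` objects. Whether essential genes such as dnaA belong in the negative set still depends on how you define 'indispensable', as discussed above.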