How to get a negative training set if I don't know what constitutes a negative?

8.2 years ago

nafizh • 0

I have some bacterial DNA sequences that form a positive training set, defined by a specific function they share. My negative training set would be anything that does not fall into this positive category. However, I have no sure-fire way of knowing whether a given sequence falls into the positive category or not. So how can I get sequences for a negative training set? Can BLAST be used to find sequences that are completely unrelated to mine? Are there any other methods I can use?

machine-learning dna-sequence blast • 1.9k views

ADD COMMENT • link updated 12 months ago by Ram 44k • written 8.2 years ago by nafizh • 0

It is not clear what you mean by "some specific function of theirs". Would that be a gene coding for something specific, or a motif?

You could use synthetically generated sequences, which are bound not to be positive.

ADD REPLY • link 8.2 years ago by GenoMax 147k

For instance, by shuffling the DNA sequences from the positive training set.
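A minimal sketch of the shuffling idea in Python, assuming the positives are plain strings of bases. The function names and the toy sequences are made up for illustration. Note that a simple per-base shuffle keeps the base composition and length of each sequence but destroys higher-order signal such as codon structure, so a classifier might separate such decoys from real genes too easily.

```python
import random


def shuffle_sequence(seq, rng):
    """Return a random permutation of seq (same length and base composition)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def make_shuffled_negatives(positives, seed=0):
    """Build one shuffled decoy per positive sequence (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the decoys are reproducible
    return [shuffle_sequence(seq, rng) for seq in positives]


# Toy sequences for illustration only, not real bacteriocin genes.
positives = ["ATGGCGTACGTTAGC", "ATGAAACCCGGGTTT"]
negatives = make_shuffled_negatives(positives)
```

Each decoy in `negatives` lines up with the positive it was derived from, so the two classes stay balanced in size.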

ADD REPLY • link 8.2 years ago by Carlo Yague 8.9k

I have a set of experimentally verified sequences from bacteria that produce bacteriocins. That is my positive set. But I have no concrete evidence for what constitutes a negative, i.e. sequences that do not produce bacteriocins.

ADD REPLY • link 8.2 years ago by nafizh • 0

You could take rRNA (16S) sequences or enzymes from the glycolysis pathway. They are not likely to have anything to do with bacteriocin production.
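Since the original question asked about BLAST, here is a rough sketch of how candidate negatives like these could be screened against the positive set. It assumes the candidates were already searched against the positives with BLAST and the result saved in the tabular format (`-outfmt 6`), where the first column is the query ID and the eleventh is the e-value. The function names, the e-value cutoff, the IDs, and the example hit line are all made up for illustration. Having no BLAST hit only shows a lack of detectable similarity, not a lack of bacteriocin function.

```python
def queries_with_hits(blast_tabular, max_evalue=1e-3):
    """Collect query IDs with at least one hit at or below max_evalue."""
    hit_ids = set()
    for line in blast_tabular.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            hit_ids.add(query_id)
    return hit_ids


def filter_negatives(candidates, blast_tabular, max_evalue=1e-3):
    """Keep candidates whose ID has no significant hit to the positive set."""
    related = queries_with_hits(blast_tabular, max_evalue)
    return {name: seq for name, seq in candidates.items() if name not in related}


# Toy data for illustration; the cutoff of 1e-3 is an arbitrary assumption.
candidates = {"rrna_16S_1": "ACGTACGTAC", "glyco_pgk_1": "TTGACCGGTA"}
blast_tabular = (
    "glyco_pgk_1\tbacteriocin_7\t91.2\t120\t10\t0\t1\t120\t5\t124\t2e-30\t150\n"
)
kept = filter_negatives(candidates, blast_tabular)
```

Here `kept` contains only the candidate that had no hit against the positives.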

ADD REPLY • link 8.2 years ago by GenoMax 147k

But then what happens if my test set contains sequences from areas other than the ones you mentioned?

ADD REPLY • link 8.2 years ago by nafizh • 0

I thought that was what you were looking for (sequences that are totally different from your positive set). Or am I missing something?

ADD REPLY • link 8.2 years ago by GenoMax 147k

Sorry, maybe I was not clear. My question was: what if, during testing, the test set contains negative sequences from different areas than the negative sequences in the training set? Can the classifier still distinguish the positives from the negatives? Please let me know if you want me to clarify anything.
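One way to probe this concern empirically would be to hold out all negatives from one source during training and use them only for testing. The sketch below just produces such splits; the function name, the source labels, and the toy sequences are assumptions for illustration, and the classifier itself is left out.

```python
def leave_one_source_out(negatives_by_source):
    """Yield (held_out_source, train_negatives, test_negatives) for each source."""
    for held_out in sorted(negatives_by_source):
        # Training negatives come from every source except the held-out one.
        train = [
            seq
            for source, seqs in sorted(negatives_by_source.items())
            if source != held_out
            for seq in seqs
        ]
        test = list(negatives_by_source[held_out])
        yield held_out, train, test


# Toy negatives grouped by where they came from (illustrative only).
negatives_by_source = {
    "16S_rRNA": ["ACGTACGT", "ACGTTCGT"],
    "glycolysis": ["TTGACCGG"],
    "shuffled": ["GTCAGTCA"],
}
splits = list(leave_one_source_out(negatives_by_source))
```

If performance drops sharply on the held-out source, that would suggest the classifier has learned what the training negatives look like rather than what the positives look like.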

ADD REPLY • link 8.2 years ago by nafizh • 0

Negative sequences should be just that (not related to bacteriocin production). It should not matter what area they come from if the function is all you are interested in, correct?

ADD REPLY • link 8.2 years ago by GenoMax 147k

Yeah, I am only interested in finding out which new sequences belong to the positive set.

ADD REPLY • link 8.1 years ago by nafizh • 0