8.2 years ago
nafizh
I have some bacterial DNA sequences that form a positive training set, defined by a specific function they share. My negative training set would be anything that does not fall into this positive category. But I have no surefire way of knowing whether a given sequence falls into the positive category or not. So how can I get sequences for a negative training set? Can BLAST be used to find sequences that are completely unrelated to mine? Are there any other methods I could use?
It is not clear what you mean by "some specific function of theirs". Would that be a gene coding for something specific, or a motif?
You could use synthetically generated sequences that are all but certain not to be positive, for instance by shuffling the DNA sequences from the positive training set.
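A minimal sketch of the shuffling idea, assuming plain Python and sequences held as strings. The function names and the two example sequences are made-up placeholders, not real bacteriocin genes. Note that this is a simple mononucleotide shuffle: it keeps the base composition of each sequence but destroys higher-order structure such as codons, so a classifier trained on it may partly learn "real versus shuffled" rather than the function itself.

```python
import random


def shuffle_sequence(seq, rng):
    """Return a copy of seq with its bases in random order (same composition)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def make_shuffled_negatives(positives, seed=0):
    """Build one shuffled negative per positive sequence (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the negative set is reproducible
    return [shuffle_sequence(seq, rng) for seq in positives]


# Placeholder sequences for illustration only.
positives = ["ATGGCTAAAGGTTTCCGT", "ATGAAACGTCTGGCTTAA"]
negatives = make_shuffled_negatives(positives)
```

Each negative has the same length and base counts as the positive it was derived from, which keeps trivial features such as GC content from separating the two classes.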
I have a set of experimentally verified sequences from bacteria that produce bacteriocins. That is my positive set. But I have no concrete evidence for what constitutes a negative, i.e. sequences that do not produce bacteriocins.
You could take rRNA sequences (16S) or genes for enzymes of the glycolysis pathway. They are unlikely to have anything to do with bacteriocin production.
But then what happens if my test set has sequences from areas other than the ones you mentioned?
I thought that was what you were looking for (sequences that are totally different from your positive set)? Or am I missing something?
Sorry, maybe I was not clear. My question was: during testing, what if the test set contains negative sequences from different areas than the negative sequences in the training set? Can the classifier still separate the positives from the negatives? Please let me know if you want me to clarify anything.
Negative sequences should be just that (not related to bacteriocin production). It should not matter what area they come from if the function is all you are interested in, correct?
Yeah, I am only interested in finding out which ones belong to the positive set from new sequences.
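One way to check the concern raised above (negatives at test time coming from a different area than the training negatives) is to hold out one source of negatives at a time and test on it. The following is only a sketch of the splitting step under that assumption. The source names, the `leave_one_source_out` helper, and the `seq_*` entries are hypothetical placeholders, and the classifier itself is left out.

```python
def leave_one_source_out(negatives_by_source):
    """Yield (held_out_source, train_negatives, test_negatives) per source.

    For each source, its negatives form the test set and the negatives
    from all other sources form the training set.
    """
    for held_out in sorted(negatives_by_source):
        train = [
            seq
            for source, seqs in negatives_by_source.items()
            if source != held_out
            for seq in seqs
        ]
        test = list(negatives_by_source[held_out])
        yield held_out, train, test


# Placeholder data: source label mapped to its negative sequences.
negatives_by_source = {
    "16S_rRNA": ["seq_a", "seq_b"],
    "glycolysis": ["seq_c"],
    "shuffled": ["seq_d", "seq_e"],
}
splits = list(leave_one_source_out(negatives_by_source))
```

If performance drops sharply on a held-out source, the classifier has probably learned features of the training negatives rather than features of the positive set.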