Hi,
I am working on a classification problem using sequence data. My positive data belongs to a gene region. My negative data was selected based on the most common 5 nucleotides in the center. My model seems to be overfitting: it gives very high accuracy, and I am not convinced that I have chosen my negative data correctly. I was wondering whether any machine learning experts in bioinformatics could offer some wisdom or point me to a best-practices paper. Would it be a better solution to BLAST the positive sequence FASTA database against the rest of the gene regions and select the top x matches that have the motif?
I am using an Inception-style CNN on one-hot encoded sequences for training and prediction. My problem is that my data is imbalanced and the model trained on the current data set is biased. Above are the WebLogos of the positive and negative sequences.
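For readers unfamiliar with the input format, one-hot encoding of a nucleotide sequence can be sketched roughly as below. This is only an illustration, not the actual pipeline used here: the `one_hot` name, the A/C/G/T column order, and the choice to encode ambiguous bases such as N as an all-zero row are all assumptions.

```python
# Hypothetical helper, for illustration only.
# Column order A, C, G, T is an assumption; any fixed order works.
ALPHABET = "ACGT"

def one_hot(seq):
    """Return one 4-element row per base.

    Bases outside ACGT (e.g. N) become an all-zero row (an assumption).
    """
    rows = []
    for base in seq.upper():
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows

encoded = one_hot("ACGTN")
for base, row in zip("ACGTN", encoded):
    print(base, row)
```

Each sequence becomes a (length x 4) matrix, which is the kind of input a 1D convolutional layer would typically scan.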
I feel that maybe there is a better way to choose negative data. Currently I take sequences from the gene that do not overlap the positive mRNA fragments. Is there a better approach to negative data selection?
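To make the current approach concrete, here is a minimal sketch of sampling negative windows that do not overlap the positive fragments. Everything in it is an assumption for illustration: the function names, the half-open `(start, end)` coordinates, the fixed window length, and the extra rule that negatives should not overlap each other.

```python
import random

def overlaps(start, end, intervals):
    """True if the half-open window [start, end) overlaps any interval."""
    return any(start < i_end and i_start < end for i_start, i_end in intervals)

def sample_negative_windows(gene_length, positives, window, n, seed=0, max_tries=10000):
    """Randomly pick up to n windows that avoid all positive intervals.

    Hypothetical helper: may return fewer than n windows if the gene
    is too crowded to place them within max_tries attempts.
    """
    rng = random.Random(seed)
    negatives = []
    tries = 0
    while len(negatives) < n and tries < max_tries:
        tries += 1
        start = rng.randrange(0, gene_length - window + 1)
        end = start + window
        # Reject windows touching a positive or an already chosen negative.
        if not overlaps(start, end, positives) and not overlaps(start, end, negatives):
            negatives.append((start, end))
    return negatives

# Made-up coordinates, purely for demonstration.
positives = [(100, 150), (400, 450)]
negatives = sample_negative_windows(1000, positives, window=50, n=5)
print(negatives)
```

Sampling like this controls position but not sequence composition, so the negatives may still differ from the positives in ways unrelated to the signal of interest.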
Thanks
PS: I do use CD-HIT to remove redundant sequences and reduce sequence similarity.
I don't think you're giving enough information for anyone to help you. It seems that you're not convinced you're taking the right approach, so what is the biological question you're trying to address? What is the data? How is it represented (i.e. what kind of features do you use)? Which classifier do you want to use?
Hi @Jean-Karim Heriche, I have edited the question to provide more information. Looking forward to your reply. Thanks!
Just to add to what @Jean-Karim Heriche said, it's difficult to know what makes a good negative set if we don't know what it is you are trying to train. Where have these positive examples come from? What features of them do you wish your net to learn? Are you sure you are overfitting? What is your performance on test data not used in training?
Why is your data imbalanced? If you are "creating" negatives, then presumably you can generate the same number as you have positives?
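As one illustration of generating exactly as many negatives as positives, the sketch below shuffles the bases of each positive sequence. This is a swapped-in technique chosen only as an example, not something the poster described: the `shuffled_negatives` name is hypothetical, and a simple shuffle preserves base composition but not dinucleotide frequencies, so it may produce negatives that are too easy to tell apart.

```python
import random

def shuffled_negatives(positives, seed=0):
    """Return one shuffled copy per positive sequence.

    Hypothetical helper: each negative has the same length and base
    composition as its positive, with the base order randomized.
    """
    rng = random.Random(seed)
    negatives = []
    for seq in positives:
        bases = list(seq)
        rng.shuffle(bases)
        negatives.append("".join(bases))
    return negatives

# Made-up sequences, purely for demonstration.
positives = ["ACGTACGTAA", "GGGCCCAATT"]
negatives = shuffled_negatives(positives)
print(len(positives), len(negatives))
```

Because every positive yields one negative, the two classes are balanced by construction.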
You still haven't told us what it is you're trying to do, i.e. what are you trying to predict and what question is this supposed to answer? If you're trying to detect a motif in sequences, you should try standard approaches first, e.g. HMM-based ones. Just because deep learning is fashionable and applicable to many problems doesn't mean it's always a good idea. Also, very deep networks are known to be susceptible to overfitting. There are many tricks to overcome overfitting in CNNs, but the main one is simply to get gigantic training data sets.