Dear community,
I want to build a machine learning classifier using a k-mer approach like RDP (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1950982/), but with a different machine learning algorithm and a different classifier (because RDP only goes down to genus, while I need species-level resolution).
What I can do/have done: download reference sequences (~300,000, each ~900 bp) > filter > annotate with taxonomic information > k-mer library creation
What I am stuck on: setting up machine learning
Problem: the training times for support vector machines (radial kernel) and random forest are so slow that they basically never finish. I have enabled parallel computing, but it seems the parallel part is only applied to the repeated LOOCV, not to the algorithm itself.
Does anybody have a clue what I can do next? I have access to an R shell running on a cluster that can use the doSNOW package. I also have a pretty fast machine right in front of me.
I haven't read the RDP paper but the title and the first line tell you that they use a naive Bayes classifier. I don't see any a priori reason for you not to do the same thing using species as class instead of genus. As for why what you're doing is slow, you'd have to profile your code to figure out where it is spending the most time. My guess is that the culprit is leave-one-out cross-validation because you need to build as many models as you have samples. Use k-fold cross-validation instead.
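For what it's worth, switching from LOOCV to k-fold cross-validation in caret is just a change to trainControl. A minimal sketch, assuming your k-mer features sit in a data frame (here called kmer_df, a placeholder) with a taxon label column:

```r
library(caret)

# 5-fold CV: 5 model fits in total instead of one fit per sample
ctrl <- trainControl(method = "cv", number = 5)

# kmer_df is a placeholder: one column per k-mer feature plus a 'taxon' label column
fit <- train(taxon ~ ., data = kmer_df, method = "svmRadial",
             trControl = ctrl)
```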
Yes, I would love to do this, but this specific implementation of naive Bayes is hardcoded to only go down to genus. I use createMultiFolds, so I have only 5 LOOCV runs in total; I do not check every sequence.
Oftentimes even full-length 16S doesn't offer enough resolution for species-level classification.
Good point.
Thank you for the information. We are testing new markers, so I have 3 different markers per sample and an average of 3 samples per species. BLAST gets me to 65% identification success (using LOOCV) down to species on my first marker. But my problem really is that I cannot combine markers in BLAST. The potential gain calculated in silico by adding marker 2 is +7%. But if I try to implement that in BLAST, my error rate spikes like crazy (I tried concatenating the markers or comparing bit scores and mismatches and such ...). I am trying machine learning because almost every paper claims it is superior to BLAST (especially because my sequences have so many indels that I cannot align them).
BLAST is a bad algorithm for this kind of task. Try the RDP classifier from their GitHub. You can get species-level classification with it as long as you have a species-level map file.
I know BLAST is bad; that is why I am here. I didn't know that you can classify down to species, because every two sentences they mention that it is limited to genus :D I will try to figure out how. Thanks! However, I need to transfer my method to plants, and therefore I need to re-evaluate which k-mer size is optimal and compare different machine learning approaches, both things I cannot do in RDP. (see my latest post at the bottom, like this)
You don't have to use their code to do naive Bayes classification. You can use e.g. the naivebayes package.
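Just to illustrate, a naive Bayes model on a k-mer feature matrix is a couple of lines with that package. A rough sketch, where kmer_mat, species_labels, and new_kmer_mat are hypothetical names for your feature matrix, label factor, and new data:

```r
library(naivebayes)

# kmer_mat: samples x k-mers feature matrix (placeholder name);
# numeric columns are modeled as Gaussian by default in this package
nb_fit <- naive_bayes(x = kmer_mat, y = species_labels, laplace = 1)

# classify new sequences encoded with the same k-mer columns
pred <- predict(nb_fit, newdata = new_kmer_mat)
```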
Look up the PyBrain tutorial on neural networks for classification.
I have no clue how to use Python. I can only use R.
R is in many cases a comparatively slow language. You may be able to speed up your code (particularly if you use the apply functions rather than loops), but you're probably taking a performance hit by using R one way or another.

I have figured out the code. It's pretty easy using the caret package, so there is really nothing to optimize, because it is just 2 commands. That is not my problem. The code runs fine with 100 sequences, but when I ramp it up to 20,000 species it just never ends, even if I don't use LOOCV at all. It does not even get to the LOOCV step, because the model initialisation takes forever. The LOOCV step is parallel, so I don't need to worry too much about that.
Check that you're not running out of memory. Swapping takes forever.
As in all interpreted languages, R code is slower than compiled code. However, many computation-intensive routines in R are implemented in C/C++ or Fortran; for example, commonly used SVM functions in R wrap libsvm. So the penalty you take in R usually comes from the code you write yourself, and as suggested, significant speed gains can come from vectorizing or parallelizing loops.
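For instance, something like extracting k-mers per sequence is naturally written with the apply family instead of an explicit loop that grows its result (illustrative only; seqs is a made-up character vector of sequences):

```r
k <- 8  # example k-mer size

# explicit loop that grows the result object step by step
kmers_loop <- list()
for (s in seqs) {
  starts <- seq_len(nchar(s) - k + 1)
  kmers_loop[[length(kmers_loop) + 1]] <- substring(s, starts, starts + k - 1)
}

# same thing with lapply: the result is pre-allocated and the
# vectorized substring() call does the per-sequence work
kmers_apply <- lapply(seqs, function(s) {
  starts <- seq_len(nchar(s) - k + 1)
  substring(s, starts, starts + k - 1)
})
```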
Sorry, I do not see how I can vectorize a loop if I just have one command? There is really no problem with R here, because it is just one command that calls a C library. That's all R does, so I have no code to optimize in the first place.
The LOOCV phase is parallelized by default by the R package, so I do not need to worry about that. My point is that creating one model is computed on a single core, and that is my bottleneck. I can compute 10 models for different LOOCV folds on 10 different cores; I know that, and I have a virtually unlimited number of cores in my cloud. But my problem is distributing one model across multiple cores. Does something like this exist? Because one model takes too long to compute: I need to train 150,000 observations with 1,024 features belonging to 55,000 classes in under one hour.
I cannot pre-cluster my observations, because sometimes I have two identical observations belonging to two different classes.
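To make concrete what I mean (placeholder names like kmer_df, and a made-up worker count): with a doSNOW backend registered, caret distributes the resampling iterations across workers, but each single model fit still runs on one core.

```r
library(doSNOW)
library(caret)

# register a parallel backend: caret then farms out the resampling
# iterations (the CV/LOOCV fits) to the workers, but each individual
# model fit still runs on a single core, which is my bottleneck
cl <- makeCluster(10)   # 10 is just an example worker count
registerDoSNOW(cl)

fit <- train(taxon ~ ., data = kmer_df, method = "svmRadial",
             trControl = trainControl(method = "cv", number = 10))

stopCluster(cl)
```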
The vectorization remark was a general comment since I hadn't seen your code. Now that you're giving more details, I see another potential issue, which is the high number of classes. Multiclass SVM is essentially implemented as a set of binary classifiers, so SVM is not well suited for very large numbers of classes. In addition, your number of samples is on the order of the number of classes, which is not ideal; for this number of classes, you would want many more samples.
In your case, I would take a hierarchical approach, since the taxonomic divisions are in a hierarchical relation: first classify the samples into broader taxonomic divisions, then refine the classification within each division.
Very cool answer!
So if I make a different model for each taxonomic hierarchy level, how do I combine them? Say I have 150 models, one per family down to genus, and 1,500 models, one per genus down to species.
Would you rather recommend random forest than radial SVM? I thought radial SVM is ideal, because if you imagine the genetic relations between species you end up with slightly overlapping bubble clouds.
Random forest is a good solution when there are many classes, but it is slower than SVM. You're also not restricted to two levels of the hierarchy, and you don't need to combine models: once you've classified samples into a division, you just build a new model to refine this division using only the samples in that division. When you want to classify a new sample, you first classify it with the first (i.e. top-level) model, then successively with the model corresponding to the previously predicted category.
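A rough sketch of that idea in R with caret (all object and column names here, such as train_df, kmer_cols, family, and species, are stand-ins; "rf" is random forest via caret):

```r
library(caret)

# Stage 1: classify into a broad taxonomic division (e.g. family)
top_fit <- train(family ~ ., data = train_df[, c("family", kmer_cols)],
                 method = "rf",
                 trControl = trainControl(method = "cv", number = 5))

# Stage 2: one refining model per division, trained only on its own samples
sub_fits <- lapply(split(train_df, train_df$family), function(d) {
  d <- droplevels(d)
  train(species ~ ., data = d[, c("species", kmer_cols)],
        method = "rf",
        trControl = trainControl(method = "cv", number = 5))
})

# Prediction: top-level model first, then the matching sub-model
classify <- function(new_kmers) {
  fam <- as.character(predict(top_fit, newdata = new_kmers))
  predict(sub_fits[[fam]], newdata = new_kmers)
}
```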
https://biodatamining.biomedcentral.com/articles/10.1186/1756-0381-7-4#MOESM2
I am trying to recreate this. The differences are that I need to use k-mers instead of sequence positions (because I cannot align my data) and that I have 300,000 sequences instead of 250. Seems more difficult than I thought. (I was the naive one (hoho).)
I will try to implement the split approach, although it is currently unclear to me how I can give generalized success rates if I do not know which sub-model my sequence will end up in. Another thing is that the GenBank taxonomy is really complicated and outdated, so I would rather avoid "human-defined barriers" like family or genus and get right to the biological level (species).
Then I will try the speed of the naive Bayes classifier, using the package you linked me, and see where this will take me. Thanks! All comments have been really helpful.
Greetings to Sergei and Dario, if you know them. Great Summer School! (edit: oh, EMBL is really big, probably not)
PS: Do k-mer classifiers use absence/presence of k-mers (per sequence), the probability of a k-mer per sequence, or the count of a k-mer per sequence?
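To make the three options concrete, this is what I mean, starting from a raw count table (kmer_counts is a placeholder for a samples x k-mers matrix):

```r
# kmer_counts: samples x k-mers matrix of raw counts (placeholder name)

# option 1: absence/presence of each k-mer per sequence
kmer_presence <- (kmer_counts > 0) * 1L

# option 2: per-sequence relative frequency of each k-mer
kmer_freq <- kmer_counts / rowSums(kmer_counts)

# option 3: the raw counts per sequence, i.e. kmer_counts as-is
```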