Novel sparse optimisation for high accuracy cancer classification on the 33-cancer TCGA gene-expression dataset
10 months ago
Mark Reilly ▴ 20

Hi, I've had a 20yr career in AI/DL, most recently successfully applied in finance, elevating me to Head of Research during that time based on the success I had. Having left finance almost 2yrs ago I'd been looking to apply my skillset to the non-finance world, and recently came across the 33-cancer TCGA gene-expression dataset. Having also recently developed a novel approach to sparsity on high dimensional problems that does away entirely with the L1 and L2 weight penalties of LASSO & Elastic Net, this dataset seemed a worthy testbed.

Although the initial sparse optimiser I had built was for linear regression, it still gave excellent results on the UCI PANCAN 5-cancer subset dataset, just MSE training OvR on +1/-1 targets. After solving for the various loss functions of Logistic Regression, SVM, and SVM with squared hinge-loss, I set about applying my three classification-based sparse optimisers to the full 33-cancer dataset.
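For readers less familiar with the loss functions named above, here is a minimal sketch of one-vs-rest (OvR) targets and the four textbook losses. It shows only the standard definitions, not the sparse optimiser itself, which is not described in this thread. The function names and the toy labels are illustrative assumptions.

```python
import math

def ovr_targets(labels, positive_class):
    """One-vs-rest targets: +1 for the chosen class, -1 for every other class."""
    return [1.0 if lab == positive_class else -1.0 for lab in labels]

# Standard per-sample losses for a target y in {+1, -1} and a model score s.
def squared_error(y, s):
    return (y - s) ** 2                      # MSE training on +1/-1 targets

def logistic_loss(y, s):
    return math.log1p(math.exp(-y * s))      # logistic regression

def hinge(y, s):
    return max(0.0, 1.0 - y * s)             # standard SVM hinge loss

def squared_hinge(y, s):
    return hinge(y, s) ** 2                  # SVM with squared hinge loss

labels = ["LAML", "COAD", "READ", "LAML"]    # toy labels, not real samples
targets = ovr_targets(labels, "LAML")
for y, s in zip(targets, [2.0, -0.5, 0.3, 0.5]):
    print(y, squared_error(y, s), logistic_loss(y, s), hinge(y, s), squared_hinge(y, s))
```

One difference worth noting: squared error still penalises a confidently correct score (y = +1, s = 2), whereas both hinge variants give it zero loss.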

The specific dataset I chose is the batch-adjusted files at https://gdc.cancer.gov/about-data/publications/pancanatlas .

Unfortunately, not all the TCGA samples are labelled in the info text file there, but I managed to fill in gaps with another TCGA sample list elsewhere, to get my sample count to 10283, which excluded all the off-site normal tissues and duplicates. I also removed all 4196 genes that had one or more NAs in them (genes removed on a sample by the batch correction process), resulting in 20531-4196=16335 features total.
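The gene filter described above can be sketched as follows. This is a toy illustration only: the in-memory layout (a dict from gene name to per-sample values) and the gene names are assumptions, not the format of the actual PanCanAtlas files.

```python
import math

def drop_genes_with_missing(expr):
    """expr maps gene -> list of per-sample values; keep only genes with no missing value."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return {gene: vals for gene, vals in expr.items()
            if not any(is_missing(v) for v in vals)}

# Hypothetical toy matrix: 4 genes x 3 samples, two genes contain an NA.
toy = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [0.5, float("nan"), 1.5],
    "GENE_C": [4.0, None, 6.0],
    "GENE_D": [0.0, 0.0, 1.0],
}
kept = drop_genes_with_missing(toy)
print(sorted(kept))

# The feature count quoted in the post: 20531 genes minus 4196 removed.
print(20531 - 4196)
```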

Although I couldn't quite perfectly align my sample set with their 10267 sample set, the current state-of-the-art on the PANCAN dataset, from what I can see, is the DL method MI_DenseNetCAM:

https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2021.670232/full

This achieves a 96.8% multi-class classification accuracy on the full 33-cancer dataset using 10-fold x-validation, utilising a shared set of 3600 genes per cancer-type/class, and a total parameter count of 13.9M parameters.

Although I have seen other papers report higher accuracy on the PANCAN dataset, they have all been on much smaller subsets of cancers. They often did not appear to be as robust in their procedures either.

Utilising 10-fold x-validation, my sparse optimisation method* achieves 97.2% accuracy on the full 33-cancer dataset. However, it does this with an average of only ~400 genes per cancer, and ~13K parameters total, i.e. 1/1000th of the parameters of MI_DenseNetCAM and 1/9th of the per cancer gene-count.

Even more remarkable* is that 96.4% accuracy can still be achieved with only ~800 parameters total (not per cancer) and an average of 24 genes per cancer type. This level of accuracy seems unprecedented for such sparse models.
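As a rough check on those parameter counts, the arithmetic can be sketched like this. It assumes each cancer type has its own linear one-vs-rest model with one weight per selected gene plus one bias, which matches the shape of the LAML example further down but is otherwise an assumption.

```python
def ovr_linear_param_count(n_classes, genes_per_class):
    """One weight per selected gene plus one bias, for each class."""
    return n_classes * (genes_per_class + 1)

larger_model = ovr_linear_param_count(33, 400)   # the ~400 genes per cancer model
sparsest_model = ovr_linear_param_count(33, 24)  # the ~24 genes per cancer model

param_ratio = 13.9e6 / larger_model              # versus MI_DenseNetCAM's 13.9M
gene_ratio = 3600 / 400                          # versus 3600 genes per class

print(larger_model, sparsest_model, round(param_ratio), gene_ratio)
```

Under that assumption the counts come out at roughly 13K and roughly 800, in line with the figures quoted.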

My method also achieves 69% accuracy on READ, a cancer that most other models achieve 0% on, as it's difficult to differentiate from COAD.

One other interesting fact is that there is very little overlap between the genes selected for each cancer type on the sparsest model, and each cancer appears to have a smallish set of signature genes.

As someone with no background in Bioinformatics and Genomics, it appears to me that the ability of my algorithm to zone in on the cancer signatures should be very useful in the development of targeted treatments and efficient diagnosis tools. I've come to this forum to ask for advice, guidance & thoughts on what my next steps should be, what the likely applications of my method are, and what challenges I may still need to overcome. Whether it's to write a paper, open source it, licence it, raise investment and start a company, I'm open to all well-argued opinions!

I'm happy to provide more details on the sample sets and other experimental setup details, along with full in-sample/out-of-sample stats. I'm also very happy to share any of the sparse model files for discussion on any of the individual 33-cancers.

An example of a very sparse set of genes that classified the cancer LAML with 100% accuracy OOS is below:

"NFE2|4778" : {"weight": 0.152544975, "mean": 3.99823, "stddev": 2.22415},

"ATP8B4|79895" : {"weight": 0.119082607, "mean": 5.8553, "stddev": 1.62709},

"RPL7|6129" : {"weight": 0.0841606408, "mean": 10.4455, "stddev": 1.09871},

"MTA3|57504" : {"weight": -0.0870735943, "mean": 9.6818, "stddev": 0.734529},

"LGMN|5641" : {"weight": -0.13933, "mean": 11.2215, "stddev": 1.1614},

"BCAR1|9564" : {"weight": -0.165008873, "mean": 10.5392, "stddev": 1.3575},

"bias" : {"weight": -1.44177818}

The same 6 genes above were selected by the best optimiser, for each of the 10-fold x-val runs. The specific weights shown were obtained by training on the full dataset. The 'mean' and 'stddev' values are of the training set features after the log2(1+x) transformation and are used to standardise the data before optimisation. Note that the bottom three gene-expressions have negative weights.
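To make the use of 'weight', 'mean' and 'stddev' concrete, here is a minimal sketch of scoring one sample with the LAML model listed above. The weights and statistics are copied from the post; the input format (a dict of raw expression values) and the reading that a higher score means "more LAML-like" are assumptions, since the decision rule is not stated.

```python
import math

# gene -> (weight, mean, stddev), copied from the model listed in the post
LAML_MODEL = {
    "NFE2|4778":    (0.152544975, 3.99823, 2.22415),
    "ATP8B4|79895": (0.119082607, 5.8553, 1.62709),
    "RPL7|6129":    (0.0841606408, 10.4455, 1.09871),
    "MTA3|57504":   (-0.0870735943, 9.6818, 0.734529),
    "LGMN|5641":    (-0.13933, 11.2215, 1.1614),
    "BCAR1|9564":   (-0.165008873, 10.5392, 1.3575),
}
BIAS = -1.44177818

def laml_score(raw_expression):
    """Linear score: log2(1+x) transform, standardise, then weighted sum plus bias."""
    score = BIAS
    for gene, (weight, mean, stddev) in LAML_MODEL.items():
        z = (math.log2(1.0 + raw_expression[gene]) - mean) / stddev
        score += weight * z
    return score

# Hypothetical sample whose six genes sit exactly at the training means:
# every standardised value is 0, so the score reduces to the bias.
at_mean = {gene: 2.0 ** mean - 1.0 for gene, (_, mean, _) in LAML_MODEL.items()}
print(laml_score(at_mean))
```

In a 33-class setting one would presumably compare this score with the scores of the other 32 per-cancer models, but that step is not described here.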

I look forward to your comments and thoughts!

Mark

*Achieved using my modified SVM approach with squared hinge-loss, although my modified Logistic Regression method is only marginally less good. My modified SVM with the standard hinge loss is a little worse than the others.

TCGA PANCAN cancer-classification gene-expression • 2.1k views
ADD COMMENT

I remember this pancancer classification task being done to death awhile back - here with miRNAs only. Be good to know how this compares https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5389567/

ADD REPLY
10 months ago
Zhenyu Zhang ★ 1.2k

If this is real, it's pretty good: "My method also achieves 69% accuracy on READ, a cancer that most other models achieve 0% on, as it's difficult to differentiate from COAD."

I have 2 comments:

  1. Why not publish your work first?
  2. You need to validate on a non-TCGA dataset, or otherwise you run the risk of picking up artifacts.
ADD COMMENT

Hi Zhenyu. Publishing is definitely an option, but as someone new to bioinformatics and genomics, I run the risk of not getting published, or of having the paper lost in the sea of similar papers. I would, however, be comfortable publishing this as an ML paper as that's my comfort zone. I would really like to use it to discover and/or identify something that could translate to clinical utility.

At some point it'll need to be demonstrated on non-TCGA data. I have seen enough papers that have demonstrated that their TCGA-trained models work on non-TCGA samples.

ADD REPLY

There doesn't seem to be a DM option in biostars, but I'd be happy to reach out and discuss offline.

ADD REPLY
10 months ago
Mark Reilly ▴ 20

As a follow-up, someone directed me to the TCGA curated subtype dataset, which I've also applied my method to:

https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/subtypes.html

Despite the fact that the individual sample counts for many of the subtypes are very small (the average sample count per subtype is ~60), my sparse-optimisation classifier achieved 82% accuracy overall across 83 cancer subtypes (after exclusion of subtypes with fewer than 20 samples) on a total of 6596 samples, across 24 primary cancer types, using 10-fold x-validation.

Although some subtypes were clearly more difficult to differentiate than others, performance was not overly degraded by a low sample size of individual subtypes. For example, 97% classification accuracy was achieved on 4 PCPG subtypes, despite low sample counts (Cortical-admixture=22, Kinase-signaling=68, Pseudohypoxia=61, Wnt-altered=22). On the sparsest model, 92% classification accuracy was still achieved with between 7 and 10 genes per PCPG subtype.

As an experienced AI/ML person new to bioinformatics & genomics, I'm keen to learn where my technology can have the greatest impact, and I'm extremely grateful for the comments I've had so far.

Given that subtype determination seems to greatly impact treatment plan and prognosis, what cancer subtypes would have the greatest clinical impact?

Below is a table of RECALL, PRECISION & F1 out-of-sample scores, along with average model gene count across all the subtypes. Given the sheer number of subtypes, I thought it would be useful to share the table, in case there were specific subtypes that were known to be difficult to differentiate and/or had high impact on clinical outcomes, and also to highlight if I'm making any novice mistakes by not grouping some of the subtypes for example.
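For anyone wanting to reproduce such a table from their own predictions, per-class recall, precision and F1 can be computed along these lines. This is a generic sketch with made-up labels, not the code used to produce the table.

```python
def per_class_scores(y_true, y_pred):
    """Return {class: (precision, recall, f1)} from paired true/predicted labels."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[cls] = (precision, recall, f1)
    return scores

# Toy example: one READ sample is mistaken for COAD.
y_true = ["COAD", "COAD", "READ", "READ"]
y_pred = ["COAD", "COAD", "COAD", "READ"]
for cls, (p, r, f1) in per_class_scores(y_true, y_pred).items():
    print(f"{cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```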

I also have an equivalent table for my sparse model that achieves 80% subtype classification accuracy overall with an average of only 17 genes per subtype. It appears this could be useful in the development of Next Generation Sequencing (NGS) targeted panels, for which development often seems to focus on the identification of cancer subtype through identification of small sets of biomarker genes.

Subtype Results Table

ADD COMMENT
10 months ago
Mensur Dlakic ★ 28k

I will start by saying that I am skeptical of anyone who just joined the field and instantly gets a result that tops many years of research. There is nothing personal in that attitude, nor am I being protective of previous research, as this is not my field of interest. Even though you seem to be aware of it, I want to reiterate that there will be others out there who share my opinion.

You keep saying that these results are from 10-fold cross-validation. Am I correct in assuming none of them were tested on a held-out test dataset? If so, then I'd be doubly skeptical about your bottom-line results. As the purpose of CV is to estimate model performance on unseen data, your CV results only indicate that your model is of good quality. A slightly different distribution in unseen data compared to your training subset could lower the accuracy to maybe 95%. Your model would still be very good and your CV estimate would still be reliable, but the actual result on unseen data could be about the same as previous methods. Unless you are testing this on a large enough test dataset that has been held out of training, your CV results are just estimates.

About your general approach: sparsifying data and/or feature elimination are standard approaches for systems with many variables. I suspect that the dataset is likely short on data for at least some cancer classes. I don't know the literature in this field so this is purely a conjecture: others have tried some variant of this approach before, because it is too obvious not to have been attempted. If I were in your shoes, I would not be convinced that my method works well just because my CV is higher. My suggestion is to be sure that your model is better, rather than just appearing to be better, before you think about saving the world or even publishing this work.

ADD COMMENT

My suggestion is to be sure that your model is better, rather than just appearing to be better, before you think about saving the world or even publishing this work.

Nonsense! The recasting of the sparsity optimization problem implied by this:

Having also recently developed a novel approach to sparsity on high dimensional problems that does away entirely with the L1 and L2 weight penalties of LASSO & Elastic Net

is likely worthy of publication by itself -- to the extent it's independent from other extensions of compressive sensing. If multi-class contrasts come out "for free" then all the better! As such the application to TCGA is a proof-of-concept demonstration. Mark Reilly the reason this is not super meaningful within the general context is that all of these cancers have near-perfect biomarkers (morphology, karyotyping, surface proteins, etc) that make RNA-based classification a non-critical task; and separating a "bona-fide cancer" gene from a "cell-of-origin" marker would remain a downstream task for all the resulting gene lists.

That said, the application of this method to (prognostic or post-hoc) refractory signatures [i.e., whether the tumor actually responded to the treatment] -- which is where sparse models are very commonly applied -- may be rewarding. Indeed, the trend of literature in group/guided LASSO; structured sparsity; sparse Bayesian models; etc has shifted towards response prediction or hazard modeling.

I would be interested in seeing performance in identifying cell subtype markers in single-cell data. The bar is very low in this area as there tend to be a high number of classes with a wide spectrum of abundance. The default approach is still either one-vs-one or one-vs-rest T-tests of various sorts, and so could be significantly improved.

ADD REPLY

Thank you for your reply and insights. Do you have any pointers to data on cell subtype markers in single-cell data? I found: ACTINN: automated identification of cell types in single cell RNA sequencing with some accompanying datasets, but if you have any specific public datasets in mind, I'd be very grateful for suggestions.

ADD REPLY

Cellxgene (https://cellxgene.cziscience.com/datasets) has a large compendium of single-cell datasets.

ADD REPLY

Hi Mensur,

I was beginning to think that nobody was going to take the time to reply, so thank you for doing so! Given your initial comments, I think it's probably worth me writing a little on my professional background.

I obtained my 1st degree and MEng at Trinity College, Cambridge in 1996, having specialised in AI & signal processing. Since then, I've built a career in the practical application of AI in challenging environments, especially those with a low signal to noise and/or hostile environments. From 2004-2011 I built world-class poker AI using Deep Neural Networks, MC and Bayesian Methodologies, beating the University of Alberta's high-profile Polaris pokerbot in a lengthy statistically significant, online Heads-up Limit Holdem match at their acceptance. The UoA's model was essentially trained on a supercomputer. My model utilised adaptive AI & deep neural networks in real-time to learn how to exploit its opponent. Not long after this victory the UoA solved the Nash-Equilibrium for 2-player Limit-Holdem.

In 2012 I entered finance, and in a relatively short period of time started generating 10s of millions of dollars using Deep Neural Networks and Reinforcement Learning with a low latency, high-frequency, high Sharpe Ratio trading strategy. To achieve this, I built the entire AI-based trading model training platform, with walk-forward training, feature pruning, and developed an exceptionally robust process for generating high-confidence models that had a near-identical realised return profile to the trained models. This achievement elevated me to Head of Research where I managed and mentored the company's team of quants and researchers. In my lengthy professional experience, I have sliced, diced, trained on, every type of data conceivable. I have hand-written from scratch whole libraries of algorithms, many of my own creation, with all manner of regularisation methods, including various exotic distributions and parameter-sharing methods you almost certainly have never heard of.

It is true that I am new to Bioinformatics & Genomics. However, I am not new to data like TCGA. I'd say that compared to all the crazy real-world data I've had to handle, it's one of the best put together sets of data I've had to ever train a model on. The individuals that put this data set together, should be very proud of what they've done, despite whatever limitations others may highlight about the TCGA data.

To address the other points in your reply.

10-fold X-validation.

1) The paper I reference (and compare to) is using 10-fold cross-validation in the same way. I am clearly comparing apples with apples.

2) Without a true, independent and unseen test dataset, there is no such thing as a test dataset. All prior-known/available test sets are optimised on, one way or another, by researchers to get the best test result. Even when true test datasets are available, they are often too small (for small, high-dimensional datasets like TCGA) compared to the training dataset to be particularly meaningful.

The 'gold standard' of out-of-sample testing on non-timeseries data is leave-one-out cross-validation (LOOCV). Here every sample is used once in test, and the model is trained N times, each time on the other N-1 of the N samples. Apart from linear regression, this is usually far too costly to do, so we use the 'silver standard', which is k-fold cross-validation (k>1), with k=10 a very good benchmark level. Whilst not as good as LOOCV, you still have as many out-of-sample data points as training samples.

Compared to leaving the researcher to choose which 10-25% of the training set he'll 'honestly' set to one side on day one and never touch until testing day, K-fold cross-validation is a significantly more robust method for determination of generalisation performance. K-fold cross-validation is a repeatable, low-variance measure of OOS performance. With a high enough K, it is much more difficult to hill-climb compared to a small, fixed hold-out set. It is true that a researcher could still grid search (or EA) optimise their hyperparameters and not tell anyone. However, on 10-fold x-validation this has far less utility than doing it on a small hold-out set and not telling anyone. I wouldn't entertain the conclusions of any ML paper that didn't employ 10-fold (or better) x-validation on a dataset small enough to be prone to overfitting.
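The relationship between k-fold cross-validation and LOOCV described above can be sketched with a small index generator. This is a generic illustration (random shuffling with a seed, arbitrary sample counts), not the fold assignment actually used for the results in this thread.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices); every sample is in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for fold in range(k):
        test = indices[fold::k]              # every k-th shuffled index
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

ten_fold = list(k_fold_splits(103, k=10))    # 10-fold on a toy 103-sample set
loocv = list(k_fold_splits(12, k=12))        # LOOCV is simply k == N

print([len(test) for _, test in ten_fold])
print(all(len(test) == 1 for _, test in loocv))
```

In both cases the number of out-of-sample predictions equals the number of samples; what changes is how many models must be trained (k versus N).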

Sparse Optimisation

In my time in AI & finance, I have hand-written and tested numerous sparsity-inducing regularisation methods, from L1/LASSO/Elastic Net and L^p (where p < 1, i.e. fractional norm) to linear programming methods and greedy (orthogonal) matching pursuit algorithms. All norm-based sparsity methods suffer the problem of conflating the norms/magnitudes of parameters with sparsity. Sparsity has nothing to do with the norms of the parameters; only the count of non-zero parameters is important. My novel approach has been to decouple parameter optimisation from the question of whether a parameter should exist or not. It turns out that in doing so, sparsity is preferred, even when it is not encouraged with hyperparameters. The optimisation 'chooses' to zero out features that are noisy and unreliable. With some hyperparameters to encourage additional sparsity, the model can produce exceptionally sparse, high-performing models. Increasing sparsity this way enables the model to achieve high out-of-sample performance on very small sample sizes (see PCPG subtype performance above). This is not an artefact of the data, or the training, or the 10-fold cross-validation.
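The distinction between norm size and non-zero count can be illustrated with two made-up weight vectors. This only demonstrates the general point about norm penalties; it says nothing about how the optimiser described here works internally.

```python
def l0_count(w):
    """Number of non-zero parameters (the quantity sparsity is actually about)."""
    return sum(1 for v in w if v != 0.0)

def l1_norm(w):
    return sum(abs(v) for v in w)

def l2_norm(w):
    return sum(v * v for v in w) ** 0.5

# Illustrative vectors: one large weight versus four small ones.
sparse_but_large = [5.0, 0.0, 0.0, 0.0]
dense_but_small = [0.1, 0.1, 0.1, 0.1]

for name, w in [("sparse_but_large", sparse_but_large),
                ("dense_but_small", dense_but_small)]:
    print(name, l0_count(w), l1_norm(w), l2_norm(w))
```

An L1 or L2 penalty scores the dense vector as "cheaper" even though it uses four times as many parameters, which is the conflation of magnitude and sparsity referred to above.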

As others have pointed out to me on another Bioinformatics forum, my results, whilst an impressive demonstration of the algorithm’s optimisation/classification capabilities, are not proof of real-world clinical utility. Some subtype determinations might be useful, as might some relevant biomarkers, especially if backed up by other means, but it is some distance away from being useful or relevant to the real-world without some adaptation to solve the clinically useful questions.

I know that I do not have enough domain knowledge myself to apply this to real-world problems with useful clinical utility. This is why I’ve come to this forum, in the hope that I might gain others’ insights to point me in the right direction.

ADD REPLY

I obtained my 1st degree and MEng at Trinity College, Cambridge in 1996, having specialised in AI & signal processing. Since then, I've built a career in the practical application of AI in challenging environments, especially those with a low signal to noise and/or hostile environments. From 2004-2011 I built world-class poker AI using Deep Neural Networks, MC and Bayesian Methodologies, beating the University of Alberta's high-profile Polaris pokerbot in a lengthy statistically significant, online Heads-up Limit Holdem match at their acceptance. The UoA's model was essentially trained on a supercomputer. My model utilised adaptive AI & deep neural networks in real-time to learn how to exploit its opponent. Not long after this victory the UoA solved the Nash-Equilibrium for 2-player Limit-Holdem.

In 2012 I entered finance, and in a relatively short period of time started generating 10s of millions of dollars using Deep Neural Networks and Reinforcement Learning with a low latency, high-frequency, high Sharpe Ratio trading strategy. To achieve this, I built the entire AI-based trading model training platform, with walk-forward training, feature pruning, and developed an exceptionally robust process for generating high-confidence models that had a near-identical realised return profile to the trained models. This achievement elevated me to Head of Research where I managed and mentored the company's team of quants and researchers. In my lengthy professional experience, I have sliced, diced, trained on, every type of data conceivable. I have hand-written from scratch whole libraries of algorithms, many of my own creation, with all manner of regularisation methods, including various exotic distributions and parameter-sharing methods you almost certainly have never heard of.

This is starting to sound like a LARP.

If you have a good sparse ML algorithm, publish or post it.

If you want to establish your bona-fides, post your LinkedIn.

This right here:

I have hand-written from scratch whole libraries of algorithms, many of my own creation, with all manner of regularisation methods, including various exotic distributions and parameter-sharing methods you almost certainly have never heard of.

has the same energy as this:

I'll have you know I graduated top of my class in the Navy Seals, and I've been involved in numerous secret raids on Al-Quaeda, and I have over 300 confirmed kills. I am trained in gorilla warfare and I'm the top sniper in the entire US armed forces

ADD REPLY

This wasn't my intention, and I clearly failed if I've diverted attention away from discussing the problems I'd like to learn how to solve.

ADD REPLY

I deliberately wrote my original comment in a sobering tone, so I will start this one with an inspirational story. It is about a gentleman named Robert C. Edgar, who has a non-biology degree and is a programming wizard. I think he worked in academia, then in industry, made a lot of money and retired early. I think he then realized that being retired with a lot of money doesn't necessarily provide enough intellectual challenge, so he teamed up with more biologically oriented people to develop software that could be used to solve their problems. Any of this sound familiar?

Bob Edgar went on to make a stunning contribution to the field and created some of the most useful programs for biologists, and did so as the first (and often the only) author on the resulting papers. Pretty sure he even made some extra money along the way that he probably didn't need. I think you know how to research the rest of this story, but I will conclude this part by saying that a Bob Edgar exception only makes me slightly less skeptical about many others who have similar ambitions.

I think you might have misunderstood my suggestion about held-out data. The suggestion was not to split the data to train and test, then make a single model on train data and validate it on the held-out subset. That is an old and largely abandoned way of doing ML, even with much larger datasets and with fewer classes than in your case. My suggestion is to split the data, do an N-fold CV on train data, and then get the average performance of N models on the held-out data. Either that, or do N-fold on the whole dataset but find independent test data. Splitting data such that train and test datasets have identical data point distributions is a non-trivial problem, for the reasons that are laid out below.
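A minimal sketch of that protocol, under stated assumptions: a toy one-feature nearest-centroid classifier stands in for the real model, the data are synthetic and trivially separable, and "average performance of N models" is read as the mean of the N per-model accuracies on the held-out set. All function names are illustrative.

```python
import random

def fit_centroids(samples):
    """Toy stand-in model: per-class mean of a single feature."""
    sums, counts = {}, {}
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def accuracy(centroids, samples):
    hits = sum(min(centroids, key=lambda c: abs(x - centroids[c])) == label
               for x, label in samples)
    return hits / len(samples)

def cv_models_on_holdout(samples, n_folds=5, holdout_fraction=0.2, seed=0):
    order = samples[:]
    random.Random(seed).shuffle(order)
    n_holdout = int(len(order) * holdout_fraction)
    holdout, dev = order[:n_holdout], order[n_holdout:]   # holdout is never trained on
    cv_scores, holdout_scores = [], []
    for fold in range(n_folds):
        train = [s for i, s in enumerate(dev) if i % n_folds != fold]
        val = dev[fold::n_folds]
        model = fit_centroids(train)
        cv_scores.append(accuracy(model, val))            # usual CV estimate
        holdout_scores.append(accuracy(model, holdout))   # same model on held-out data
    return sum(cv_scores) / n_folds, sum(holdout_scores) / n_folds

rng = random.Random(1)
toy = ([(rng.uniform(-1, 1), "A") for _ in range(50)]
       + [(rng.uniform(9, 11), "B") for _ in range(50)])
cv_acc, holdout_acc = cv_models_on_holdout(toy)
print(cv_acc, holdout_acc)
```

On real data the interesting quantity is the gap between the two numbers; here the classes are separable by construction, so the example only shows the mechanics.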

You say that the 10-fold CV you did is apples to apples with what others have done. How do you know that? Are the 10 folds you used the same as the fold distributions in other papers? If not, then it isn't apples to apples. I agree with you that N-fold CV generally has low variance, but that premise holds best on large datasets with small class numbers. The smaller the dataset and the more imbalanced classes we have in it, the less likely it is that the low variance of N-fold validation will hold. After all, we are talking about small differences between your results and previous efforts. Is it possible that your method would get 95.6% instead of 96.8% if you were using the same fold splits as in the other papers? Absolutely.

For really small datasets with many classes, it is often the case that randomly obtained folds are unequal in terms of class distribution. I don't mean unequal in terms of class member numbers, but in terms of data points that are easy versus difficult to classify. If we have 100 data points in class F and each fold gets 10 of those no matter what random seed we use to split our folds, that doesn't automatically mean that all 10-fold splits are equivalent. I don't want to hand-wave here why that's the case because you can test this on your own, but here is what I would do: make sure not only that in each fold we have the same number of class members, but that we also have the same proportion of difficult class members. There are numerous examples in Kaggle competitions - where incidentally there is always a held-out test dataset - of competitors getting higher N-fold CV scores for no reason other than luckily stumbling on a favorable random seed when splitting folds. In the Kaggle environment, better CV scores often do not translate to better results on held-out data.

You don't have to take my word for any of this. Do another 10-50 10-fold splits with different random seeds for fold generation and see how many times you get exactly the same performance as quoted above.
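That experiment can be sketched as below. The data and the nearest-centroid classifier are toy stand-ins chosen so the example runs on its own; only the outer loop (same data, same method, different fold seeds) reflects the suggestion.

```python
import random
from statistics import mean, pstdev

def make_toy_data(n_per_class, seed):
    """Two overlapping 1-D classes, so some samples are genuinely hard to classify."""
    rng = random.Random(seed)
    data = [(rng.gauss(0.0, 1.0), "A") for _ in range(n_per_class)]
    data += [(rng.gauss(2.0, 1.0), "B") for _ in range(n_per_class)]
    return data

def cv_accuracy(samples, k, seed):
    """Pooled out-of-fold accuracy of a toy nearest-centroid model for one fold seed."""
    order = samples[:]
    random.Random(seed).shuffle(order)
    correct = 0
    for fold in range(k):
        test = order[fold::k]
        train = [s for i, s in enumerate(order) if i % k != fold]
        centroids = {label: mean(x for x, lab in train if lab == label)
                     for label in {lab for _, lab in train}}
        correct += sum(min(centroids, key=lambda c: abs(x - centroids[c])) == label
                       for x, label in test)
    return correct / len(order)

data = make_toy_data(n_per_class=60, seed=123)
scores = [cv_accuracy(data, k=10, seed=s) for s in range(20)]   # 20 different fold splits
print(f"min={min(scores):.3f} max={max(scores):.3f} sd={pstdev(scores):.4f}")
```

The spread between the minimum and maximum gives a feel for how much of a reported difference could come from the fold split alone.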

Everyone likes to think that their ideas are unique and revolutionary, and maybe yours are in that category. What I know for sure is that there are many approaches that are conceptually similar to what you are doing. Whether one is removing complete features (Lasso), or sparsifying the data without removing any features, or making dense data with fewer dimensions (SVD), or doing non-linear dimensionality reduction (t-SNE, UMAP), or doing representation learning of one kind or another (autoencoders), it is possible to find a method and a combination of data distributions that may seem to work better. This is especially true when our main measure of model quality is N-fold CV and there is no direct test of generality on new data.

Since this will likely be my final post - and definitely my final detailed post - here is a piece of advice. If you are onto something here that could change treatments and make some money, you may want to avoid discussing it too much in the open forum or being in haste to publish. While the patent laws are different in the US and other countries, depending on the depth of discussion, and certainly after the publication, you would either forfeit the patent right, or be on a very short clock to file a patent. Then again, maybe you already made all the money you will ever need, and now you want to do something that will be useful to others. Either way I think the Robert Edgars of the world will be rooting for you, and RCE himself may even want to collaborate.

ADD REPLY

Great, thank you for the useful pointers in here, and sorry if the tone of my reply sounded a little defensive. RCE's story sounds very familiar to me. My primary goal is not money, so I have already ruled out pursuing a patent. I know I have very little chance of bringing about any impact by myself, and frankly would love nothing more than to find the appropriate person to collaborate with.

ADD REPLY

One thing on the 10-fold x-validation I meant to say was that I deliberately do not use random selection. I sort all samples by cancer, and then by sample-id. For fold i (i = 1 to 10), the test set is then every 10th sample in that ordering, starting from the ith. This always picks an appropriate split of training and test samples. This should also evenly sample across the various batches and origin labs (assuming that TCGA sample-ids are grouped?). I could consider sorting the samples within a class by their OOS classification difficulty so that these are used evenly as test samples. This should hopefully find an accurate lower-bound OOS performance.
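The deterministic fold assignment described above can be sketched as follows. The sample ids and class sizes are made up for illustration; real TCGA barcodes would simply replace them.

```python
from collections import Counter

def interleaved_folds(samples, k=10):
    """samples: list of (sample_id, cancer). Sort by cancer then id; fold = position mod k."""
    ordered = sorted(samples, key=lambda s: (s[1], s[0]))
    return {sample_id: pos % k for pos, (sample_id, _) in enumerate(ordered)}

# Hypothetical toy set: 25 LAML and 13 READ samples.
samples = ([(f"L{i:03d}", "LAML") for i in range(25)]
           + [(f"R{i:03d}", "READ") for i in range(13)])
fold_of = interleaved_folds(samples)

# Count how many samples of each cancer land in each test fold.
per_fold = Counter((cancer, fold_of[sid]) for sid, cancer in samples)
for cancer in ("LAML", "READ"):
    print(cancer, [per_fold[(cancer, f)] for f in range(10)])
```

Because each cancer occupies a contiguous run in the sorted order, its per-fold counts can differ by at most one, so the split is stratified by class without any random seed.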

Finally on the Kaggle competitions, yes, when there is someone who is managing the test set that way, then this is preferred.

ADD REPLY
