Let me play the devil's advocate, as much as I understand Devon's sentiment.
"Can a Machine do better in selecting candidates than a human?"
Human decision making is not based on objective criteria, as hard as we may try, but on subjective preference, like 'interest', and in particular relies on 'homology'. One could argue that humans should not be allowed to select candidate genes for a routine test pipeline, when there are many more potential candidates than slots in the wet-lab validation pipeline.
In many cases we trust machines (a.k.a. algorithms) to make essential decisions on a set of candidates. Imagine you are looking for information on the web: would you visit all documents and inspect them manually? No! We let Google rank the documents for us, and often we trust the result (only ever using the first page or so), possibly too much, as we do not know the intricacies of its algorithm. There are many areas where machine learning and data mining are or will be applied: think of automatic ranking of applicants, matching of partners, etc.
Formalizing selection criteria into an algorithm could be particularly useful to make the biological assumptions explicit, more than the subsequent ranking process itself.
Missing Objective Criteria and Gold Standard
Candidate selection or gene prioritization with respect to what? The problem is possibly the lack of objective criteria, and also that the objectives will differ from case to case. Examples:
- Rank candidates for causative mutations
- Rank candidates for vaccine targets
Most likely you will not be able to use the same algorithm or heuristics for both.
"but why are scientists not developing a (Me: as in single) good approach for gene prioritization (GP)?"
Because a single best approach will not work across different settings: there is no single scoring function that solves all problems. As in any regression or machine learning application, the choice of features is crucial, often more so than the choice of algorithm (SVM, random forest, linear regression, etc.).
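To make the point about objective-specific scoring concrete, here is a minimal sketch. The feature names, values and weights are all made up for illustration, and a weighted sum is only one of many possible scoring functions. It shows the same genes ranked differently once the objective changes.

```python
# Hypothetical per-gene features, each scaled to [0, 1]; values are invented.
genes = {
    "geneA": {"conservation": 0.9, "expression": 0.8, "surface_exposed": 0.1},
    "geneB": {"conservation": 0.2, "expression": 0.6, "surface_exposed": 0.9},
    "geneC": {"conservation": 0.5, "expression": 0.3, "surface_exposed": 0.4},
}

# Hypothetical weights: each objective values the same features differently.
weights = {
    "causative_mutation": {"conservation": 1.0, "expression": 0.5, "surface_exposed": 0.0},
    "vaccine_target": {"conservation": -0.5, "expression": 0.5, "surface_exposed": 1.0},
}

def rank(genes, w):
    """Return gene names sorted by their weighted feature sum, best first."""
    def score(name):
        return sum(w[feature] * value for feature, value in genes[name].items())
    return sorted(genes, key=score, reverse=True)

for objective, w in weights.items():
    print(objective, rank(genes, w))
```

With these invented numbers the two objectives produce reversed rankings, which is the point: the features and their weights carry the biological assumptions, not the sorting step.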
Then, how do you score your genes and evaluate whether the ranking is optimal?
A possible approach could be to define a very limited setting, e.g. prioritize the genes that will yield a phenotype on knock-out. This is particularly important in a setting where the capacity to test candidates is limited to a fraction of all genes, as with some non-model organisms (e.g. the salmon louse we are working with).
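In such a limited setting, one common way to evaluate a ranking is precision at k: of the top k genes you would actually test, how many are known to give a phenotype? A minimal sketch follows, assuming a small validated set exists. The gene IDs and the validated set are hypothetical.

```python
def precision_at_k(ranking, positives, k):
    """Fraction of the top-k ranked genes that are in the validated positive set."""
    top = ranking[:k]
    return sum(1 for gene in top if gene in positives) / k

# Hypothetical ranking produced by some prioritization method, best first.
ranking = ["g3", "g7", "g1", "g9", "g4", "g2"]
# Hypothetical genes already known to yield a phenotype on knock-out.
with_phenotype = {"g3", "g1", "g2"}

print(precision_at_k(ranking, with_phenotype, 3))  # 2 of the top 3 are validated
```

Choosing k to match the wet-lab capacity keeps the metric tied to what can actually be tested.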
A slightly simplified example: we need new drug targets in Lepeophtheirus salmonis because it is an important salmon parasite, but we can test maybe 100 genes per year. If we wanted to knock down all ~13k genes at least once, we would need 130 years. We need to prioritize our selection to increase the number of successful knock-outs; if we don't get a phenotype out of an experiment, we haven't really learned anything. If we simply choose the core metabolic or other lethal genes, these are often highly conserved and therefore maybe not so interesting (the host and other crustaceans in the sea possibly have them too). We need to rescue the genes that are unique or have little or no homology; people often pick genes that are well annotated by homology, so those without homology remain understudied.
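The back-of-the-envelope numbers can be written out as below. The gene count and yearly capacity are the ones quoted above; the two hit rates are purely hypothetical and only illustrate why a better ranking matters when the budget is fixed.

```python
n_genes = 13_000       # approximate gene count quoted in the post
tests_per_year = 100   # wet-lab capacity quoted in the post

years_for_all = n_genes / tests_per_year
print(years_for_all)   # 130.0

def expected_phenotypes(hit_rate, n_tests=tests_per_year):
    """Expected number of tested genes that yield a phenotype."""
    return hit_rate * n_tests

# Hypothetical hit rates, for illustration only.
for label, rate in [("unprioritized", 0.05), ("prioritized", 0.25)]:
    print(label, expected_phenotypes(rate))
```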
Currently, it seems that network-based approaches built on gene expression data can outperform human expert-based selection by a vast margin, but this is hard to prove, because one needs to run experiments to validate these candidates.
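For readers unfamiliar with the idea, here is a toy sketch of one simple network-based heuristic: build a co-expression network from pairwise Pearson correlation and rank genes by how many partners they have. This is not necessarily the approach referred to above. The expression values, the threshold and the use of degree as a score are all assumptions for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical expression profiles across five samples.
expression = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],
    "g3": [5, 4, 3, 2, 1],
    "g4": [3, 1, 4, 1, 5],
}

threshold = 0.8  # hypothetical cut-off on |r| for drawing an edge
degree = {gene: 0 for gene in expression}
for a, b in combinations(expression, 2):
    if abs(pearson(expression[a], expression[b])) >= threshold:
        degree[a] += 1
        degree[b] += 1

# Genes with the most co-expression partners come first.
ranking = sorted(degree, key=degree.get, reverse=True)
print(degree, ranking)
```

Note that nothing here uses homology or annotation, so poorly annotated genes compete on equal terms.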
That's a bit of a false premise in your headline, because some researchers and groups are certainly working on gene prioritization, but let's not get distracted by that.
I think just using DGE results and selecting genes based on fold-change is not the most appropriate method. Literature searches always lead to the selection of KNOWN candidates, with very little left to discover. Most of these tools are PPI-based and don't really give good guidance for selection: it is very difficult to understand why the first is first and the last is last. Shouldn't there be a method that gives a handle like a p-value for selecting genes and, if possible, uses all the available information in terms of GO, pathways and interactions?
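One generic way to attach a p-value-like handle to any prioritization score is a permutation test: recompute the score many times with shuffled sample labels and ask how often the shuffled score is at least as extreme as the observed one. A minimal sketch follows, using the absolute difference in group means as a stand-in score. The function name, data and number of permutations are assumptions, and a real score could combine GO, pathway and interaction evidence.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Empirical two-sided p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # shuffle the sample labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical expression of one gene in two conditions.
treated = [10, 11, 12, 13, 14]
control = [1, 2, 3, 4, 5]
print(permutation_p_value(treated, control))
```

The same loop works for any score, as long as you can recompute it on shuffled labels, which also makes "why is first first" easier to answer than with an opaque rank.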
99% of the time we only care about known and well-characterized genes. Few labs want to dedicate the resources needed to characterize little-known genes; they have other goals.
I think that is the core of the problem. I remember there was a paper saying that we are still working on pretty much the same gene set as before the 'genomics' era.
No disagreement there. We need to fund characterization of a broader set of genes; there are a lot of known unknowns out there.
This is the paper you are thinking of: https://academic.oup.com/bioinformatics/article/34/12/2087/4816110
Wasn't it this paper? https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2006643
Thanks, this is hot-hot, just published!!
Yes, just having the results of differential expression is not the best. A differential expression analysis is just a basic and standard analysis, from my perspective. However, I disagree that we should be using GO terms and pathways because, at least from my perspective, they introduce more noise and confusion into an analysis. This is independently confirmed by my conversations with numerous researchers. One colleague, a professor in London, referred to DAVID as 'pseudo-science'.
What comes after a differential expression analysis is follow-up studies and predictive model building. I note that a substantial portion of researchers lack the ability to see the positive impact of their research on society (or they just do not think about it because they do not have to). As a result, some research groups appear 'directionless' and may constantly seek out novel things to do for publication purposes, as opposed to really aiming to make a difference and have a positive impact on society with their research.
'Machine learning' and AI are further algorithms that are chucking further noise into biological data... like bad fashion trends (like those purposefully torn and colour-faded jeans).
I don't think GO terms and pathways introduce noise, just that we don't know how to reduce the noise. Non-gene-centric classification systems are difficult to interpret and extrapolate to the gene level. ML and AI are further atypical science for a 'statistical' biologist.
This comment was meant for the next comment.