Hi everyone!
I am a senior-year computer science student and I am considering bioinformatics as the field for my ML thesis.
Since I lack academic background in genetics, I would be thankful for any help clarifying my current draft, or for links to people or organizations who could help me finish my introduction (or tell me to discard it).
I am still in the research stage, so apologies for any fundamental errors in the reasoning behind my questions.
I plan the topic of my BSc thesis to be "Machine learning applications in the diagnosis of genetic diseases".
The idea is to build a neural-network model (and compare its effectiveness against basic ML tools such as linear regression) that, given extracted genetic features, decides whether a case is at risk of developing the disease. I would like to focus on rheumatic diseases (spondyloarthropathies such as RA and AS) and work on publicly available datasets.
I found the datasets hosted on "ncbi.nlm.nih.gov" genuinely compelling and selected two of them as relevant to my topic:
Gene expression collected from 120 samples: RNA extracted from white blood cells, available as raw reads in the SRA browser. The description contains a useful summary of the data-processing method. The output is feature counts for every "Ensembl ID" gene annotation of the human genome.
"Screening genes associated with rheumatoid arthritis and ankylosing spondylitis" with 480 samples. Each sample is genotyped on a specific group of "dbSNP" entries from a custom human SNP list.
After building a model with acceptable accuracy on the test set, I plan to use regularization to discard useless features. The goal is to decide whether or not there is a fixed group of features that determines illness (e.g. the HLA-B*27 variant of the HLA-B gene on chromosome 6: https://www.snpedia.com/index.php/HLA-B).
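To make this concrete, here is a minimal sketch of the comparison I have in mind, using scikit-learn. The feature matrix and labels are random placeholders standing in for the real extracted features, so the shapes and names are illustrative only:

```python
# Minimal sketch: L1-regularised logistic regression (a linear baseline that
# also discards uninformative features) compared with a small neural network.
# X and y are random placeholders for the processed genetic features/labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 500))      # placeholder: samples x gene features
y = rng.integers(0, 2, size=120)     # placeholder: case/control labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Linear baseline: the L1 penalty shrinks useless feature weights to zero.
linear = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
linear.fit(X_train, y_train)
print("linear accuracy:", linear.score(X_test, y_test))
print("features kept:", int(np.sum(linear.coef_ != 0)))

# Small neural network for comparison.
nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
nn.fit(X_train, y_train)
print("NN accuracy:", nn.score(X_test, y_test))
```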
My questions are:
Is it scientifically valid to build such a model?
By this I mean: is a genetic illness driven by the joint relationship between many genes across chromosomes (or by a single gene variant that triggers the illness)? Or is it such a complex issue that machine learning on gene features misses the point?
Is it scientifically correct to combine data from different experiments (e.g. use both the first and the second dataset mentioned above)?
Could you briefly explain the relation between the first and the second dataset? (My reasoning is below.)
3.1. The first one I understand as counted occurrences of every gene in a sample (the Ensembl ID indicates e.g. the ENSG00000223972 gene).
So the samples carry no information about variations within genes?
For example, I can easily find the HLA-B gene with its count, but the sample does not say which allele it is (so I cannot tell whether it is the HLA-B*27 variant).
So the dataset provides the full set of genes for every sample, but how should I interpret the counts?
Does a lower count for a given ENSG entry mean a higher probability that the sample does not carry the reference sequence but some SNP variation?
(As I understand it, a sample might not have all genes; the file contains over 60,000 ENSG entries.)
3.2. The second dataset covers only a given group of genes and focuses on SNPs, i.e. the different versions of a given gene.
Why are the HLA genes omitted from this dataset, given that they are considered crucial to these diseases?
As I understand it, if the HLA-B gene were included, the SNP set would also include the HLA-B27 variation: https://www.ncbi.nlm.nih.gov/snp/rs13202464
The first dataset operates on RNA, the second on DNA. Am I allowed to convert the first dataset to DNA (with the TopHat software: http://ccb.jhu.edu/software/tophat) and treat it as DNA?
Are sample sizes of 120 and 420 enough for such research?
Should I focus on samples at the genome level, or rather on samples describing gene variations (SNP level)?
Can I take any sample's raw reads from SRA, align them to the human genome with the "Bowtie" software, and treat the result as a usable genome sample?
Could you recommend any additional dataset platforms for my further search, or people/organizations who would be keen to give advice?
Let me just ask this. How much work have you already put into this and what parts are still proposed future work?
On the ML side I have solid experience, and I wish to apply it in a more complex setting.
I have already analyzed the datasets mentioned above, whose shapes satisfy my basic requirements for ML (numerous features and coherent examples).
In genetics I have only just started my research, so I would like to understand whether the approach described in the first paragraph is valid from a genetics point of view (the way it is for image recognition).
I am aware that any model in such a field also requires domain knowledge (so as not to build a model that asserts relations between features that are false in reality); therefore, if the described idea is at all valid, I would be encouraged to continue delving into the topic within my BSc thesis.
The future work would focus on building the model described in the first paragraph.
Your first data set contains measures of gene expression. This is a continuous measure of how "switched on" a gene is. To a first approximation, no information is captured about which sequence variant is found.
Your second dataset is about sequence variation. It is a categorical variable and records which variant of each gene is present.
So, let's use an analogy. Let's say that we have four people, Bob, Jack, Sally and Anne, and they are all drinking beer. The first dataset is like saying that Bob has 500ml of beer and Jack has 350ml, while the second dataset is like saying that Sally has an IPA and Anne has a Pilsner.
Note that there is nothing in the first dataset about what type of beer Bob and Jack have, and nothing in the second that says how much beer Sally and Anne have.
Thus these two datasets are measuring different things and cannot be converted one to the other.
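If it helps to see this as data rather than beer, here is a toy illustration of the two shapes, using ENSG00000234745 (HLA-B) as an example gene and the rs13202464 SNP you mentioned; all values are invented and purely for illustration:

```python
# Toy illustration: gene expression is continuous per gene, genotype is
# categorical per SNP. All values below are invented.
import pandas as pd

# Dataset-1 style: how much each gene is "switched on" in each sample.
expression = pd.DataFrame(
    {"ENSG00000234745": [512.0, 88.5],   # HLA-B expression level
     "ENSG00000223972": [3.0, 0.0]},
    index=["sample_1", "sample_2"],
)

# Dataset-2 style: which variant each sample carries at a given SNP.
genotypes = pd.DataFrame(
    {"rs13202464": ["A/A", "A/G"]},
    index=["sample_1", "sample_2"],
)

print(expression)   # the amount, but not which allele
print(genotypes)    # which allele, but not how much
```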
I don't know why the HLA-B SNP you are interested in isn't included. It could be that it wasn't known about when this dataset was generated. It could be that it is not possible to measure it with the technology used in this dataset (it's quite an old technology, and HLA is notoriously difficult to type).
With such a small number of SNPs as used here, there is almost certainly no call for an ML approach. Traditional approaches, such as a chi-squared test on the frequency of the variant in sufferers vs. non-sufferers, are almost certainly sufficient if there is signal to be detected.
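For illustration, that sort of test is only a few lines; the counts below are made up:

```python
# Chi-squared test of association between carrying a variant and disease
# status, on an invented 2x2 contingency table.
from scipy.stats import chi2_contingency

#                 carrier  non-carrier
table = [[45, 75],    # sufferers
         [20, 100]]   # non-sufferers

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4g}")
```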
If you wish to apply ML techniques to SNP data, I'd recommend getting hold of something done with a 500k SNP chip. That gives 500,000 features, probably in ~O(10k) patients, and there the benefits of ML might be more apparent. (No, I don't know of any such datasets for your disease of interest, but there probably is one somewhere.)
In general, I'd say this is a very ambitious project for an undergraduate thesis. It's much more like a master's thesis, or the first part of a PhD. I definitely recommend that you find yourself someone who knows about these data types, how they are generated, etc.
+1 for beer based analogies.
You have included many questions in the post/comments above, and it would be hard to address all of them in a forum like this. Not being from this subject area, you naturally have gaps in your knowledge, which is totally understandable (I don't know a lot about ML!).
It would be highly beneficial to find a local geneticist/bioinformatician/biologist, someone you can have a coffee/beer with and talk through things. They can help you go over some of the basics of the type of data you are using, where and how it originates, and what sort of inferences one can logically draw from it. My assumption is that your supervisor is not a biologist, or they would likely have helped you before you got to this point.
For an undergraduate thesis this work should be fine. You just need to make sure that there are no scientific flaws and the conclusions are appropriately stated. All the best.
Thank you @genomax, @i.sudbery and @jrj.healey for such comprehensive answers!
In that case I will try to get in touch with a geneticist for further explanation.
@jrj.healey
The thesis is rather about showing the application of ML tools and comparing them; it definitely isn't serious research. Most ML BSc theses offered at my university cover the basics of image recognition or language processing (the thesis is a full-year course).
In this case I thought that the NCBI database would provide enough data for demonstration purposes in the thesis.
So, as I understand it, publicly available datasets aren't sufficient for this?
Do you suggest that it would be more reasonable to take on this topic only with a custom dataset?
In an ideal world you would have your own custom dataset, but that's not to say that a public database would be no good. You might have to cast the net very wide to have enough data to train on though, and as Ian alluded to, the datasets you've identified so far are not equivalent in what they show, though there may be nothing wrong with using them individually.
You might have to do something like (off the top of my head): find all/as many datasets as you can that have looked for SNVs in the human genome (for instance) which are known to be associated with your disease of choice. It would probably be valid to combine these datasets so long as you know the ground truth for each study (i.e. in dataset X variant Y definitely causes Z, in dataset A variant B definitely causes C, and so on and so forth).
It is likely to be much easier to start with something 'static' like variants, rather than transcriptional data. The latter would require considerable between-experiment normalisation and you'd probably spend all year just wrangling the data rather than doing any actual learning. NCBI undoubtedly has enough data, but finding 'compatible' datasets is likely to be pretty arduous.
If this is just a 'toy' project though, without any far-reaching designs on publication/validation of the models etc., then you can probably get away with using smaller datasets. The network will probably be a less effective predictor, but the actual engineering of the network/proof of principle would be a perfectly valid research project for a Bachelors program (even if you get to the end and have to conclude it is no good!).
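For what it's worth, a minimal sketch of that combining step, under the assumption that each study's table has samples as rows, shared dbSNP IDs as columns and a case/control label, might look like the following; all file and column names are hypothetical placeholders:

```python
# Hypothetical sketch: merge two SNP genotype tables on the SNPs they share,
# keeping a common case/control label for later supervised learning.
import pandas as pd

# Placeholder file names; each table: samples as rows, dbSNP IDs as columns.
study_a = pd.read_csv("study_a_genotypes.csv", index_col="sample_id")
study_b = pd.read_csv("study_b_genotypes.csv", index_col="sample_id")

# Set aside the case/control label (assumed column name: "status").
labels = pd.concat([study_a.pop("status"), study_b.pop("status")])

# Keep only SNPs genotyped in both studies, then stack the samples.
shared_snps = study_a.columns.intersection(study_b.columns)
combined = pd.concat([study_a[shared_snps], study_b[shared_snps]])

# One-hot encode the categorical genotypes (e.g. "A/A", "A/G") for ML use.
features = pd.get_dummies(combined)
print(features.shape, labels.value_counts())
```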
Your premise, I believe, is sound. There are plenty of research groups (and even more companies) using AI techniques to probe for biomarkers - though how successful any of this has been is probably up for some debate. A 1 year project, if you have the right advice and a clear path/objective seems reasonable to me.
There are definitely data out there. The GWAS Catalog lists 64 Genome-Wide Association Studies (GWAS) that have studied arthritis, and they will have collected exactly the sort of data you want.
However, you may have trouble accessing it easily. Genotype information is uniquely identifiable, especially when whole genomes are considered. Thus it is often thought that there is no such thing as anonymous data in genomics studies (although there are people out there trying to create clever ways to anonymize genomic data), and so the data is not generally available to the public. In order to access the data, you would need your supervisor to apply for permission for your study and sign various guarantees that your institution's IT systems are up to protecting the data and that you are not going to try to identify people from it.
One source of genomic data that is out in the open is the Cancer Cell Line Encyclopedia. It's not arthritis, but they have 1,000 cell lines with full genotype information, publicly available. All the lines are "cancer", but they cover many different sorts of cancer, so you could try to learn the differences between them.