Hello all,
I am a computer scientist with little knowledge of genomic data. I want to use artificial intelligence tools such as machine learning to classify the molecular subtype of a particular cancer.
I plan to use RNA-seq data, but I am unsure what kind of data would work best for this problem. Should it be a gene expression matrix with genes as rows and samples as columns, or should the samples be the rows? And where can I get such data?
I would appreciate it if anyone could guide me on this or provide any leads.
Thank you in advance.
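On the orientation question: expression matrices are commonly distributed with genes as rows and samples as columns, while most machine-learning libraries expect one row per sample and one column per feature, so a transpose is usually the first step. Here is a minimal sketch in Python with NumPy. The counts, gene names and subtype labels are all made up for illustration, and the nearest-centroid classifier is only a simple stand-in, not a recommendation for real data:

```python
import numpy as np

# Toy counts, invented for illustration: 4 genes (rows) x 6 samples (columns),
# the orientation in which expression matrices are commonly distributed.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]   # placeholder names
samples = ["s1", "s2", "s3", "s4", "s5", "s6"]     # placeholder names
counts_genes_by_samples = np.array([
    [900, 850, 950,  20,  30,  25],
    [ 10,  15,  12, 700, 650, 720],
    [100, 110,  95, 105,  98, 102],
    [  5,   8,   6,   7,   5,   9],
], dtype=float)
subtype = np.array(["X", "X", "X", "Y", "Y", "Y"])  # hypothetical labels

# Transpose to samples x genes: one row per sample, one column per gene.
X = counts_genes_by_samples.T

# log2(count + 1) compresses the large dynamic range of count data.
X_log = np.log2(X + 1.0)

def fit_centroids(X, y):
    """Return the mean expression profile of each class."""
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict(X, centroids):
    """Assign each sample to the class whose centroid is nearest."""
    labels = list(centroids)
    dists = np.stack(
        [np.linalg.norm(X - centroids[label], axis=1) for label in labels],
        axis=1,
    )
    return np.array([labels[i] for i in dists.argmin(axis=1)])

# Hold out one sample per class as a quick sanity check.
train_idx = np.array([0, 1, 3, 4])
test_idx = np.array([2, 5])
centroids = fit_centroids(X_log[train_idx], subtype[train_idx])
predicted = predict(X_log[test_idx], centroids)
print(predicted)
```

With real data the same samples-by-genes layout can be passed to any standard classifier; only the loading step and the normalization would need to change.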
I wouldn't necessarily see the "normal samples" as contaminants. They could serve more or less as "internal positive controls" for unsupervised learning approaches, on the assumption that normal tissue samples are more similar to each other than to tumor samples. Having said that, I am sure there will be cases where the biology of some normal samples is closer to that of tumor samples than to other normals.
You may have misunderstood. When biopsies of cancers are taken, some non-malignant tissue is sometimes harvested as well. This makes the "cancer" samples, in effect, a mixture of normal and cancer cells. The proportion of normal and cancer material varies a lot between biopsies, and this can be an unwanted source of noise in a dataset. It can be addressed by deconvolution techniques or by statistical correction.
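To illustrate what a statistical correction could look like, here is a minimal sketch in Python with NumPy. It assumes a per-sample tumor purity estimate is already available (the `purity` vector here is simulated, as is the expression data) and removes the linear effect of purity from each gene by least-squares regression. This is only one simple form of correction, not necessarily the one meant above, and it is not a deconvolution method:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 50, 3

# Hypothetical per-sample tumor purity (fraction of cancer cells); simulated here.
purity = rng.uniform(0.3, 0.95, size=n_samples)

# Simulated log-expression (samples x genes): gene 0 rises with purity,
# gene 1 falls with purity, gene 2 is unrelated to purity.
slopes = np.array([4.0, -3.0, 0.0])
noise = rng.normal(0.0, 0.2, size=(n_samples, n_genes))
expr = 5.0 + np.outer(purity, slopes) + noise

def regress_out(expr, covariate):
    """Remove the linear effect of one covariate from every gene (column)."""
    design = np.column_stack([np.ones_like(covariate), covariate])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    residuals = expr - design @ coef
    # Add the per-gene mean back so values stay on the original scale.
    return residuals + expr.mean(axis=0)

def corr_with_purity(matrix):
    """Pearson correlation of each gene with purity."""
    return np.array(
        [np.corrcoef(purity, matrix[:, j])[0, 1] for j in range(matrix.shape[1])]
    )

corrected = regress_out(expr, purity)
print("before:", np.round(corr_with_purity(expr), 2))
print("after: ", np.round(corr_with_purity(corrected), 2))
```

One caveat with this kind of correction: if a subtype itself tends to have higher or lower purity, regressing purity out will also remove some of the real subtype signal, so an alternative is to keep purity as a covariate in the downstream model.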