I'm doing a differential gene expression analysis on two types of leukemia, and I need RNA-seq data from normal samples. The data were downloaded from the GDC hub on the Xena website.
I have done the differential expression analysis, but I realize I should use healthy samples as a reference to give a clear idea of how gene expression differs between normal samples and each type of leukemia.
If they do not provide healthy controls then there is probably not much you can do about it. Using any unrelated dataset in the same statistical analysis is meaningless, as you cannot distinguish biological effects from technical confounders. Also, you would need to choose carefully which healthy cells you actually consider an appropriate control. Would it be a healthy progenitor cell, a monocyte, a granulocyte? That depends on the type of leukemia you are investigating and is not trivial. Typically one compares disease subtypes with each other and then clusters them based on their relative differences in transcription. If you then hypothesize that certain leukemias derive from a certain cell type, you would need to perform additional experiments to confirm that. There is no guarantee that each type of leukemia has the same cell of origin (not to mention that leukemia is not at all a precise disease category, as there are lymphoid and myeloid leukemias with all kinds of subtypes). If you compare any given sample with three different cell types from normal donors you'll get different results each time, so I think you first need to define, based on your data, what you want to compare with.
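To make the confounding point concrete, here is a minimal sketch in Python (standard library only). It assumes a hypothetical layout in which every leukemia sample comes from one study and every healthy sample from another; the sample labels, the study names and the `matrix_rank` helper are made up for illustration and are not part of any package. In that layout the condition column and the batch column of the design matrix are identical, so no model can estimate the two effects separately.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def design_matrix(conditions, batches):
    """Columns: intercept, condition == 'leukemia', batch == 'study_B'."""
    return [[1, int(c == "leukemia"), int(b == "study_B")]
            for c, b in zip(conditions, batches)]

# Hypothetical layout: disease and control come from different studies.
conditions = ["normal", "normal", "normal", "leukemia", "leukemia", "leukemia"]
confounded_batches = ["study_A", "study_A", "study_A",
                      "study_B", "study_B", "study_B"]
# Hypothetical alternative: each study contributed both conditions.
balanced_batches = ["study_A", "study_B", "study_A",
                    "study_B", "study_A", "study_B"]

confounded_rank = matrix_rank(design_matrix(conditions, confounded_batches))
balanced_rank = matrix_rank(design_matrix(conditions, balanced_batches))

# A rank below 3 means the condition and batch effects cannot both be estimated.
print("confounded design rank:", confounded_rank)
print("balanced design rank:", balanced_rank)
```

The confounded design has fewer independent columns than coefficients, which is the formal version of "you cannot distinguish biological effects from technical confounders".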
TCGA provides RNA-seq counts for different types of cancer, and these are already being used in differential gene expression analyses.
I managed to do the analysis and the result was fine for this comparison; I only need normal healthy samples to increase the specificity.
That's the answer I wanted to give, but I really hoped that there was some super-smart way with Bayesian latent variable analysis or Deep Learning or AI published in Nature several months ago =( Well, looks like nobody is aware of one... like, even PEER would not do this magic? https://www.ncbi.nlm.nih.gov/pubmed/22343431
There are two datasets for lung cancer. If you want to do a differential gene expression analysis, you need to use normal samples (as controls). If you just run these two files you will get genes that differentiate between the two subtypes. But what about normal? You will need it to confirm the gene expression across the three conditions: normal, LUSC and LUAD.
And the normal data should be similar to the cancer data in terms of size or number of cases.
Of course you'll get differential gene results. That doesn't mean they're real or informative.
If you want to press on and waste your time analysing random, incompatible data from the internet, be my guest - but when it turns out to be a waste of time, and it will, don't say we didn't tell you.
I'd suggest doing a cost-effectiveness analysis: how much money you'd spend on RNA-seq of paired normal samples in order to get a limited list of strong candidates vs. how much money you'd spend on validating hundreds of genes.
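As a rough sketch of that comparison in Python: every number below is a made-up placeholder (cohort size, price per library, price per validated gene, length of each candidate list), and the function names are purely illustrative. Plug in real quotes from your sequencing facility and wet lab before drawing any conclusion.

```python
def sequencing_cost(n_pairs, cost_per_sample):
    """Cost of sequencing one matched normal sample per patient."""
    return n_pairs * cost_per_sample

def validation_cost(n_candidates, cost_per_gene):
    """Cost of validating each candidate gene in the lab."""
    return n_candidates * cost_per_gene

# Placeholder values only, not real prices or real candidate counts.
extra_sequencing = sequencing_cost(n_pairs=20, cost_per_sample=300)
validate_short_list = validation_cost(n_candidates=30, cost_per_gene=50)
validate_long_list = validation_cost(n_candidates=400, cost_per_gene=50)

# Option A: pay for matched normals, then validate a short candidate list.
with_controls = extra_sequencing + validate_short_list
# Option B: skip the normals and validate a long, noisy candidate list.
without_controls = validate_long_list

print("with paired normals:", with_controls)
print("without paired normals:", without_controls)
```

Which option comes out cheaper depends entirely on the inputs, so the useful part is the structure of the comparison, not these particular totals.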
After using intensive bioinformatics tools to identify a list of genes for a specific subtype of cancer, the lab will have a high probability of getting robust results.
@Joe's point was that it will be impossible to ensure that you "will have a high probability of getting robust results", whatever bioinformatics tricks you apply, and after thinking about it briefly I subscribe to this point of view.
You can't (or at least shouldn't) compare samples you didn't sequence yourself under as close to identical conditions as possible.
There are likely to be significant batch effects between some random data from the internet and your own, that would obscure any real differences.
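A toy simulation illustrates how this plays out. The sketch below (Python, standard library only) assumes identical biology in all samples, a per-gene batch shift on the log2 scale for the "downloaded" samples, and arbitrary values for the noise and shift sizes; none of these numbers come from real data. It counts genes whose mean log2 difference exceeds 1 (a two-fold change) within one batch and across batches.

```python
import random
from statistics import mean

random.seed(0)
N_GENES, N_SAMPLES = 200, 5
NOISE_SD, BATCH_SD = 0.2, 1.0  # assumed values on the log2 scale

def simulate(base, shift):
    """Per-gene lists of log2 expression values for N_SAMPLES samples."""
    return [[b + s + random.gauss(0, NOISE_SD) for _ in range(N_SAMPLES)]
            for b, s in zip(base, shift)]

def n_large_changes(group_a, group_b, threshold=1.0):
    """Number of genes whose mean log2 difference exceeds the threshold."""
    return sum(abs(mean(a) - mean(b)) > threshold
               for a, b in zip(group_a, group_b))

# The same true expression level is used for every group: no biology differs.
base = [random.uniform(2, 12) for _ in range(N_GENES)]
no_shift = [0.0] * N_GENES
batch_shift = [random.gauss(0, BATCH_SD) for _ in range(N_GENES)]

own_a = simulate(base, no_shift)            # your samples, group 1
own_b = simulate(base, no_shift)            # your samples, group 2
downloaded = simulate(base, batch_shift)    # unrelated dataset, shifted

within_batch_hits = n_large_changes(own_a, own_b)
cross_batch_hits = n_large_changes(own_a, downloaded)

print("genes 'changed' within one batch:", within_batch_hits)
print("genes 'changed' across batches:", cross_batch_hits)
```

Every cross-batch "hit" here is purely technical, since the simulated biology is the same in all three groups. With real data you would have no way of telling such hits apart from genuine differences.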
Thanks for the reply. I am comparing adult with paediatric leukemia, aiming to find a specific set of genes for each, using TCGA and TARGET.
The findings were good; I only need RNA-seq data from healthy samples.
Regards