Dear BioStars Forum,
First of all, I would like to thank the people in this forum for building this invaluable resource. I am still a beginner when it comes to RNA sequencing data analysis, and my knowledge of proper methods is still rather limited. However, reading many of the answers on here has greatly helped my understanding so far.
I am currently working with the TCGA RNA sequencing datasets. In particular, I am inspecting several cohorts and would like to perform differential gene expression analysis comparing the cancer samples to samples of normal tissue. Ideally, this would produce a list of differentially expressed genes for each of the inspected tumor types, against which I could then compare other results. However, two factors complicate this plan:
- Not all cohorts that I would like to investigate have normal tissue samples provided by the TCGA project. One such example would be the ACC (adrenocortical carcinoma) cohort.
- Even for those cohorts for which normal tissue samples are provided, there is concern that the normal samples may be of limited use, because they were taken from tissue that appeared histologically normal but could be physiologically influenced by the tumor microenvironment (as discussed, for example, in section 10 of this paper by Buzdin et al., 2019).
I was therefore wondering whether it is possible to leverage RNA sequencing datasets for normal tissue from other databases (e.g. GEO, GTEx, etc.). I am aware that any such attempt at combining data from multiple studies would be heavily influenced by technical effects and by differences between platforms and experimental strategies. Specifically, I would include only datasets that were sequenced on the same platform as their TCGA counterparts (Illumina) using at least similar experimental protocols. From what I understand, this would still make such an analysis exploratory at best and simply irrelevant at worst.
In addition to the above, I have read about several approaches to normalizing data to account for such effects, always keeping in mind that they are likely to limit the interpretability of the results. From my understanding, simply including the data source as a batch variable in the design matrix for the differential gene expression analysis (as proposed, for example, in this question) might not be feasible here: if all normal samples come from the external source and all tumor samples come from TCGA, the batch variable is linearly dependent on the group variable (healthy vs. tumor) (as outlined in this guide, section 7.4). Please correct me if I am wrongly informed on this. One approach that stood out to me as a possibly promising way to combine these datasets more meaningfully is the one detailed by Molania et al., 2022, called RUV-III PRPS.
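To make the linear dependence concern concrete, here is a minimal sketch in Python with NumPy. The sample layout is a made-up toy example (three tumor samples from TCGA, three normal samples from an external source), and the helper name `design_matrix` is my own, not taken from any of the linked resources. It only checks the rank of an intercept + group + batch design matrix; it does not run any differential expression analysis.

```python
import numpy as np


def design_matrix(group, batch):
    """Build an intercept + group + batch design matrix (hypothetical helper)."""
    intercept = np.ones(len(group), dtype=float)
    return np.column_stack([intercept, group, batch]).astype(float)


# Toy layout (assumption): all tumors come from TCGA, all normals from elsewhere.
group = np.array([1, 1, 1, 0, 0, 0])  # 1 = tumor, 0 = normal
batch = np.array([0, 0, 0, 1, 1, 1])  # 0 = TCGA, 1 = external source

confounded = design_matrix(group, batch)
# Here group + batch equals the intercept column, so one column is redundant.
rank_confounded = np.linalg.matrix_rank(confounded)

# Contrast (assumption): each source contributes both tumor and normal samples.
group_mixed = np.array([1, 1, 0, 1, 0, 0])
batch_mixed = np.array([0, 0, 0, 1, 1, 1])
mixed = design_matrix(group_mixed, batch_mixed)
rank_mixed = np.linalg.matrix_rank(mixed)

print("columns:", confounded.shape[1])
print("rank, fully confounded design:", rank_confounded)
print("rank, mixed design:", rank_mixed)
```

In the fully confounded layout the matrix has three columns but a lower rank, which is the situation in which the group and batch effects cannot be estimated separately. For a cohort where TCGA also provides some normal samples, the design would look more like the mixed case.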
My question now, given my limited experience and possible lack of knowledge about proper methods, is as follows: Given my goal of obtaining a list of differentially expressed genes for each tumor type individually, is there any potential at all in combining TCGA datasets with normal tissue samples from other studies? If so, what metrics could I employ to build confidence in the list of genes obtained from such merged data?
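As a purely hypothetical illustration of the kind of metric I mean (this is my own assumption, not an established recommendation): for a cohort where TCGA does provide normal samples, one could run the analysis once against the TCGA normals and once against the external normals, and then compare the two results. The sketch below computes the Jaccard overlap of the two gene lists and the fraction of shared genes whose log fold changes point in the same direction. The gene names and values are made up.

```python
def jaccard(genes_a, genes_b):
    """Jaccard index of two gene collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sign_concordance(lfc_a, lfc_b):
    """Fraction of shared genes whose log fold changes have the same sign."""
    shared = set(lfc_a) & set(lfc_b)
    if not shared:
        return 0.0
    agree = sum(1 for gene in shared if (lfc_a[gene] > 0) == (lfc_b[gene] > 0))
    return agree / len(shared)


# Made-up log2 fold changes for significant genes (placeholders, not real data).
lfc_tcga_normals = {"GENE_A": 2.1, "GENE_B": -1.4, "GENE_C": 1.8, "GENE_D": -2.0}
lfc_external_normals = {"GENE_A": 1.7, "GENE_B": -0.9, "GENE_C": -1.2, "GENE_E": 3.0}

overlap = jaccard(lfc_tcga_normals, lfc_external_normals)
concordance = sign_concordance(lfc_tcga_normals, lfc_external_normals)

print(f"Jaccard overlap of gene lists: {overlap:.2f}")
print(f"Direction agreement on shared genes: {concordance:.2f}")
```

Whether agreement of this kind would actually justify trusting the merged analysis for cohorts without TCGA normals (such as ACC) is exactly what I am unsure about.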
I would like to apologize in advance should any of the above reveal an obvious lack of understanding of proper methods. I am looking forward to your comments and would like to thank you for your time and effort in reading this question.
Dear Ming Tommy Tang,
Thank you for your input! I read the article you linked and found out about UCSC's Toil Pipeline project, which I was unaware of until now. Given that they provide TCGA and GTEx data processed with the same pipeline, this should help eliminate computational batch effects. Thank you very much for pointing me this way!
In the article you linked, the authors do not perform additional data correction steps after obtaining the initial count data and filtering for protein-coding genes, and I am wondering whether that is truly feasible. As outlined in the paper by Molania et al. linked above, there are other significant sources of technical variation in the TCGA data. Going back to my original question: Would a differential gene expression analysis be more plausible if an attempt was made to correct for at least the known sources of such variation before merging the datasets, or if no such attempt was made? I feel this is likely not something that can be answered in general, which is why I also asked about metrics one could employ to answer this question.