Has anyone used ESTIMATE (http://bioinformatics.mdanderson.org/main/ESTIMATE:Overview) to infer tumor purity and stromal score from RNA-seq data? I am not clear on how to use this tool or what input file format it expects. There are only a few steps, but I could not figure out how to load my own data to run the program. Thanks very much for your help.
Dear Ihaiyan3,
Did you manage to figure out how to load your own data to run the program? It is not clear what the input file should be for this tool!
I appreciate your help and time!
The ESTIMATE algorithm (Yoshihara et al. 2013 Nature Communications) consists of two steps. In the first step, an enrichment score is calculated using single-sample GSEA (Barbie et al. 2009 Nature). Note that although immune cells are essentially part of the stroma, Yoshihara et al. calculated two enrichment scores: one based on immune-related genes, which they referred to as the "immune" score, and one based on non-immune genes, which they referred to as the "stromal" score. The final ESTIMATE score is the sum of the immune and stromal enrichment scores. In the second step, the ESTIMATE enrichment score is converted to tumor purity using the following formula:
Tumor purity = cos(0.6049872018 + 0.0001467884 × ESTIMATE score)
where "Tumor purity" represents ABSOLUTE-based tumor purity (ABSOLUTE is another algorithm that computes tumor purity based on somatic DNA copy number alterations), and "ESTIMATE score" represents the ESTIMATE enrichment score obtained from TCGA Affymetrix data, as explained above. The key point is that this calibration formula was derived using only Affymetrix data, and therefore cannot be used to convert RNAseq-based ESTIMATE score to tumor purity. That being said, you may still apply the single-sample GSEA algorithm to properly normalized RNAseq data to obtain ESTIMATE enrichment scores, and incorporate them as a covariate in your downstream analysis to account for tumor purity.
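To make the second step concrete, here is a minimal Python sketch of the conversion formula quoted above. The function name `estimate_purity` and the `platform` argument are my own illustration, not part of the ESTIMATE package; the guard simply encodes the point that the calibration was derived from Affymetrix data only.

```python
import math

# Coefficients copied from the formula quoted in this thread.
INTERCEPT = 0.6049872018
SLOPE = 0.0001467884


def estimate_purity(estimate_score, platform="affymetrix"):
    """Convert an ESTIMATE score to tumor purity with the quoted formula.

    The platform check is an illustrative assumption: it refuses to apply
    the Affymetrix-derived calibration to scores from other platforms.
    """
    if platform != "affymetrix":
        raise ValueError(
            "calibration was derived from Affymetrix data only; "
            "use the raw score as a covariate instead"
        )
    return math.cos(INTERCEPT + SLOPE * estimate_score)


# A higher ESTIMATE score (more stromal/immune signal) gives a lower purity.
print(estimate_purity(-1000.0), estimate_purity(1000.0))
```

For RNAseq-derived scores the sketch raises an error on purpose, which mirrors the advice above to keep the raw enrichment score and use it as a covariate.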
"The key point is that this calibration formula was derived using only Affymetrix data, and therefore cannot be used to convert RNAseq-based ESTIMATE score to tumor purity" ... How does this not answer the question?
First of all, "as this was done by X" is rarely the right way to verify the assumptions of a computational algorithm. Second of all, ESTIMATE is published and the R code is publicly available for anyone to review. The ESTIMATE R package by default only accepts "affymetrix", "agilent", or "illumina" microarray data as input. Can you feed normalized RNAseq data as input to ESTIMATE? You surely can! ESTIMATE uses single-sample GSEA to compute immune and stromal scores; it then adds them up to get the ESTIMATE score, which one can use for downstream analyses. In fact, this is what is provided on their website for TCGA RNAseq data. However, you can’t plug these scores into their formula to calculate tumor purity, as this formula was derived specifically for microarray data.
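For anyone who wants to see what "single-sample GSEA, then add the two scores" looks like, here is a simplified Python sketch in the spirit of Barbie et al. It is not the code from the ESTIMATE R package: the rank-based weighting with exponent 0.25, the gene names G1..G10, and the two toy gene sets are all assumptions for illustration, and the real signatures are much larger.

```python
def ssgsea_score(expression, gene_set, alpha=0.25):
    """Simplified single-sample enrichment score for one sample.

    expression: dict mapping gene symbol -> expression value (one sample).
    gene_set:   iterable of gene symbols in the signature.
    alpha:      exponent applied to the rank weight (assumed value).
    """
    # Rank genes from highest to lowest expression within this one sample.
    ordered = sorted(expression, key=expression.get, reverse=True)
    n = len(ordered)
    members = set(gene_set) & set(ordered)
    if not members or len(members) == n:
        raise ValueError("gene set must overlap, but not cover, the genes")

    # A hit is weighted by its rank (n for the top gene, 1 for the bottom).
    hit_weights = [
        (n - i) ** alpha if gene in members else 0.0
        for i, gene in enumerate(ordered)
    ]
    total_hit = sum(hit_weights)
    miss_step = 1.0 / (n - len(members))

    # Walk down the ranked list and sum the gap between the two running
    # distributions (genes in the set versus genes outside it).
    score, p_hit, p_miss = 0.0, 0.0, 0.0
    for i, gene in enumerate(ordered):
        if gene in members:
            p_hit += hit_weights[i] / total_hit
        else:
            p_miss += miss_step
        score += p_hit - p_miss
    return score


# Toy sample: ten hypothetical genes with made-up expression values.
sample = {f"G{i}": float(value) for i, value in
          zip(range(1, 11), [90, 80, 70, 60, 50, 40, 30, 20, 10, 1])}
toy_stromal = ["G1", "G4"]   # hypothetical stromal signature
toy_immune = ["G2", "G9"]    # hypothetical immune signature

stromal_score = ssgsea_score(sample, toy_stromal)
immune_score = ssgsea_score(sample, toy_immune)
estimate_score = stromal_score + immune_score  # sum of the two scores
print(stromal_score, immune_score, estimate_score)
```

The point of the sketch is only that each score depends on where the signature genes sit in the within-sample ranking, so it can be computed on normalized RNAseq values just as well as on array values.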
I vaguely remember the opinion that statistical methods developed on array data are not suited to RNA-Seq, and that this has something to do with RNA-Seq being a zero-sum game (the total number of reads sequenced is fixed). But I cannot remember the details. Can you explain this in a bit more detail? Thanks
Why does the ESTIMATE score differ for the same TCGA ID between certain datasets? For example, for TCGA-BL-A3JM, Yoshihara (2013) reports an ESTIMATE value of -1365.01, whereas Aran (2015) reports 0.9193. Both use RNASeqV2.
Is this a reflection of differing calculation methods? What is the true ESTIMATE score used to calculate purity?
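One thing that may be worth checking here: plugging the -1365.01 value into the purity formula quoted earlier in this thread gives a number very close to 0.9193. A quick Python check (plain arithmetic, nothing from the ESTIMATE package):

```python
import math

score = -1365.01  # ESTIMATE value quoted above for TCGA-BL-A3JM
purity = math.cos(0.6049872018 + 0.0001467884 * score)
print(round(purity, 4))  # comes out at roughly 0.919
```

This suggests, but does not prove, that the second value is a purity obtained from the formula rather than a raw ESTIMATE score, so the two numbers may not be on the same scale. I would confirm that against the column descriptions of both tables before drawing conclusions.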
It may change with the dataset. Scores are calculated with ssGSEA, which does its calculations within each sample (sample by sample, without using all samples as background, if I am not mistaken). But even if, for example, both groups used TPM data from TCGA, if one group inputs all genes while the other inputs TPM data with lowly expressed genes removed, the rank of the genes within each sample changes. So it definitely affects the result. It would be helpful to understand the pre-processing steps applied to the RNA-seq data before it is given as input.
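A small Python sketch of the filtering effect described above. The gene names, TPM values, and the 1 TPM cutoff are all made up for illustration:

```python
def ranks(sample):
    """Return {gene: rank}, with rank 1 for the lowest expressed gene."""
    ordered = sorted(sample, key=sample.get)
    return {gene: position for position, gene in enumerate(ordered, start=1)}


# One hypothetical sample, before and after dropping lowly expressed genes.
sample_tpm = {"A": 0.2, "B": 0.5, "C": 3.0, "D": 12.0, "E": 40.0}
filtered_tpm = {gene: tpm for gene, tpm in sample_tpm.items() if tpm >= 1.0}

full_ranks = ranks(sample_tpm)
filtered_ranks = ranks(filtered_tpm)

for gene in filtered_tpm:
    print(gene, full_ranks[gene], "of", len(sample_tpm),
          "->", filtered_ranks[gene], "of", len(filtered_tpm))
```

The relative order of the surviving genes stays the same, but their rank values and the length of the ranked list both change. Since a rank-based enrichment score depends on both, the same sample can end up with a different score depending on the pre-filtering.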
@Haiyan Lei and @Raheleh, did you figure out how to use our own data to run the program? Please update if you have managed.