I have a bunch of HT-seq data from the TCGA and the accompanying clinical metadata. All the patients have cancer. At the end of the surveillance period, some of the patients are dead, others are not. Lets call the deceased patients group A and the surviving patients group B. Also I know when the patients who eventually died actually succumbed to their disease. I know that performing a KS-test for gene set enrichment using either gene set C and gene set D gives me a KS p-value << 0.00001 when the two groups of patients are A: alive at the end of surveillance and B: deceased at the end of surveillance. How to I investigate the hypothesis that pathway C and pathway D are acting together in the patients who do not survive surveillance?
As far as I can see, there are at least a couple approaches to take:
1) Do a KS-test on group A and B using gene set C. Repeat with gene set D. Conclude that pathway C and D are related and associated with patient death based on the algebraic mean of the two p-values.
2) Look only at the patients who died under surveillance and cluster them based on expression of Union(C, D), then compare the survival functions of the resulting clusters.
I'm not sure what packages are best to conduct this analysis, so suggestions are appreciated. Thanks!
Here is an updated strategy based on a recently published paper
0) Take all the patients in Group A
1) Run PCA on the FPKM-normalized counts for transcript IDs mapping to the symbols in Gene Set 1
2) denote PC1 as PC1-GS1
3) Run PCA on the FPKM-normalized counts for transcript IDs mapping to the symbols in Gene Set 2
4) denote PC1 as PC1-GS2
5) Run K-means clustering (let k=3) on patients in Group A, where each patient maps to a point in 2-space corresponding to PC1-GS1, PC2-GS2.
6) Plot survival curves for cluster 1:3
7) Perhaps high PC1-GS2, high PC1-GS1 correlates to worse survival than high PC1-GS2, low PC1-GS1 or low PC1-GS2, high PC1-GS1
Here is the issue:
All the normalized mRNA-seq data that I've downloaded from the GDC shows counts mapped to Ensembl transcript IDs, but the gene sets that I've downloaded from MSigDB in the form of Entrez Gene IDs!
There is apparently a way to map Ens Transcript IDs to Entrez GIDs using the R package biomaRt, however the query that I have written (sample below uses only 5 transcript IDs) is failing.
>library(biomaRt)
>ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")
>getBM(attributes = c('entrezgene'), filters = 'ensembl_transcript_id', values = c("ENSG00000270112.3", "ENSG00000167578.15", "ENSG00000273842.1", "ENSG00000078237.5", "ENSG00000146083.10"), mart = ensembl)
[1] entrezgene
<0 rows> (or 0-length row.names)
I'm shocked that GDC, being an NCI entity doesn't use Entrez annotation for everything but such is life...
Hi Kevin, You responses prompted me to think a bit more about what exactly I am trying to show. In particular your suggestion about clustering led me to several important works that have already been published. I've come up with a new workflow that incorporates this type of analysis, but hit one significant snag.
Hi, regarding the gene annotation, yes, it's a major pain in the neck and represents a barrier to conducting proper interpretation of results. We have gene HGNC symbols, like BRCA1, TP53, ATM, etc., but many people use alternative names for these, some of which are entirely different. Always going by unique numerical identifiers like Ensembl and Entrez helps.
The issue you have is that your transcripts are ensembl gene IDs, and also you have them as gene isoforms (indicated by the suffix at the end of each). If you run the following, you'll find matches (but not for all):
I'm not sure why it could not even find ENSG00000273842 in the Ensembl gene database. Also, not that ENSG00000270112 is not in Entrez
The other approach that you've just posted looks interesting. I recently published a clustering approach where I then noted statistically significant differences (related to vitamin D) between the three identified groups
This problem is due to retired ensembl ID's.
Thanks genomax - useful to know
Thanks Kevin, removing the transcript suffix allowed me to map them to Entrez IDs as Ensembl genes.
Great. Please create a new thread/question if you run into any other issues further along the line. It's good to get diverse opinions. Together we are stronger!