Question

Linking patient survival to gene-set analysis

1

Entering edit mode

7.8 years ago

mk ▴ 310

I have a bunch of HT-seq data from the TCGA and the accompanying clinical metadata. All the patients have cancer. At the end of the surveillance period, some of the patients are dead, others are not. Lets call the deceased patients group A and the surviving patients group B. Also I know when the patients who eventually died actually succumbed to their disease. I know that performing a KS-test for gene set enrichment using either gene set C and gene set D gives me a KS p-value << 0.00001 when the two groups of patients are A: alive at the end of surveillance and B: deceased at the end of surveillance. How to I investigate the hypothesis that pathway C and pathway D are acting together in the patients who do not survive surveillance?

As far as I can see, there are at least a couple approaches to take:

1) Do a KS-test on group A and B using gene set C. Repeat with gene set D. Conclude that pathway C and D are related and associated with patient death based on the algebraic mean of the two p-values.

2) Look only at the patients who died under surveillance and cluster them based on expression of Union(C, D), then compare the survival functions of the resulting clusters.

I'm not sure what packages are best to conduct this analysis, so suggestions are appreciated. Thanks!

Here is an updated strategy based on a recently published paper

0) Take all the patients in Group A

1) Run PCA on the FPKM-normalized counts for transcript IDs mapping to the symbols in Gene Set 1

2) denote PC1 as PC1-GS1

3) Run PCA on the FPKM-normalized counts for transcript IDs mapping to the symbols in Gene Set 2

4) denote PC1 as PC1-GS2

5) Run K-means clustering (let k=3) on patients in Group A, where each patient maps to a point in 2-space corresponding to PC1-GS1, PC2-GS2.

6) Plot survival curves for cluster 1:3

7) Perhaps high PC1-GS2, high PC1-GS1 correlates to worse survival than high PC1-GS2, low PC1-GS1 or low PC1-GS2, high PC1-GS1

Here is the issue:

All the normalized mRNA-seq data that I've downloaded from the GDC shows counts mapped to Ensembl transcript IDs, but the gene sets that I've downloaded from MSigDB in the form of Entrez Gene IDs!

There is apparently a way to map Ens Transcript IDs to Entrez GIDs using the R package biomaRt, however the query that I have written (sample below uses only 5 transcript IDs) is failing.

>library(biomaRt)
>ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")
>getBM(attributes = c('entrezgene'), filters = 'ensembl_transcript_id', values = c("ENSG00000270112.3", "ENSG00000167578.15", "ENSG00000273842.1", "ENSG00000078237.5", "ENSG00000146083.10"), mart = ensembl)
[1] entrezgene
<0 rows> (or 0-length row.names)

I'm shocked that GDC, being an NCI entity doesn't use Entrez annotation for everything but such is life...

GSEA survival RNA-Seq TCGA • 3.0k views

ADD COMMENT • link updated 7.8 years ago by Kevin Blighe 89k • written 7.8 years ago by mk ▴ 310

score 4 · Accepted Answer · 2017-11-25

4

Entering edit mode

7.8 years ago

Kevin Blighe 89k

Hi, you asked me to respond from our other thread: C: GDC API query to map HT-seq FPKM UUID to a Case UUID

As I understand, you have two sets of genes and they both appear to be roughly equal in terms of statistical significance through Kaplan-Meier Survival with death as endpoint?

How did you arrive at these gene sets? - from KEGG or some other pathways database? I would just take the P value from survival and use that as an excellent starting point for the following:

How does the survival look with the gene sets combined?
What's the AUC of each gene set (and both combined) for predicting death through binary logistic regression and ROC analysis? - see A: Resources for gene signature creation
Can you refine each gene set (and both combined) via stepwise logistic regression or lasso-penalised regression inorder to pick the best predictors? - see A: Resources for gene signature creation and Multinomial elastic net implementation on microarray dataset , error at lognet and nfold explanation
How does a network plot with community structure identified for the gene sets combined look? - see Network plot from expression data in R using igraph and STRING

1) Do a KS-test on group A and B using gene set C. Repeat with gene set D. Conclude that pathway C and D are related and associated with patient death based on the algebraic mean of the two p-values.

Depends what you mean by 'related' (?). For example, 2 genes may show 0.999 Pearson correlation in a dataset of 1000 samples but may still have absolutely no relation to each other. You could say that they are related in terms of stats/numbers but not conclude any biological relationship.

2) Look only at the patients who died under surveillance and cluster them based on expression of Union(C, D), then compare the survival functions of the resulting clusters.

A simple cluster dendrogram with heatmap would be quite useful, yes. Based on survival, I'd asume some segregation of those who died versus those who didn't. See How to generate PCA plot and heatmaps in R and How to plot a heatmap with two different distance matrices for X and Y for ideas on heatmaps

ADD COMMENT • link 7.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, You responses prompted me to think a bit more about what exactly I am trying to show. In particular your suggestion about clustering led me to several important works that have already been published. I've come up with a new workflow that incorporates this type of analysis, but hit one significant snag.

ADD REPLY • link 7.8 years ago by mk ▴ 310

0

Entering edit mode

Hi, regarding the gene annotation, yes, it's a major pain in the neck and represents a barrier to conducting proper interpretation of results. We have gene HGNC symbols, like BRCA1, TP53, ATM, etc., but many people use alternative names for these, some of which are entirely different. Always going by unique numerical identifiers like Ensembl and Entrez helps.

The issue you have is that your transcripts are ensembl gene IDs, and also you have them as gene isoforms (indicated by the suffix at the end of each). If you run the following, you'll find matches (but not for all):

getBM(attributes=c('entrezgene', 'ensembl_gene_id'), filters='ensembl_gene_id', values = c("ENSG00000270112", "ENSG00000167578", "ENSG00000273842", "ENSG00000078237", "ENSG00000146083"), mart=ensembl, uniqueRows=TRUE)
  entrezgene ensembl_gene_id
1      57103 ENSG00000078237
2      22838 ENSG00000146083
3      53916 ENSG00000167578
4         NA ENSG00000270112

I'm not sure why it could not even find ENSG00000273842 in the Ensembl gene database. Also, not that ENSG00000270112 is not in Entrez

The other approach that you've just posted looks interesting. I recently published a clustering approach where I then noted statistically significant differences (related to vitamin D) between the three identified groups

ADD REPLY • link 7.8 years ago by Kevin Blighe 89k

0

Entering edit mode

This problem is due to retired ensembl ID's.

ADD REPLY • link 7.8 years ago by GenoMax 153k

0

Entering edit mode

Thanks genomax - useful to know

ADD REPLY • link 7.8 years ago by Kevin Blighe 89k

0

Entering edit mode

Thanks Kevin, removing the transcript suffix allowed me to map them to Entrez IDs as Ensembl genes.

ADD REPLY • link 7.7 years ago by mk ▴ 310

0

Entering edit mode

Great. Please create a new thread/question if you run into any other issues further along the line. It's good to get diverse opinions. Together we are stronger!

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k