I have transcript-level TPM expression values for cancer and normal samples across all stages of BRCA. The data are RNA-seq from TCGA. I want to calculate a p-value, i.e. the significance of the difference between normal and cancer samples. Can I apply Student's t-test for this purpose?
Total cancer samples = 1000
Total normal samples = 1100
Best not to: a Student's t-test assumes approximately normally distributed data, and TPM values are typically heavily right-skewed.
Why not use a program dedicated to differential expression analysis? Obtain the RSEM counts for BRCA and process them through edgeR or DESeq2. Then you can obtain more 'credible' statistics.
I have already computed the stats for BRCA for both matched T-N and unmatched T-N, if you want.
Related thread: A: Normalization of TCGA RNA-seq data (TPM) in R
Ok, thanks for your reply. Suppose I have expression values for different transcripts and have to assign a gene ID/symbol to each transcript. What will the expression level of the gene be in that case? Will it be the average of the transcripts that belong to that gene, or the expression of the transcript with the highest value? Is there any reference or paper on this?
Do you mean different transcript isoforms? You can paste some examples of your data, if you want (?)
Yes, sure. The example data is:
uc010dfa.1 0.1056 2.1825 ABCA10
uc002jhz.2 0.251 4.4502 ABCA10
uc010wqs.1 0.3977 5.294 ABCA10
uc010dfb.1 0.0484 0.5487 ABCA10
uc010dfc.1 0.0976 1.1013 ABCA10
uc010wqt.1 0.0743 0.386 ABCA10
The columns are the transcript ID, expression in cancer cells (TPM), expression in normal cells (TPM), and the related gene. I have to assign an expression value to each gene, so what will the expression of ABCA10 be? Should I average the expression of all its transcripts, or take the transcript with the highest expression? Please suggest.
If you have no other information apart from this, such as gene length, then you may obtain the average expression for each gene. It is not ideal, but may be the best option given the data that you've got.
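As a rough illustration of that suggestion, here is a minimal Python sketch that groups the six ABCA10 rows posted above by gene and summarises each condition. The function name `summarise_by_gene` and the choice of statistics are my own for illustration. The sum is included only for comparison, since tximport builds its gene-level abundance by summing transcript abundances within each sample; whether that carries over to values that have already been averaged across samples is an assumption.

```python
from collections import defaultdict
from statistics import mean, median

# Rows copied from the example data in this thread:
# (transcript_id, tumour_TPM, normal_TPM, gene_symbol)
rows = [
    ("uc010dfa.1", 0.1056, 2.1825, "ABCA10"),
    ("uc002jhz.2", 0.2510, 4.4502, "ABCA10"),
    ("uc010wqs.1", 0.3977, 5.2940, "ABCA10"),
    ("uc010dfb.1", 0.0484, 0.5487, "ABCA10"),
    ("uc010dfc.1", 0.0976, 1.1013, "ABCA10"),
    ("uc010wqt.1", 0.0743, 0.3860, "ABCA10"),
]


def summarise_by_gene(rows):
    """Group transcript rows by gene and summarise each condition."""
    grouped = defaultdict(lambda: {"tumour": [], "normal": []})
    for _transcript, tumour, normal, gene in rows:
        grouped[gene]["tumour"].append(tumour)
        grouped[gene]["normal"].append(normal)

    summary = {}
    for gene, conditions in grouped.items():
        summary[gene] = {
            condition: {
                "mean": mean(values),
                "median": median(values),
                "max": max(values),   # the "highest isoform" option
                "sum": sum(values),   # for comparison only (see note above)
            }
            for condition, values in conditions.items()
        }
    return summary


summary = summarise_by_gene(rows)
for condition in ("tumour", "normal"):
    stats = summary["ABCA10"][condition]
    print(condition, {name: round(value, 4) for name, value in stats.items()})
```

Whichever summary you pick, apply the same one to both columns so that the tumour and normal values stay comparable.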
Ok, if gene length is also available, what would be the method to calculate gene expression?
Wait, actually, your data is already normalised, so, it may already have been adjusted for gene lengths. If you want to look further into it, tximport and DESeq2 take gene lengths into account for the purposes of normalisation. In your situation, it may very well be sufficient to just obtain the mean expression over the transcript isoforms. Otherwise, you could download the raw counts from TCGA and re-process them, but this could waste a lot of time.
Is there any reference quoting this?
Kevin Blighe is a reference.
Actually, I am asking for citation purposes.
Devon Ryan is a reference, too (I asked him).
Your data has already been heavily processed and is merely a summarisation (likely by mean) of the expression of tumour and normal samples. So, now your entire numerical data is just 2 columns: one for tumor; one for normal.
Unless you have the original data from which these were generated, there's not much else you can do other than summarise over the transcript isoforms by median or mean. If you wanted to go back and ensure that transcript lengths were taken into account, then you'd have to obtain the original data, which would encompass raw or estimated counts over all samples.
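To show where transcript length enters, here is a minimal sketch of the standard TPM calculation for a single sample, starting from counts. The counts, lengths, and gene assignments below are made up purely for illustration, and in a real workflow you would normally use effective lengths from the quantification tool rather than raw annotated lengths.

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert per-transcript counts to TPM within one sample.

    Each count is first divided by the transcript length (in kb) to get a
    per-length rate, and the rates are then scaled to sum to one million.
    """
    rates = [count / length for count, length in zip(counts, lengths_kb)]
    total_rate = sum(rates)
    return [rate / total_rate * 1e6 for rate in rates]


# Hypothetical values for three transcripts in one sample.
counts = [500, 1000, 1500]
lengths_kb = [1.0, 4.0, 1.5]
genes = ["G1", "G1", "G2"]  # hypothetical transcript-to-gene mapping

tpm = counts_to_tpm(counts, lengths_kb)

# Gene-level TPM within a sample: add up the TPM of the gene's transcripts.
gene_tpm = {}
for gene, value in zip(genes, tpm):
    gene_tpm[gene] = gene_tpm.get(gene, 0.0) + value

print([round(value, 2) for value in tpm])
print({gene: round(value, 2) for gene, value in gene_tpm.items()})
```

Note that the second transcript has more counts than the first but a lower TPM, because it is four times longer. That length adjustment is the part you cannot recover from the two summarised columns alone.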
Chances are that you didn't obtain this data directly from the TCGA, as the TCGA does not output data in that highly summarised format. Third parties that use and re-process the TCGA data do output data in the format that you have, though. Check the exact data processing methods at the source from which you obtained this data.
If you have absolutely nothing else other than this data and don't know how to obtain the original raw data and re-process it faithfully, then perhaps the best that you can do is indeed a Mann-Whitney test (a rank-based, non-parametric alternative to the t-test). It's really not ideal, though, and it's annoying that third parties would output data like this for end-users.
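For what it's worth, here is a minimal sketch of a two-sided Mann-Whitney U test for one transcript, written with only the Python standard library and using the normal approximation with a tie and continuity correction. It needs per-sample values (such as the ~1000 tumour and ~1100 normal TPM values from the original question), not the two summarised columns. The TPM values below are hypothetical, and in practice `scipy.stats.mannwhitneyu` would be the more robust choice.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic for x, approximate two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])

    # Assign midranks so that tied values share the average of their ranks.
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    n = n1 + n2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:  # every value identical: no evidence of a difference
        return u_x, 1.0

    z = max(abs(u_x - mu) - 0.5, 0.0) / sigma  # continuity correction
    p_value = 2 * (1 - NormalDist().cdf(z))
    return u_x, min(p_value, 1.0)


# Hypothetical per-sample TPM values for a single transcript.
tumour = [0.10, 0.20, 0.30, 0.25, 0.15, 0.05, 0.40, 0.35]
normal = [2.10, 4.40, 5.20, 0.50, 1.10, 0.45, 3.00, 2.50]

u_stat, p_value = mann_whitney_u(tumour, normal)
print(f"U = {u_stat}, p = {p_value:.4g}")
```

If you run this over many transcripts, remember to adjust the resulting p-values for multiple testing.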