Question

analysis of association of expression data and tumor phenotypes

0

7.0 years ago

afzaljf911 ▴ 20

Dear all, I have raw counts, RPKM, log2 RPKM, RSEM and log2 RSEM values from RNA-seq data. Which of these is suitable for association analysis? For example, I have 2 sub-types of tumor and I want to compare expression of the p53 gene between them. Which measure is acceptable for this analysis? Can I analyse expression data with an independent-samples t-test? I would be very grateful if you could help me do this in the best way. Regards,

RNA-Seq • 1.5k views

updated 7.0 years ago by Kevin Blighe 89k • written 7.0 years ago by afzaljf911 ▴ 20

score 1 · Answer 1 · 2018-05-19

1

7.0 years ago

Kevin Blighe 89k

RPKM is not appropriate for cross-sample comparisons in differential expression analysis. The 'raw' RSEM counts are appropriate.

Use the raw RSEM counts: import them with tximport and pass them to DESeq2, which performs the normalisation. There is a helpful tutorial available here: Transcript abundance files and tximport input.

If you cannot get that running, then you just need to extract the raw_count values from the RSEM files, round them to integer values, merge these into a data-frame, and then use this as input to DESeq2 with DESeqDataSetFromMatrix().
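The extract, round and merge steps can be sketched roughly as follows. This is only an illustration in plain Python, not the tximport/DESeq2 workflow itself: the sample names, the inline file contents and the helper names `read_raw_counts` and `merge_counts` are all hypothetical, and it assumes each RSEM gene table is tab-separated with `gene_id` and `raw_count` columns.

```python
import csv
import io

# Hypothetical per-sample RSEM gene tables (tab-separated).
# In practice these would be read from the files on disk.
SAMPLE_TABLES = {
    "sample_A": "gene_id\traw_count\nTP53\t1520.49\nESR1\t88.40\nATM\t301.00\n",
    "sample_B": "gene_id\traw_count\nTP53\t2210.73\nESR1\t4075.20\nATM\t295.51\n",
}


def read_raw_counts(text):
    """Return {gene_id: rounded raw_count} for one sample's table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # RSEM counts are fractional estimates; count-based tools expect integers.
    return {row["gene_id"]: round(float(row["raw_count"])) for row in reader}


def merge_counts(sample_tables):
    """Merge per-sample counts into a genes x samples integer matrix."""
    per_sample = {name: read_raw_counts(text) for name, text in sample_tables.items()}
    samples = sorted(per_sample)
    # Keep only genes that are present in every sample.
    shared = set.intersection(*(set(counts) for counts in per_sample.values()))
    genes = sorted(shared)
    matrix = [[per_sample[s][g] for s in samples] for g in genes]
    return genes, samples, matrix


genes, samples, matrix = merge_counts(SAMPLE_TABLES)
print("gene", *samples, sep="\t")
for gene, row in zip(genes, matrix):
    print(gene, *row, sep="\t")
```

The resulting table (genes as rows, samples as columns, integer counts) is the shape of input that a count-matrix constructor such as DESeqDataSetFromMatrix() expects, together with a separate table describing the samples.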

Everything that you need is in the tutorial to which I pointed you.

Kevin

7.0 years ago by Kevin Blighe 89k

0

Thank you Kevin. Can I use the raw RSEM counts for basic statistical analysis in SPSS? Also, how can I categorize expression as low or high? Is there any cut-off, or should I use the mean of expression?

7.0 years ago by afzaljf911 ▴ 20

1

SPSS cannot normalise the raw counts, as far as I know (?). In order to conduct 'faithful' differential expression analyses, you will have to normalise the raw counts.

What is your sample size? - just 2 samples?

7.0 years ago by Kevin Blighe 89k

1

Also, can you elaborate on where you obtained the data? If you have minimal experience with normalising RNA-seq data, then your best option may be to use SPSS on the log2 RSEM counts. I can only assume that these are at least normalised.

If your sample size is only 2, though, then you can only calculate ratios between your 2 samples and not convincingly calculate any p-values. For example, GeneX is 2-fold higher in tumour sub-type 1 compared to sub-type 2, etc.
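As a rough sketch of the ratio calculation, assuming one expression value per sample: the function names and the pseudocount of 1 (used to avoid dividing by zero) are illustrative choices, not something prescribed in this thread. If the values are already on the log2 scale, as with log2 RSEM, the fold change is just 2 raised to the difference.

```python
import math


def log2_fold_change(expr_a, expr_b, pseudocount=1.0):
    """log2 ratio of two normalised (non-log) expression values."""
    return math.log2((expr_a + pseudocount) / (expr_b + pseudocount))


def fold_change_from_log2(log2_a, log2_b):
    """Fold change when both inputs are already log2-transformed."""
    return 2 ** (log2_a - log2_b)


# Hypothetical values for GeneX in sub-type 1 and sub-type 2.
print(log2_fold_change(2047, 1023))
print(fold_change_from_log2(11.0, 10.0))
```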

7.0 years ago by Kevin Blighe 89k

0

Thank you very much. I have the TCGA data on breast cancer. I got the log2 RSEM values from FireBrowse. I am going to compare the level of p53 expression between luminal and TNBC types, and also between ER-positive and ER-negative tumors (is there any difference in the level of p53 expression between the 2 groups?). Data derived from real-time PCR are expressed as log2 delta-Ct. One of my coworkers says log2 RPKM is the same as log2 delta-Ct, so we can use it to analyse the association of expression level with tumor type or receptor status. I am not sure about RPKM. I am not an expert in R and its packages, so I want to use SPSS, but I want to be sure about the data. If it is not possible to do this with SPSS, I will do it with R packages. Do you have any ideas about this?

7.0 years ago by afzaljf911 ▴ 20

1

I'm not sure of the validity of stating that log2 RPKM is the same as log2 delta-Ct - apologies to your colleague. In fact, Ct values are already measured on the logarithmic scale, so logging them again with log2 is seemingly unnecessary. The normalisation in the two methods is also entirely different...

As you obtained the RSEM data from FireBrowse, it should already be normalised. What you could do is check the distribution of the log2 RSEM data in order to see if it fits the normal distribution (the 'bell curve'). This can be done by just generating a histogram from the data (in R, the hist() function). If it looks to have a normal distribution, then you can justify the use of parametric statistical tests to compare between groups.
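A minimal sketch of that two-step idea: bin the log2 values to eyeball the distribution, then compare the groups with a t statistic. The expression values below are made up purely for illustration, and the helper names are hypothetical. The test shown is Welch's unequal-variance variant, which is one reasonable choice rather than the only one. Turning the statistic into a p-value needs the t distribution, which R, SPSS or any statistics package provides, so it is not computed here.

```python
import math
import statistics


def histogram_counts(values, n_bins=5):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against all-identical values
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # max value goes in last bin
        counts[idx] += 1
    return counts


def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    se_a = statistics.variance(group_a) / len(group_a)
    se_b = statistics.variance(group_b) / len(group_b)
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Made-up log2 expression values for one gene in two tumour groups.
luminal = [9.8, 10.1, 10.4, 9.9, 10.3]
tnbc = [10.9, 11.2, 10.7, 11.4, 11.0]

# Crude text histogram of all values pooled together.
for i, count in enumerate(histogram_counts(luminal + tnbc)):
    print(f"bin {i}: {'#' * count}")

t_stat, dof = welch_t(luminal, tnbc)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

With only a handful of samples a histogram says little about normality; with the hundreds of tumours in TCGA it is far more informative.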

If you must use SPSS, then you can most likely justify it. My preference would be to just get the raw RSEM counts from the TCGA GDC Data Portal and then process those in edgeR or DESeq2, but this requires some experience in using these programs and also in R. Where are you aiming to publish the work? Is it a major part of your project or do you just want some p-values as a sort of small validation of other work?

For what it's worth, my PhD was in breast cancer and I believe that TP53 expression would be higher in TNBC than other types. I published an obscure study years ago that hypothesised how, although the TNBC sub-types can start as ER-positive/luminal, they eventually lose their ER expression due to heightened TP53 expression, which pushes the cells through vast numbers of cell-cycles and depletes the stem cell compartments. I hypothesised that TP53 is higher if these cells are defective in double-strand break repair, which explains why BRCA1 carriers are more likely to be TNBC. In this way, ATM may also be higher.

7.0 years ago by Kevin Blighe 89k

0

Thank you very much, dear Kevin. It is part of my thesis and I want it for publication, so I need to be sure about the data. Maybe I need to learn R and its packages. I am going to analyse several genes as well as p53. Is there an easy and fast way to learn, or do I need training?

7.0 years ago by afzaljf911 ▴ 20

1

There are lots of tutorials online. I also have some very basic tutorials on my GitHub page: https://github.com/kevinblighe/Rtutorials

Once you feel comfortable, you can come back here to post new questions about specific topics, of course. Another good place is the Bioconductor Support forum.