Hello All,
I have retrieved the data matrix from TCGA breast invasive carcinoma (BRCA) - expression data, The data is Level_3 Data (file names: *.rsem.genes.normalized_results
) downloaded from TCGA DCC, log2(x+1) transformed, and processed data
I have quantile normalized log2-transformed RPKM data from TCGA and I wanted to find DE genes based on B statistic and log2foldchange.
Sample data is as follows:
Normalised data matrix for 113 Normal Vs 60 Breast cancer patients
Gene_ID TCGA-A7-A0CE-11 TCGA-BH-A0AY-11
ARHGEF10L 9.998 9.768
HIF3A 6.1594 7.7241
RNF17 4.1813 0
RNF10 11.7961 11.7325
I have quantile normalized log2-transformed RPKM data from TCGA and I wanted to find DE genes based on B statistic and log2foldchange.
I'm using the following R-script:
library(limma)
raw.data <- read.delim("INPUT-QuantileNormalizedLog2Transformed-RPKM-Data.txt")
attach(raw.data)
names(raw.data)
d <- raw.data[, 2:5]
rownames(d) <- raw.data[, 1]
##To design matrix---
Group<-factor(pheno$Status,levels=levels(pheno$Status))
design<-model.matrix(~0+Group)
###Assigning colnames###
colnames(design)<-c("Control", "Tumor")
###To assign the designed matrix in linear model using limma
fit <-lmFit(eset,design)
###Designing contrast matrix#
cont.wt<-makeContrasts("Tumor-Control",levels=design)
fit2 <-contrasts.fit(fit,cont.wt)
## eBayes step for calculating p-values, fold change etc###
fit3<-eBayes(fit2)
Example of my Pheno File:
Sample Status pheno
TCGA-A1-A0SN-01 Tumor 1
TCGA-A8-A08S-01 Tumor 1
TCGA-AC-A23C-01 Tumor 1
TCGA-AQ-A0Y5-01 Tumor 1
TCGA-AQ-A1H2-01 Tumor 1
TCGA-A7-A0CE-11 Control 0
TCGA-BH-A0AY-11 Control 0
TCGA-A7-A0CH-11 Control 0
TCGA-BH-A0BW-11 Control 0
TCGA-BH-A208-11 Control 0
sessionInfo()
R version 3.2.0 beta (2015-04-01 r68134) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: CentOS release 6.6 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] limma_3.24.3 affy_1.46.0 Biobase_2.28.0
[4] BiocGenerics_0.14.0
loaded via a namespace (and not attached):
[1] zlibbioc_1.14.0 BiocInstaller_1.18.1 tools_3.2.0
[4] affyio_1.36.0 preprocessCore_1.30.0
I am getting the following output:
GENE_SYMBOL coefficients df.residual sigma stdev.unscaled Amean s2.post t df.total p.value lods F F.p.value
ARHGEF10L -0.608719705 171 0.495810737 0.159737986 9.65395 0.244892 -7.70056 172.05824 1.02E-12 17.86426 59.29856 1.02E-12
HIF3A -5.727480133 171 0.495810737 0.159737986 9.65395 0.244892 -7.70056 172.05824 1.02E-12 17.86426 59.29856 1.02E-12
My Concern/Question:
- Can someone please let me know what I'm doing is correct procedure for finding out Differential gene expression analysis for two Groups (control Vs Normal in this case) or where I am going wrong here.
- I am bit confused about the Out put here, I am failing to sort data. Based on what stats should I take cut off for genes which are differentially Expressed (please have a look at the output, how to sort the data?).
Thanks a ton for your kind help
-Ateeq Khaliq
I am aware that this thread's been dormant for nearly 2 years but I've been dropped into something of a similar nature and found it interesting. In their user guide the authors of the limma package say: "In the limma approach to RNA-seq, read counts are converted to log2-counts-per-million (logCPM) and the mean-variance relationship is modelled either with precision weights or with an empirical Bayes prior trend. The precision weights approach is called “voom” and the prior trend approach is called “limma-trend”. " It would be my understanding that RSEM values could be used as well since they're based on log-normalised count values - or am I mistaken?