Question

How to select top ranking genes with variance from a time point experiment using R or Excel


9.5 years ago

herman.pappoe.45 ▴ 10

Hello everyone,

I have RNA-seq data from human cardiomyocyte samples collected at 5 time points during the development of the cells (Day 0, Day 2, Day 5, Day 15, Day 30). The model is hence a directed differentiation system. I am using a file with normalized RPM counts for each transcript ID from a previous transcriptome quantification step (with Cufflinks). I eventually plan on "grepping" these transcript IDs to the corresponding Gene_IDs. What I essentially have is a matrix with cuff.IDs and expression values in 5 columns representing the time points.

I want to build a gene regulatory network that encapsulates the differentiation process in our cardiomyocyte samples, using genes that are consistently differentially expressed throughout the differentiation time points. I was thinking of running a differential expression analysis of each time point against Day 0, using Day 0 as the control, and then selecting the genes that remain differentially expressed in all comparisons (Day 0 vs Day 2, Day 0 vs Day 5, Day 0 vs Day 15, Day 0 vs Day 30). My intention was to perhaps rerun DESeq2 in R in this manner.

However, when I mentioned this idea to my PI, I was told that I could instead calculate the covariance among the samples, rank the genes, and select the top few genes using Excel. I have no idea how to approach this in Excel. I am completely inexperienced in bioinformatics, programming, and statistics, and I barely used a PC until 5 months ago. I would appreciate a step-by-step tutorial on how to approach this in Excel for my specific project. I am aware there are many tutorials out there, but none are clear and they are causing more confusion for me. For example, when I calculate the covariance between two lists of genes, the result is only one value. What can I do with this covariance value in Excel in order to rank the genes by covariance?

My supervisor instructed me to use R to get these results. However, I am terrible with R. I cannot even figure out which function to use to read the file; read.table is giving me some issues. This is the command line that my supervisor advised me to use to obtain the variance from the list:

topVarGenes <- head(order(rowVars(data[,2:6]),decreasing=TRUE),15)
gene_lists <- cbind(data[topVarGenes,], rowVars(data[topVarGenes,2:6]))
write.table(gene_lists,file='topVarGenes.txt',quote=FALSE,sep="\t")
###So the rowvars are calculating the covariance and order and ranking them.

The above is just not working. I think it might have to do with how I loaded the data, but I am so inexperienced in R that I am not certain what the issue is. I am speculating that maybe it should be a data.frame. I would be much obliged if I could get a step-by-step R command line to get the results I need.
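A minimal sketch of loading such a file, assuming it is tab-delimited, has a header line, and keeps the transcript IDs in the first column. None of that is confirmed by the post, so the example writes a tiny temporary file from the values shown here instead of touching the real data. The column names (`ID`, `Day0`, ...) are hypothetical.

```r
# Build a small tab-delimited file from the first rows shown in this post.
# The real file layout is an assumption.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "ID\tDay0\tDay2\tDay5\tDay15\tDay30",
  "CUFF.1\t0\t0\t2.297569688\t0.876671707\t4.140347772",
  "CUFF.10\t2.626527804\t0\t9.19027875\t8.766717072\t4.140347772",
  "CUFF.10000\t330.9425034\t209.1708523\t785.7688332\t642.6003614\t785.6309897"
), tmp)

# header = TRUE : the first line holds the column names
# row.names = 1 : the first column (the IDs) becomes row names, not data
expr <- read.table(tmp, header = TRUE, sep = "\t", row.names = 1)

dim(expr)                 # rows (transcripts) x columns (time points)
str(expr)                 # every column should be reported as 'num'
sapply(expr, is.numeric)  # TRUE for each time point if the file parsed cleanly
```

Checking `dim()` and `str()` right after loading shows how many data columns there really are before any column indices are used. Without `header = TRUE`, `read.table` names the columns `V1`, `V2`, and so on.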

Also, if I wanted to instead compute the covariance against Day 0 for all samples, how would I modify the command line?

I know I have asked a lot of questions and I am very grateful in advance to whoever takes the time to respond.

time-point variance RNA-Seq R excel • 5.3k views

updated 3.1 years ago by Ram 45k • written 9.5 years ago by herman.pappoe.45 ▴ 10


Please can you add sample data and the output.

9.5 years ago by lmanohara99 ▴ 20


Yes I can, which output are you referring to?

updated 3.1 years ago by Ram 45k • written 9.5 years ago by herman.pappoe.45 ▴ 10

Ram · Answer 1 · 2016-02-14

This is what the Excel file looks like. Using the covariance function in Excel for Day 0 and Day 2, for example, I get a single resulting value: I selected the COVARIANCE.S function, highlighted the cells for Day 0 as the first array and the cells for Day 2 as the second array, and clicked Done. In a separate cell I get one resulting value from this function. Besides the fact that I get an error saying "Formula Omits Adjacent Cells", I am not clear on how to use COVARIANCE.S in Excel to rank all the transcript IDs. What should I do with the resulting value? Or is there a better approach to rank the "genes" (a.k.a. cuff.IDs/transcripts) by covariance using Excel?
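The single value is expected. Covariance pairs two columns across all genes and summarizes them as one number, so it cannot rank individual genes. A ranking needs a statistic computed per row, such as the variance across the five time points. A small R sketch of the difference, using values copied from the table in this post (the Excel cell range in the comment is hypothetical):

```r
# Day 0 and Day 2 values for the first four transcripts in the table
day0 <- c(0, 2.626527804, 330.9425034, 799.7777164)
day2 <- c(0, 0, 209.1708523, 440.7528674)

# Covariance of two columns: ONE number for the whole pair of columns,
# which is what COVARIANCE.S returns in Excel
cov(day0, day2)

# Variance of one transcript across its five time points: one number PER ROW
gene <- c(330.9425034, 209.1708523, 785.7688332, 642.6003614, 785.6309897)
var(gene)   # same idea as =VAR.S(B4:F4) typed next to that row in Excel
```

In Excel, the per-row version would be a `VAR.S` formula filled down a new column, followed by sorting the sheet on that column in descending order.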

            DAY 00          DAY 02          DAY 05          DAY 15          DAY 30
            0               2               5               15              30
CUFF.ID     0               0               2.297569688     0.876671707     4.140347772
CUFF.ID     2.626527804     0               9.19027875      8.766717072     4.140347772
CUFF.ID     330.9425034     209.1708523     785.7688332     642.6003614     785.6309897
CUFF.ID     799.7777164     440.7528674     1553.922965     1551.708922     2158.156276
CUFF.ID     0               0               0               0               0
CUFF.ID     0               1.067198226     1.531713125     14.02674732     8.280695543
CUFF.ID     1.313263902     0               4.595139375     2.630015122     2.070173886
CUFF.ID     2.626527804     2.134396452     0.765856563     0               0
CUFF.ID     5540.660403     4782.115251     4170.85484      3401.486224     3413.716738
CUFF.ID     23.63875024     34.15034324     13.78541813     23.67013609     19.66665192

I was also trying to use R to get the same results. I was sent a sample command line from my supervisor to help me with the computation. This is what the file uploaded in R looks like:

> head(data)
                   V2         V3          V4           V5          V6
CUFF.1       0.000000   0.000000    2.297570    0.8766717    4.140348
CUFF.10      2.626528   0.000000    9.190279    8.7667171    4.140348
CUFF.10000 330.942503 209.170852  785.768833  642.6003614  785.630990
CUFF.10001 799.777716 440.752867 1553.922965 1551.7089220 2158.156276
CUFF.10002   0.000000   0.000000    0.000000    0.0000000    0.000000
CUFF.10007   0.000000   1.067198    1.531713   14.0267473    8.280696

It was loaded with read.table: data <- read.table("file_1")

I installed the matrixStats package to run the following commands, to no avail:

> topVarGenes <- head(order(rowVars(data[,2:6]),decreasing=TRUE),15)
Error in head(order(rowVars(data[, 2:6]), decreasing = TRUE), 15) : 
  error in evaluating the argument 'x' in selecting a method for function 'head': Error in `[.data.frame`(data, , 2:6) : undefined columns selected

> gene_lists <- cbind(data[topVarGenes,], rowVars(data[topVarGenes,2:6]))
Error in `[.data.frame`(data, topVarGenes, ) : 
  object 'topVarGenes' not found

> write.table(gene_lists,file='topVarGenes.txt',quote=FALSE,sep="\t")
Error in is.data.frame(x) : object 'gene_lists' not found

I should add that my supervisor also reformatted the data file so that the Cuff.ID column would be row.names or something of the sort.
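The `head(data)` output above shows five data columns (V2 to V6), with the IDs held as row names. That fits the "undefined columns selected" error: `data[, 2:6]` asks for a sixth column that does not exist, and the two later errors follow from the first. A minimal sketch of the same ranking with the indices adjusted, rebuilt from the six rows printed above. Base R `apply(..., 1, var)` is used here so the example runs without extra packages. `matrixStats::rowVars()` on the same numeric matrix should give the same per-row variances.

```r
# First rows of the data exactly as printed by head(data) in this post
data <- data.frame(
  V2 = c(0, 2.626528, 330.942503, 799.777716, 0, 0),
  V3 = c(0, 0, 209.170852, 440.752867, 0, 1.067198),
  V4 = c(2.297570, 9.190279, 785.768833, 1553.922965, 0, 1.531713),
  V5 = c(0.8766717, 8.7667171, 642.6003614, 1551.7089220, 0, 14.0267473),
  V6 = c(4.140348, 4.140348, 785.630990, 2158.156276, 0, 8.280696),
  row.names = c("CUFF.1", "CUFF.10", "CUFF.10000", "CUFF.10001",
                "CUFF.10002", "CUFF.10007")
)

ncol(data)   # 5: the IDs are row names, so there is no column 6

# Work on a numeric matrix of all five time points
mat <- as.matrix(data)

# Variance of each row across the time points (one value per transcript)
row_var <- apply(mat, 1, var)

# Row indices sorted by decreasing variance, keeping at most the top 15
top <- head(order(row_var, decreasing = TRUE), 15)

# Expression values plus the variance used for the ranking
gene_lists <- data.frame(data[top, ], variance = row_var[top])
write.table(gene_lists, file = "topVarGenes.txt", quote = FALSE, sep = "\t")
```

This ranks transcripts by variance across the time points, not by covariance, and no columns are paired with each other. Highly expressed transcripts tend to dominate a raw-variance ranking, so a log transform before ranking may be worth considering.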

I have 2 important concerns:

1. I do not understand how this function is computing covariance. Is it pairing adjacent columns?
2. I want to be able to calculate the covariance and rank the genes ALL against Day 0, hence Day 0 vs Day 2, Day 0 vs Day 5, Day 0 vs Day 15, and Day 0 vs Day 30. Can this function do that? Is there a better approach to my idea?
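One possible way to compare every time point against Day 0 per gene is a log2 fold change, sketched below. This is a swapped-in technique, not covariance and not a DESeq2 analysis, and without replicates it is a descriptive ranking and not a statistical test. The row labels (`gene_a` to `gene_d`), the pseudocount of 1, and the 2-fold cutoff are all assumptions for illustration. The values are four rows from the table in this post.

```r
# Four rows from the table above; row labels are hypothetical
expr <- matrix(
  c(330.9425034, 209.1708523, 785.7688332, 642.6003614, 785.6309897,
    5540.660403, 4782.115251, 4170.85484,  3401.486224, 3413.716738,
    23.63875024, 34.15034324, 13.78541813, 23.67013609, 19.66665192,
    0,           1.067198226, 1.531713125, 14.02674732, 8.280695543),
  ncol = 5, byrow = TRUE,
  dimnames = list(c("gene_a", "gene_b", "gene_c", "gene_d"),
                  c("Day0", "Day2", "Day5", "Day15", "Day30"))
)

pseudo <- 1                       # assumed pseudocount, avoids log2(0)
log_expr <- log2(expr + pseudo)

# Subtract each gene's Day0 value from its later time points:
# columns are Day2, Day5, Day15, Day30 relative to Day0
lfc <- log_expr[, -1] - log_expr[, "Day0"]

# Rank genes by their largest absolute change relative to Day0
max_abs_lfc <- apply(abs(lfc), 1, max)
ranked <- sort(max_abs_lfc, decreasing = TRUE)

# Genes changed at least 2-fold (|log2FC| >= 1) at EVERY later time point
always_changed <- rownames(lfc)[apply(abs(lfc) >= 1, 1, all)]
```

Here `ranked` orders the genes by their strongest change from Day 0, and `always_changed` keeps only those that pass the cutoff in all four comparisons, which mirrors the "differentially expressed in all comparisons" idea from the question.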

Thank you very much for your response!