Question

Performing Anova Test For Ngs Time Series Datasets

0

Entering edit mode

10.7 years ago

bioinforupesh2009.au ▴ 140

Dear all,

Maybe its very basic question for you all.

I would like to perform 2 way ANOVA test for my NGS time series counts data. It is raw data counts by HT-Seq. I have 2 condition and for each corresponding diff. time points and their counts. Suggestion is to perform 2 way anova test. but problem is, I don't know the better way to prepare file for that in order to use R function like aov(). Help me to do so....

My raw counts (without normalization) like..

gene_name counts_value (for each condition and corresponding time points (36 sample in total) )

gene_1  2349
gene_2  3462380
gene_3  0
gene_4  123

Query:

How to prepare (format of input) input file in order to use R functions
Is there any package (bioconductor or R) to perform this analysis
May I have to normalize this raw counts before performing 2 way ANOVA

microarray bioconductor • 5.3k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 10.7 years ago by bioinforupesh2009.au ▴ 140

Ram · Answer 1 · 2014-03-04

4

Entering edit mode

10.7 years ago

Devon Ryan 104k

Given that these are raw counts, you really really don't want to use aov() or any other Anova test. What you want is to use a generalized linear model, which will respect the fact that these are integers rather than floats. Please have a read through the DESeq2 vignette, as that has facilities to directly deal with counts produced by htseq-count and is, in fact, written by the same authors.

ADD COMMENT • link updated 20 months ago by Ram 44k • written 10.7 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you sir for your kind advice but before analyzing differential expression gene analysis, I just wanted to use 2 way ANOVA test in order to investigate and model the relationship between a response variable and predictor variables. Just short of preliminary analysis.

and About DESeq2, indeed i am used to with this amazing package, not for time series analysis but for another standardized experiments (2 conditions and their replicates).

ADD REPLY • link 10.7 years ago by bioinforupesh2009.au ▴ 140

3

Entering edit mode

You still don't want an ANOVA then. ANOVA's require continuous variables, which you don't have (you have integer counts). You could read.delim() the count files in, cbind() those to create a matrix (make sure to remove the last 5 lines) and then use glm.nb() from MASS. You could instead use a glmm if you really wanted. Either of those would be better ideas than an ANOVA.

ADD REPLY • link updated 20 months ago by Ram 44k • written 10.7 years ago by Devon Ryan 104k

Ram · Answer 2 · 2014-03-04

2

Entering edit mode

10.7 years ago

Charles Warden 8.3k

I agree with dpryan79's concern about counts, but I think your strategy should be OK if you are working with normalized expression (like RPKM/FPKM). Once you use a tool like cufflinks, RSEM, etc., you should be OK. I think you can also use DESeq to produce normalized expression values, but that's not typically what I do. This is similar to what I do in the sRAP package:

http://www.bioconductor.org/packages/release/bioc/html/sRAP.html

If your institution has a license, you could use the RSEM-like mRNA quantification tool in Partek and pretty much follow the same workflow as described in this paper (except the pairing variable specifies time points instead of tumor-normal pairs):

http://bioinfo.aizeonpublishers.net/content/2013/6/285-292.html

That said, your ANOVA-based design will be asking what genes consistently change across time points. If you are trying to ask a different sort of question, you may want to take a look at some of the tools available in the timecourse package:

http://www.bioconductor.org/packages/2.12/bioc/html/timecourse.html

ADD COMMENT • link updated 20 months ago by Ram 44k • written 10.7 years ago by Charles Warden 8.3k

0

Entering edit mode

Thank your sir for your kind suggestion to use diff packages in R. i will surely check. But for your first para i can say, i have raw counts from directly after mapping. Moreover, this is not a RNAseq instead miRNA short reads. So i am not suppose to use RPKM/FPKM normalized expression counts. However, i am still confused that may i have to normalization of this raw counts before ANOVA or Negative Binomial Generalized Linear Model (as suggested by dpryan79).

Thank you once again for your time and advice

ADD REPLY • link 10.7 years ago by bioinforupesh2009.au ▴ 140

0

Entering edit mode

You don't need to normalize for gene length, but you do need to normalize for the number of reads per library.

For miRNA-Seq data, I would use CPM (count per million reads).

ADD REPLY • link 10.7 years ago by Charles Warden 8.3k