FunPat Differential Testing Plot Error
2
0
Entering edit mode
9.4 years ago
cacampbell ▴ 60

Some background / why FunPat. It seemed like fun, so I went through the trouble of formatting data and collapsing replicates on a data set that I have. FunPat seems like a good way to analyze patterns in data without purely doing a timepoint-by-timepoint differential expression test. It also seems to handle time series data in a way that is fast and easy, but not as intensive as other methods. So I would like to make this work.

Going through the vignette... I ran into some trouble at the first FunPat dependent test...

rank.res <- SEL.TS.AREA(replicates=data.replicates, data1=data.east.robust, data2=data.west.robust)
Error in plot.window(...) : need finite 'xlim' values

​Generally, I understand that this means there is a problem with the data that is being plotted -- perhaps that the x-values are all NA (not the case here) or that the data has somewhere been rounded to Inf / -Inf. However, this isn't extremely helpful for debugging, since it occurs within the FunPat package. If anyone here knows anything about this package, then experience with running it successfully is appreciated.

replicates:

> head(data.replicates) 
     [,1]         [,2]
[1,] "HA2015_E_1" "HA2015_W_1"
[2,] "HA2015_E_2" "HA2015_W_2"
[3,] "HA2015_E_3" "HA2015_W_3"
[4,] "HA2015_E_4" "HA2015_W_4"
[5,] "HA2015_E_5" "HA2015_W_5"
[6,] "HA2015_E_6" "HA2015_W_6"

> typeof(data.replicates)
[1] "character"

Conditions are E vs W and time points run 1-12 -- where each column that these represent are the mean of the TMM normalized, variance stabilized counts in 6 biological replicates. For example, HA2015_E_1 is the mean of normalized data in original columns / timepoints HA2015_E_1A, HA2015_E_1B, ... , HA2015_E_1F. After normalization, the data is filtered so that each gene with counts per million < 10 in < 3 total time points is removed. To make the final data points, I multiplied the normalized, stabilized cpm data (all of this is from edgeR) by 1e6

> head(data.east.robust)
                  HA2015_E_1 HA2015_E_2 HA2015_E_3 HA2015_E_4 HA2015_E_5
cp_gi_88656873.10    7515453    9725227   10727302   10255309    8591035
cp_gi_88656873.15   17622543   17279002   15038259   15230942   11827528
cp_gi_88656873.31   33014585   38909593   29762208   37263689   32190635
cp_gi_88656873.34   44122567   42952855   40310644   40730033   28628665
cp_gi_88656873.35   54416054   43932669   50318208   45582272   25039442
cp_gi_88656873.36  421778142  354396518  420075052  366674208  202177958
                  HA2015_E_6 HA2015_E_7 HA2015_E_8 HA2015_E_9 HA2015_E_10
cp_gi_88656873.10    9829677    9649200    8621957    5866552     6220681
cp_gi_88656873.15   13451452   14633426   14722655   17123265    15943894
cp_gi_88656873.31   34357137   38535700   36424192   37357025    31996346
cp_gi_88656873.34   27319291   27439998   37216480   37828356    27408666
cp_gi_88656873.35   20091613   21851825   38068246   39695883    27843067
cp_gi_88656873.36  169344826  182818905  291835384  297642154   219120517
                  HA2015_E_11 HA2015_E_12
cp_gi_88656873.10     9500501     7699961
cp_gi_88656873.15    21478783    16874638
cp_gi_88656873.31    45934377    36182478
cp_gi_88656873.34    58278341    53139325
cp_gi_88656873.35    57475077    63319437
cp_gi_88656873.36   447381294   495082542

> head(data.west.robust)
                  HA2015_W_1 HA2015_W_2 HA2015_W_3 HA2015_W_4 HA2015_W_5
cp_gi_88656873.10    8705507   10493190   10624282   10662117    9297593
cp_gi_88656873.15   20857781   18764049   17357275   16000770   12497117
cp_gi_88656873.31   38125384   41970061   32225874   34404035   37216502
cp_gi_88656873.34   60130445   43403453   40908678   33370083   30595589
cp_gi_88656873.35   63256459   46576562   50863253   48691533   28915060
cp_gi_88656873.36  529199609  371688856  441140850  409858560  242264349
                  HA2015_W_6 HA2015_W_7 HA2015_W_8 HA2015_W_9 HA2015_W_10
cp_gi_88656873.10    9688969   11052243    9525094    9070560     6051227
cp_gi_88656873.15   14216534   17881352   18182641   22626263    14543754
cp_gi_88656873.31   36686429   46805026   39827875   46339836    38197875
cp_gi_88656873.34   29431034   34863551   41118305   46427171    39922114
cp_gi_88656873.35   22425231   28769374   37997640   42338485    47715828
cp_gi_88656873.36  179806405  226524764  284215981  344190180   345058873
                  HA2015_W_11 HA2015_W_12
cp_gi_88656873.10     8564632     6967818
cp_gi_88656873.15    20864919    16855964
cp_gi_88656873.31    37395339    33958158
cp_gi_88656873.34    45881738    42315958
cp_gi_88656873.35    46794660    54719944
cp_gi_88656873.36   391511575   391274551
FunPat R rna-seq • 2.2k views
ADD COMMENT
1
Entering edit mode
9.4 years ago
cacampbell ▴ 60

For anyone that lands here from google, I did manage to figure out the problem in my case. I assigned replicates before doing normalization. Replicates for the FunPat package are identified by giving it the first value of the columns for each replicate as columns in the replicates structure. So, with names that did not match the actual replicates, I was essentially trying to plot nonsense.

In summary, make sure your replicates are the first columns of your treatments.

ADD COMMENT
0
Entering edit mode
9.2 years ago

For a correct usage of FunPat, I suggest reading carefully the guideline reported in section 5 of the vignette and trying the toy-example provided in the package. As far as I understood from the example proposed above, there are two conditions (E vs. W), 12 time points and, for each time point and condition, 6 replicates available (A,B,...,F). FunPat works only with normalized expression data, thus start to build the input data, i.e. "data1", "data2", "replicates", always AFTER the normalization step, e.g. TMM.

When you are going to build these input matrices, remember that "data1" and "data2" represent the two time series that you want to compare, whether the two columns of "replicates" represent the comparisons that you want to use to build the error model. Indeed, the null hypothesis for the bounded area is built by considering the difference abs(C1-C2), where C1 and C2 represent the two columns of the "replicates" matrix.

In the experimental design reported above, it is possible to have more than one replicate available across the entire time series in both conditions, which suggests the usage of FunPat as described in the scenario C in section 5 of the vignette. The scenario C shows the optimal experimental design to work with FunPat, since it is possible to mean the two replicates both to build "data1" and "data2" and to build the error model from the "replicates" matrix, thus avoiding any over-estimations of the biological variability. Indeed, in scenario C the treatment starts AFTER time 0, thus it is possible to use the treatment and control samples at time 0 for building the error model, averaging across the two corresponding replicates to create C1 and C2 of the "replicates" matrix. The reason is that at time 0 the "treatment" can be used like a replicate of the control condition.

However, most experimental designs, as the case above, have no "pre-treatment" condition that enables to use, as shown in scenario C, the doubled number of replicates at the same time point (e.g. the 4 replicates at time 0) to estimate the variability of the averaged expression value used in the comparison between "data1" and "data2" (e.g. the mean expression obtained from the averaging of the 2 replicates at all time points). Anyway, in these cases there are no problems in using FunPat, but for a good usage remember to:

  • compare replicates (i.e. first vs. second column of "replicates" matrix) of the same condition and time point. Addressing the example described above and considering the replicates HA2015_E_1A, HA2015_E_1B, HA2015_E_1C, HA2015_E_1D, HA2015_E_1E and HA2015_E_1F for the condition E at time point 1, in the matrix "replicates" you can use the average among replicates A, B, C in the first column and the average among replicates D, E, F in the second column. If you have more conditions and time points with replicates available, as in this case, you can concatenate the expression values of each comparison used, as shown in the scenario B of section 5 in the vignette.

  • build "data1", "data2" and the columns of "replicates" by always averaging the same number of replicates. Looking at the example, if you average across three samples (sets {A,B,C} and {D,E,F}) to build "replicates", to build "data1" and "data2" you have to mean across only three replicates too! In order to achieve the best robustness in the results, I suggest averaging as much as possible by maximizing the number of replicates used for the average to build the matrices "data1", "data2" and "replicates".

Therefore, for the experimental design described above, all the time points and conditions have at least 6 replicates, thus for the matrix "replicates" it is possible to average up to three replicates for each of the two columns, concatenating the comparisons at different time points and conditions. If you do this, you need to build "data1" and "data2" by averaging only three replicates and not all the 6 replicates available, otherwise the error model built from "replicates" over-estimates the variance that is used to compare "data1" vs. "data2". Since you use only 3 replicates for the comparison, I suggest running SEL.TS.AREA more times with different combinations of replicates (the parameters is.interactive and non.interactive of function SEL.TS.AREA can come in handy). If the data are well normalized, you should see slightly differences in the list of selected genes to be used as seeds/candidates in the clustering step. Since the clustering step also requires the p-values, you could consider for each gene the median p-value resulting from all the tested combinations.

What is the side-effect of the over-estimation of the variance? If you decide to average the expression in "data1" and "data2" by a higher number of replicates with respect to those used to build the columns in "replicates", the over-estimation of the variability lead to higher p-values, thus a more conservative selection of the genes as seeds/candidates. However, if you select less seed genes, the function-based clustering step is able to select possible false negatives from the set of candidate genes. In any case, my suggestion is, whenever is possible from the available experimental design, to preserve the same number of replicates for averaging across all the input matrices used for SEL.TS.AREA.

ADD COMMENT

Login before adding your answer.

Traffic: 1791 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6