Question

ANOVA and Principal Component Regression

10.1 years ago

adnanjaved1988 ▴ 80

I just need your valuable suggestions.

This is what my data frame looks like. These are background-subtracted values from 5 samples of microarray data.

A is the parent sample.
B, C, D and E are treatments. Among the treatments, B is the sample that is resistant to the drugs applied to it.

I have no duplicate miRNAs across the 5 samples, so instead of writing the miRNA names for every sample I just list them once. Each of the 5 samples has 2019 rows; each row represents a miRNA, and the expression values in that row differ from sample to sample.

                                          A        B         C         D        E
hsa-miR-199a-3p, hsa-miR-199b-3p         NA   13.13892  5.533703  25.67405       NA
hsa-miR-365a-3p, hsa-miR-365b-3p   15.70536   52.86558 18.467540 223.51424 31.93503
hsa-miR-3689a-5p, hsa-miR-3689b-5p       NA   21.41597  5.964772        NA 24.26073
hsa-miR-3689b-3p, hsa-miR-3689c     9.58696   44.56490 10.102051  13.26785       NA
hsa-miR-4520a-5p, hsa-miR-4520b-5p 18.06865   28.06991        NA        NA       NA
hsa-miR-516b-3p, hsa-miR-516a-3p         NA   10.77471  8.039662        NA       NA

For the ANOVA I reshaped my data frame into long format:

head(m)
                                 MiRNAs                Group    value
1                  hsa-miR-199a-3p, hsa-miR-199b-3p     A       NA
2                  hsa-miR-365a-3p, hsa-miR-365b-3p     A 15.70536
3 hsa-miR-3689a-5p, hsa-miR-3689b-5p, hsa-miR-3689e     A       NA
4                   hsa-miR-3689b-3p, hsa-miR-3689c     A  9.58696
5                hsa-miR-4520a-5p, hsa-miR-4520b-5p     A 18.06865
6                  hsa-miR-516b-3p, hsa-miR-516a-3p     A       NA
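For reference, a reshape like this can be sketched in base R. The small data frame `wide` below is a hypothetical stand-in built from the first two rows of the table in the question; the column names MiRNAs, Group and value follow the head(m) output, but how the original `m` was actually produced is not stated.

```r
# Hypothetical stand-in for the wide table (two rows copied from the post).
wide <- data.frame(
  A = c(NA, 15.70536),
  B = c(13.13892, 52.86558),
  C = c(5.533703, 18.467540),
  D = c(25.67405, 223.51424),
  E = c(NA, 31.93503),
  row.names = c("hsa-miR-199a-3p, hsa-miR-199b-3p",
                "hsa-miR-365a-3p, hsa-miR-365b-3p")
)

# stack() collects all sample columns into one 'values' column
# and records the source column in the factor 'ind'.
long <- stack(wide)

# Columns are stacked in order A, B, ..., so the row names repeat once per sample.
m <- data.frame(
  MiRNAs = rep(rownames(wide), times = ncol(wide)),
  Group  = long$ind,
  value  = long$values
)

head(m)
```

Keeping the result in one data frame also allows the model to be written as aov(value ~ Group, data = m) rather than with m$ inside the formula.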

There are 2019 miRNA rows for sample A, 2019 for sample B, and so on. I then fitted ANOVA1 <- aov(m$value ~ m$Group) and ran TukeyHSD:

TukeyHSD(ANOVA1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = m$value ~ m$Group)

$`m$Group`
          diff        lwr       upr     p adj
B-A   73.87304  -88.20262 235.94869 0.7256734
C-A  -25.55832 -196.36413 145.24749 0.9941714
D-A  203.80312   20.26110 387.34514 0.0207431
E-A   41.04993 -159.09661 241.19648 0.9807637
C-B  -99.43136 -258.28853  59.42581 0.4290920
D-B  129.93008  -42.54789 302.40805 0.2398572
E-B  -32.82310 -222.87472 157.22851 0.9899165
D-C  229.36144   48.65517 410.06771 0.0048776
E-C   66.60826 -130.94103 264.15755 0.8892989
E-D -162.75319 -371.41264  45.90627 0.2081150

My question is: do I need to perform the ANOVA as control vs. treatment, or is the way I performed it correct? And how can I perform principal component regression on this data?

R • 2.8k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by adnanjaved1988 ▴ 80

What question are you trying to answer with this data? It's highly unlikely that the ANOVA you performed will correctly answer any biological question you'd be interested in asking.

ADD REPLY • link 10.1 years ago by Devon Ryan 104k

My question is: do I need to perform the ANOVA as control vs. treatment, or is the way I performed it correct? Do I need to exclude the control and just check the variability among treatments to see which are most significant?

A is the control and B, C, D and E are treatments. B is resistant to the drugs used for the treatments.

ADD REPLY • link updated 10.1 years ago by Devon Ryan 104k • written 10.1 years ago by adnanjaved1988 ▴ 80

If B, C, D and E are different treatments, which I assume is the case given what you've written, then you can't do an ANOVA (and the one you showed makes absolutely no sense...it's not even testing something coherent). Perhaps you can get limma to estimate dispersions in a group-blind manner and then use that in its linear model...but I expect the results will still be crappy. To be frank, you're largely wasting your time with this dataset.

ADD REPLY • link 10.1 years ago by Devon Ryan 104k

I got curious... Why do you say that anova makes no sense?

aov(m$value~m$Group) tests whether any of the "Group" means in miRNA value is different from another. Tukey's test says that D vs A and D vs C are different.

(Maybe the model could be improved by nesting the error since there are treatments within miRNA, but I don't see it as non-sensical; also, whether it makes biological sense I don't know).

If, on the other hand, adnanjaved1988 is interested in which miRNAs are different, then yes, there is no way to go about it, as there is no replication within treatments.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by dariober 15k

It makes no biological sense and is therefore nonsensical. Further, the background distribution is likely not even remotely Gaussian, and how should one even deal with the NAs in the dataset? (aov will just remove them, but that's not fair in this case, since we need to know why things are NA.) In the unlikely event that they're looking at, say, a Dicer knockout or knockdowns of various other components of the miRNA processing machinery, then asking generally about miRNA changes becomes more interesting. Then, however, the errors would need to be nested (or better yet, a different test used entirely, like measuring AUC for the miRNA peak on a Bioanalyzer on multiple samples and then doing statistics on that).
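To see how many values aov would silently drop, the missing values can be tabulated per sample before fitting. This is a minimal sketch, assuming a long data frame `m` with columns Group and value as in the question; the toy values and miRNA names are invented for illustration.

```r
# Toy long-format data standing in for 'm' (values are invented).
m <- data.frame(
  MiRNAs = rep(paste0("miR-", 1:4), times = 3),
  Group  = factor(rep(c("A", "B", "C"), each = 4)),
  value  = c(NA, 15.7, NA, 9.6,
             13.1, 52.9, 21.4, 44.6,
             5.5, 18.5, 6.0, NA)
)

# Number of missing values in each sample.
na_per_group <- tapply(is.na(m$value), m$Group, sum)
na_per_group

# aov() omits rows with NA by default, so those rows never enter the fit.
fit <- aov(value ~ Group, data = m)
n_used <- nobs(fit)
n_used
```

A count like this only shows how much is missing and where; it does not say why the values are NA, which is the point raised above.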

ADD REPLY • link 10.1 years ago by Devon Ryan 104k

Actually, the main problem is that for each treatment there is just one sample analyzed. Although each sample is "measured" ~2000 times all you can say is that, e.g., sample D is different from C but you can't generalize to saying "Treatment D != C" since the difference might be due to that particular sample prep or the array etc. This strongly limits (invalidates?) the biological relevance of the analysis, I agree.

About NAs, I would be less worried if they are sparse and random (which might be the case?) and non-normality might be curable.

@adnanjaved1988, for the record nesting can be specified like aov(m$value~m$Group + Error(m$Group / m$MiRNAs)); but again, be careful about the interpretation.
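As a rough illustration of an error term in aov, the sketch below treats each miRNA as a block, so that samples are compared within miRNAs. This is only one reading of "treatments within miRNA" and not necessarily the exact model meant above; the toy data are simulated, balanced and free of NAs, and the same caveat about interpretation applies.

```r
set.seed(1)

# Toy balanced data: 6 hypothetical miRNAs, each measured once in 3 samples.
m <- expand.grid(MiRNAs = paste0("miR-", 1:6), Group = c("A", "B", "C"))
m$value <- rnorm(nrow(m), mean = 10) + as.integer(m$MiRNAs)

# Error(MiRNAs) puts between-miRNA variation in its own stratum;
# the Group effect is then tested in the "Within" stratum.
fit <- aov(value ~ Group + Error(MiRNAs), data = m)
summary(fit)
```

With real data containing NAs the design is no longer balanced, and aov with an Error term may warn about that.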

ADD REPLY • link 10.1 years ago by dariober 15k

Excellent point.

ADD REPLY • link 10.1 years ago by Devon Ryan 104k

Hey Dariober, thanks for your comment.

The main purpose of this study is to see how miRNA expression levels change with the treatments applied. These samples are from patients of the institute where I work, and they want to see miRNA expression levels in exosomes of breast cancer patients.

The array they used had 2019 miRNAs, so they want to see which treatment (combination of drugs) causes differential expression of those miRNAs.

I have the parent cell line, which shows normal expression, and when they applied drugs to the other cell lines the expression in the other groups definitely changed; some showed a significantly large change in expression. So I am doing these tests to see which group differs from which.

My main role is to use specific miRNAs from my data set as biomarkers for cancer identification. And, no offense,

I don't know why it makes no sense to Mr Devon Ryan. By the way, in many posts I have seen him write this sentence ("It makes no sense") :D Anyway,

Best,
Adnan

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by adnanjaved1988 ▴ 80

Hey, thanks Dariober :)

Can you suggest how I can improve my model by nesting the error?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by adnanjaved1988 ▴ 80

Ram · Answer 1 · 2014-11-13

To handle the NAs, I first removed the rows of my data frame that had 5 or 4 NAs.

For the NAs in the remaining rows, I tried three methods to see how the results differ between them:

Assigning row means (which is OK but not very powerful, because you are not gaining any new information).

Finding a miRNA that is overall highly correlated with the miRNA that has the missing value, and deriving the replacement value from it. Example below...

miRNA-1 values: 1 2 3 NA 5
miRNA-2 values: 2 4 6 8 10

==> replace the missing value with 4, derived from the second miRNA

Using the "K-nearest neighbours" (KNN impute) method of imputation to deal with the NA values.
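The row filter and the first two methods can be sketched in base R on the toy example above. The matrix `x`, the sample labels A to E, and the all-but-one-missing third row are assumptions added for illustration, and using a linear fit to derive the value from the correlated miRNA is one possible way to do it, not necessarily what was actually done. KNN imputation is left out because it needs many more rows than a toy example has.

```r
# Toy matrix: rows are miRNAs, columns are samples.
# The first two rows are the example above; the third is invented.
x <- rbind(
  "miRNA-1" = c(1, 2, 3, NA, 5),
  "miRNA-2" = c(2, 4, 6, 8, 10),
  "miRNA-3" = c(NA, NA, NA, NA, 7)
)
colnames(x) <- c("A", "B", "C", "D", "E")

# Step 1: drop rows with 4 or 5 missing values.
x <- x[rowSums(is.na(x)) < 4, , drop = FALSE]

# Method 1: replace each NA with the mean of its own row.
row_mean_imputed <- x
idx <- which(is.na(x), arr.ind = TRUE)
row_mean_imputed[idx] <- rowMeans(x, na.rm = TRUE)[idx[, "row"]]

# Method 2: regress the incomplete miRNA on a highly correlated complete one
# and predict the missing entry from the fitted line.
d <- data.frame(y = x["miRNA-1", ], z = x["miRNA-2", ])
fit <- lm(y ~ z, data = d)  # rows with NA in y are dropped by default
cor_imputed <- x
cor_imputed["miRNA-1", "D"] <- predict(fit, newdata = d[is.na(d$y), , drop = FALSE])

row_mean_imputed["miRNA-1", "D"]  # 2.75
cor_imputed["miRNA-1", "D"]       # 4
```

The two results differ (2.75 versus 4) because the row mean ignores the pattern across samples, while the regression borrows it from the correlated miRNA.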