Hi,
I am attempting to analyze some RNAseq data with respect to a couple of continuous variables. I am using DESeq2 in R to do this, but am running into a problem. Below is the code I am using:
dds <- DESeqDataSetFromHTSeqCount(samples, directory=".", design=~fast)
dds <- DESeq(dds)
res <- results(dds)
This has worked fine for some of my other continuous variables, giving me a nice list of genes that change with respect to the variable in question. However, for 2 of the variables, there seems to be one sample that is completely skewing the data, as you can see below:
count fast
2 5.194365 65.25974
4 8.032771 65.79634
5 10.929044 35.18518
6 3.501335 63.21429
7 13.352367 53.29342
10 8.261876 59.53079
14 20.103149 45.50562
16 6.315940 64.55331
17 10.014749 53.15985
19 7.377103 46.86469
24 5.593491 58.26772
26 11.172046 67.38461
27 9.525122 62.40000
31 2.556560 76.26373
33 3.521462 61.88679
39 5.633191 58.42697
40 5.482473 54.71698
1 10.567494 55.12144
12 7.319713 49.79920
13 4.362853 53.90836
15 12.794649 76.51869
18 9.682205 55.38462
20 6.072752 64.04494
22 12.648017 61.78660
23 3.287383 55.73034
25 10.516274 24.82269
28 20.266891 39.63636
29 2.744838 74.68750
30 14.990684 55.26316
32 9.224983 51.36364
34 4.702022 45.55874
35 3.972492 48.58657
36 2.542509 51.83246
37 7.500402 44.91228
3 6.942850 74.48649
8 4.000244 63.24786
9 10.290107 67.70187
11 1.928383 61.51079
21 7.866473 76.54958
38 108.088894 11.69065
As you can see, the count for the very last sample is much greater than the rest and is skewing the data; judging by the rest of the values, this gene shouldn't come out as differentially expressed. If I remove this sample, I get the same issue, but with another sample being the problem. This example isn't too extreme, but for some genes the values are all around 10 and then one sample is in the thousands, which skews the analysis.
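For anyone who wants to run the same kind of per-gene check, here is a rough sketch of how a table like the one above can be pulled out and how the per-sample Cook's distances that DESeq2 stores can be inspected. It runs on simulated data from makeExampleDESeqDataSet, not my real counts: the `fast` values are randomly generated placeholders, and picking the gene with the smallest p-value is just an arbitrary choice for illustration.

```r
library(DESeq2)

set.seed(1)
# Simulated counts (assumption: stands in for the real HTSeq count files)
dds <- makeExampleDESeqDataSet(n = 500, m = 12)

# Hypothetical continuous covariate, standing in for the percentage of fast fibres
colData(dds)$fast <- runif(ncol(dds), min = 10, max = 80)
design(dds) <- ~ fast

dds <- DESeq(dds)
res <- results(dds)

# Pick one gene to inspect: here, the one with the smallest p-value
top <- rownames(res)[which.min(res$pvalue)]

# Normalized counts (plotCounts adds a small pseudocount by default)
# next to the covariate, one row per sample
d <- plotCounts(dds, gene = top, intgroup = "fast", returnData = TRUE)
print(d[order(d$count), ])

# Cook's distance of each sample for the same gene; one value far above
# the others would point at a single sample dominating the fit
cooks <- assays(dds)[["cooks"]][top, ]
print(sort(cooks, decreasing = TRUE))
```

On real data, `dds` would come from the DESeqDataSetFromHTSeqCount call above instead of the simulated object, and `top` would be whichever gene looks suspicious.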
Phenotype table:
files slow fast status
1 2.counts 34.74026 65.25974 con
2 4.counts 34.20366 65.79634 con
3 5.counts 64.81481 35.18518 con
4 6.counts 36.78571 63.21429 con
5 7.counts 46.70658 53.29342 con
6 10.counts 40.46921 59.53079 con
7 14.counts 54.49438 45.50562 con
8 16.counts 35.44669 64.55331 con
9 17.counts 46.84015 53.15985 con
10 19.counts 53.13531 46.86469 con
11 24.counts 41.73228 58.26772 con
12 26.counts 32.61538 67.38461 con
13 27.counts 37.60000 62.40000 con
14 31.counts 23.73626 76.26373 con
15 33.counts 38.11321 61.88679 con
16 39.counts 41.57303 58.42697 con
17 40.counts 45.28302 54.71698 con
18 1.counts 44.87856 55.12144 pre
19 12.counts 50.20080 49.79920 pre
20 13.counts 46.09164 53.90836 pre
21 15.counts 23.48131 76.51869 pre
22 18.counts 44.61538 55.38462 pre
23 20.counts 35.95506 64.04494 pre
24 22.counts 38.21340 61.78660 pre
25 23.counts 44.26966 55.73034 pre
26 25.counts 75.17731 24.82269 pre
27 28.counts 60.36364 39.63636 pre
28 29.counts 25.31250 74.68750 pre
29 30.counts 44.73684 55.26316 pre
30 32.counts 48.63636 51.36364 pre
31 34.counts 54.44126 45.55874 pre
32 35.counts 51.41343 48.58657 pre
33 36.counts 48.16754 51.83246 pre
34 37.counts 55.08772 44.91228 pre
35 3.counts 25.51351 74.48649 sarc
36 8.counts 36.75214 63.24786 sarc
37 9.counts 32.29814 67.70187 sarc
38 11.counts 38.48921 61.51079 sarc
39 21.counts 23.45041 76.54958 sarc
40 38.counts 88.30935 11.69065 sarc
Is there any way around this? I have tried looking into it but can't figure out how to overcome this problem.
Thanks for any help
Which is the "very last sample"? "fast" seems to be a condition here. How many samples do you have? What is your design? Did you do any exploratory analysis, such as PCA?
Thanks for the reply. The 'very last sample' just happens to be the last sample in the data frame above, sample 38. Fast is indeed the variable; it is the percentage of fast fibres in a muscle sample. In total there are 40 samples, grouped by disease status: the first 17 are controls, the next 17 are disease 1 and the last 6 are disease 2. I have done a categorical analysis to look at differences between the groups, but now want to analyse continuous variables, fast in this case.
I have done PCA, plotted heatmaps of the most variable genes, boxplots, etc., and there is nothing out of the ordinary between the samples. The samples do not group on the PCA plot, but none of the individual samples stands out as an outlier.
Thanks
Can you add a phenotype table to your post, to illustrate your experimental design?
I have added it to the original post. Hopefully that'll make things a bit clearer
Sorry for posting this, but does anyone have any ideas on what I can do with this? Thanks
Editing your original post "bumps" it up to the front page. No need to "Submit Answer" to achieve the same result.