I have my data called count.mat
.
count.mat <- structure(list(Sample_1 = c(2968, 1701, 272, 249, 186, 1598,
98, 9, 563, 1003, 278, 242, 137, 625, 614, 494, 681, 475, 3,
89), Sample_2 = c(2057, 1237, 339, 688, 289, 1354, 265, 17, 402,
793, 343, 267, 389, 477, 838, 663, 741, 586, 9, 244), Sample_3 = c(5,
3, 5, 6, 2, 14, 2, 9, 4, 4, 3, 3, 1, 1, 2, 1, 16, 7, 3, 4), Sample_4 = c(5,
4, 1, 4, 1, 6, 1, 10, 9, 7, 2, 2, 2, 5, 4, 3, 3, 6, 1, 4), Sample_5 = c(8175,
4828, 2268, 3256, 1712, 4048, 1208, 10, 3166, 4315, 2536, 1610,
1531, 2794, 3976, 3030, 2427, 4788, 9, 1708), Sample_6 = c(4170,
2068, 506, 220, 190, 1070, 76, 10, 790, 1635, 264, 146, 67, 558,
771, 423, 481, 864, 2, 77), Sample_7 = c(15, 6, 4, 3, 2, 4, 2,
15, 5, 5, 2, 7, 2, 4, 7, 8, 3, 11, 22, 4), Sample_8 = c(15, 7,
2, 7, 4, 4, 1, 13, 5, 5, 4, 3, 3, 4, 3, 4, 2, 4, 5, 4), Sample_9 = c(132,
356, 41, 29, 24, 40, 14, 61, 40, 66, 29, 136, 18, 99, 66, 144,
55, 115, 8, 42), Sample_10 = c(1199, 614, 491, 757, 96, 417,
170, 47, 486, 452, 437, 189, 563, 481, 857, 1003, 266, 595, 3,
185), Sample_11 = c(6853, 4667, 4051, 5964, 2940, 6831, 2762,
107, 9957, 7058, 4048, 4023, 4366, 5235, 5234, 5380, 2904, 6320,
15, 5502), Sample_12 = c(6332, 4489, 2739, 5371, 2153, 2841,
1982, 63, 6032, 4417, 2025, 4630, 3264, 3666, 4629, 4601, 2240,
3408, 15, 4309), Sample_13 = c(16, 14, 10, 1, 2, 22, 1, 2, 5,
8, 3, 2, 7, 4, 6, 4, 9, 2, 2, 7), Sample_14 = c(12, 6, 5, 4,
2, 6, 2, 13, 9, 5, 4, 3, 4, 1, 9, 5, 2, 5, 2, 2), Sample_15 = c(6058,
3363, 1458, 2846, 1315, 3496, 1119, 2, 2028, 4566, 1097, 1135,
811, 1743, 2310, 2023, 1121, 2420, 10, 2224), Sample_16 = c(23362,
10663, 2875, 6581, 2264, 4756, 1907, 2, 4405, 4305, 4519, 4197,
2786, 3201, 9012, 4905, 3179, 3493, 1, 2829), Sample_17 = c(7548,
2659, 1018, 1363, 376, 7433, 704, 14, 1405, 2772, 1099, 788,
511, 796, 1721, 1239, 1137, 1519, 6, 606), Sample_18 = c(19,
6, 9, 5, 3, 18, 2, 9, 7, 12, 2, 5, 3, 5, 15, 7, 2, 8, 16, 3),
Sample_19 = c(25, 11, 1, 5, 1, 13, 4, 23, 13, 12, 3, 7, 2,
4, 6, 4, 2, 4, 8, 2), Sample_20 = c(7, 3, 2, 4, 1, 3, 2,
13, 3, 2, 1, 2, 2, 1, 5, 5, 1, 2, 6, 2)), row.names = c("AAAACAAGTTCTGAATCGTATA",
"AAAACAAGTTCTGAATCGTATAA", "AAGAAAAGATCATGTCCCAAGA", "AAGAAGAGAATCACAGCCGCCA",
"AAGAATCCTTCAAAGGGCCTAGA", "AAGCTGAAGTTTTGACATGTCA", "AATCACAGCCGCCAAGTCCTCTA",
"AATCAGTGAACGAAAGTTAGG", "AATCTAAAGTTTGTCCCATTA", "AATCTTTGACTACATTACGTAAA",
"AATGTCCAATGGAACGCCGTTA", "ACCATCTCCGTGTCTACAATCA", "ACTCCAGGCGAGCATGCAGTCCA",
"ACTCTTAGAGATGAATATTCCGA", "AGGGATCCTTATGTCCTAGCCA", "AGTACAAGTCTACGAGCCCATCGA",
"AGTCCAATGGATCTTTCACTCGA", "AGTCCTTGCCAATGAGGCTTTTA", "AGTGAACGAAAGTTAGGGGAT",
"AGTGCTAGTTCCCAACGTCGTTGA"), class = "data.frame")
I have my conditions as:
sample.type <-
as.factor (
c(
"Ago2_SsHV2L",
"Ago2_SsHV2L",
"Ago2_VF",
"Ago2_VF",
"Ago4_SsHV2L",
"Ago4_SsHV2L",
"Ago4_VF",
"Ago4_VF",
"Dcl1_SsHV2L",
"Dcl1_SsHV2L",
"Dcl2_SsHV2L",
"Dcl2_SsHV2L",
"SlaGem_2mer",
"SlaGem_2mer",
"WTDK3_SsHV2L",
"WTDK3_SsHV2L",
"WTDK3_SsHV2L",
"WTDK3_VF",
"WTDK3_VF",
"WTDK3_VF"
)
)
infection.status <-
as.factor(
c(
"SsHV2L",
"SsHV2L",
"VF",
"VF",
"SsHV2L",
"SsHV2L",
"VF",
"VF",
"SsHV2L",
"SsHV2L",
"SsHV2L",
"SsHV2L",
"SlaGem_2mer",
"SlaGem_2mer",
"SsHV2L",
"SsHV2L",
"SsHV2L",
"VF",
"VF",
"VF"
)
)
cond <- data.frame(sample.type,infection.status, stringsAsFactors=TRUE)
cond <- as.data.frame(cond)
Deseq Analysis:
library(dplyr)
# Doing for infection.status
dds <- DESeqDataSetFromMatrix(countData=count.mat,colData=cond,design=~infection.status)
dds
dds$infection.status <- relevel(dds$infection.status, ref = "VF")
dds = DESeq(dds, fitType = "parametric")
my.names<-resultsNames(dds)
Since I wanted to compare the DE between SsHV2L_vs_VF, I have done this:
res<- results(dds, contrast=list(c("infection.status_SsHV2L_vs_VF" )), test="Wald")
res <- as.data.frame(res)
res <- res[with(res, order(padj)), ] %>% select (colnames(res))
However, the result does not look correct. The VF samples have less than 20 counts in all samples and SsHV2L samples have more than 1000 counts. If you look at the padj, the pvalues are significant only for the reads with the fewer counts in both VF and SsHV2L samples. I was wondering why it's not significant for the reads with 1000s of counts in SsHV2L columns and fewer in VF. My data have only 130 rows and I was wondering if that would be the reason causing this problem. Can someone please help me if I am doing it right? Thanks!
Which type of data is this?
This is smallRNA seq data.
A quick search reveals a few posts on Bioconductor where the DESeq/DESeq2 developers seem to shy away from recommending DESeq for such data. I saw replies from Michael, Simon, and Wolfgang. Your data does follow a negative binomial, though (I checked via histogram).
These are obviously raw counts that you have? You could try some rudimentary median-based normalisation strategy and then test each gene independently via negative binomial regression using
glm.nb()
. I presume that you exhaustively searched for other methods.@Kevin Blighe Thanks Kevin for your answer. Yes, I wanted to look for miRNA-like reads with differential expression between samples. I have seen some papers using DESeq for these reads and wanted to give it a try.
In which journals were they? They may have only used DESeq2 for, for example, size factor calculation (?).