Dear Community,
I am struggling with a problem: I have RNAseq data from 24 samples, along with survival data for the same samples. What I want is to identify the most interesting genes based on survival analysis and clustering.
The sample data that I have is
RNAseq (read counts):
Gene S1_cellTr S2_cellTr S3_cellTr S4_cellTr S5_cellTr S6_cellTr S7_cellTr S8_cellTr S9_cellTr S10_cellTr S11_cellTr S12_cellTr S13_cellT S14_cellT S15_cellT S16_cellT S17_cellT S18_cellT S19_cellT S20_cellT S21_cellT S22_cellT S23_cellT S24_cellT
gene1 725 230 2344 657 243 246 290 868 1722 1534 86 332 174 812 101 310 530 820 1380 200 520 1416 548 196
gene2 7 18 1 5 2 31 48 7 1 63 1 1 0 0 1 12 1 17 66 2 78 12 76 118
gene3 426 242 854 1490 336 308 20929 3515 858 1205 498 2941 959 555 113 185 1295 5579 9828 173 721 385 468 20169
gene4 11 43 1 110 19 10 3 95 3 86 167 11 274 3779 25 7 2 69 220 16 548 11 38 131
gene5 1567 1392 1224 2317 731 1436 1213 6124 5214 1861 416 1145 2666 2314 408 2939 1108 2178 4357 1699 3199 1462 1623 2056
gene6 1055 1695 209 502 1408 922 738 164 3699 700 589 31 1655 1351 481 2212 645 2023 2932 755 1278 937 193 229
gene7 77 596 185 248 40 145 396 62 437 678 128 103 47 2323 178 106 49 131 1797 110 329 125 244 64
gene8 130 415 1369 518 28 604 1693 311 961 383 959 610 1831 194 562 165 5 2228 1135 593 436 47 34 1170
gene9 8191 2975 3032 3497 3317 1682 3205 5322 13686 6487 2398 3127 2729 4431 1931 8238 2670 10236 10720 3501 11154 6477 14769 7201
gene10 1043 655 917 859 530 457 502 1447 1160 837 259 369 569 2930 412 1911 296 764 1096 722 1266 477 708 920
gene11 70 68 13 256 198 46 1443 1011 154 59 19 119 91 381 109 103 40 95 163 80 533 62 29 920
gene12 1404 755 3237 1653 2719 1460 958 11393 6973 2901 1853 2843 38 4402 411 614 3146 1829 2721 1600 464 3920 3094 2677
gene13 1115 1667 979 1791 424 878 1560 2180 3395 1262 924 1204 4778 1342 476 1779 1571 1827 2810 416 2524 828 1719 1617
gene14 225 2017 687 206 167 260 1519 157 396 365 88 93 122 1105 197 54 132 1944 1562 97 381 765 40 184
gene15 11 60 22 40 70 107 306 10 16 18 13 19 252 9 8 370 10 315 191 64 66 8 33 134
Clinical data
Sample AGE CLINICAL_STAGE DAYS_TO_BIRTH DAYS_TO_COLLECTION DAYS_TO_DEATH DAYS_TO_INITIAL_PATHOLOGIC_DIAGNOSIS DAYS_TO_LAST_FOLLOWUP DFS_MONTHS DFS_STATUS OS_MONTHS OS_STATUS
S1_cellTr 55 Stage IIB -20114 903 NA 0 1005 39.26 DiseaseFree 39.26 LIVING
S2_cellTr 80 Stage IB -29420 349 NA 0 377 22.54 DiseaseFree 22.54 LIVING
S3_cellTr NA NA 957 NA NA NA NA NA
S4_cellTr 74 Stage IIIC2 -27317 332 50 0 NA NA 1.64 DECEASED
S5_cellTr 60 Stage IA -22169 2772 1106 0 NA 9.92 Recurred/Progressed 36.33 DECEASED
S6_cellTr 72 Stage IVB -26556 199 NA 0 22 0.72 DiseaseFree 0.72 LIVING
S7_cellTr 60 Stage IIIA -22010 323 NA 0 210 25.99 Recurred/Progressed 32.69 DECEASED
S8_cellTr 71 Stage IA -26044 253 NA 0 248 11.5 Recurred/Progressed 18.4 LIVING
S9_cellTr 57 Stage II -20841 158 NA 0 275 22.63 DiseaseFree 22.63 LIVING
S10_cellTr 68 Stage IIIC1 -25063 878 NA 0 1026 13.37 Recurred/Progressed 36.1 DECEASED
S11_cellTr 70 Stage IIB -25812 274 NA 0 337 23.03 DiseaseFree 23.03 LIVING
S12_cellTr 66 Stage IB -24260 975 NA 0 1095 62.98 Recurred/Progressed 77.27 DECEASED
S13_cellT 80 Stage IB -29234 241 NA 0 469 27.14 DiseaseFree 27.14 LIVING
S14_cellT 74 Stage IB -27077 NA NA 0 239 22.47 Recurred/Progressed 30.98 DECEASED
S15_cellT 75 Stage IIIC1 -27581 156 NA 0 257 15.8 DiseaseFree 15.8 LIVING
S16_cellT 70 Stage IB -25573 1416 NA 0 107 48.75 DiseaseFree 48.75 LIVING
S17_cellT 47 Stage IA -17308 128 NA 0 241 31.96 DiseaseFree 31.96 LIVING
S18_cellT 56 Stage IB -20797 715 NA 0 774 45.73 DiseaseFree 45.73 LIVING
S19_cellT 83 Stage II -30351 101 NA 0 52 15.6 DiseaseFree 15.6 LIVING
S20_cellT 60 Stage IA -22098 1067 NA 0 1064 68.33 DiseaseFree 68.33 LIVING
S21_cellT 53 Stage IV -19448 254 72 0 NA 1.97 Recurred/Progressed 2.37 DECEASED
S22_cellT 61 Stage IIIC -22378 86 NA 0 371 34.95 DiseaseFree 34.95 LIVING
S23_cellT 65 Stage IA -23861 137 NA 0 23 15.77 Recurred/Progressed 18.07 LIVING
S24_cellT 82 Stage IIIC2 -30095 70 NA 0 166 12.58 DiseaseFree 12.58 LIVING
This is just a small subset of the data. The full dataset has 80 genes and more than 400 samples.
Has anyone done this before who could guide me through it, for example with an R script?
Many thanks in advance.
What are your covariates?
This is all I have. Maybe overall survival (OS) can be used; I am very new to this kind of analysis.
When you are looking for significant genes, you need one or more conditions (factors) to compare against, e.g. treated vs untreated, normal vs proband, male vs female, a time series, or a combination of these.
These are all female samples and I don't have normals. One thing I can think of is using the median expression of each gene to compare deceased and living patients (maybe).
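One way to make the median idea concrete, as a rough sketch and not a definitive recipe: split the samples into "high" and "low" groups at the median count of a single gene, then compare overall survival between the two groups with a log-rank test. The sketch below uses gene1 and the OS_MONTHS / OS_STATUS columns from the tables above, written in plain Python with no third-party packages. The helper names (`logrank`, `os_columns`) are made up for illustration, and it assumes that raw counts are acceptable for the split; in practice the counts would normally be normalized first, and that step is left out here. In R, the same test is available as `survdiff` in the survival package.

```python
import math
from statistics import median

# (gene1 read count, OS_MONTHS, OS_STATUS) copied from the tables in the question.
# S3_cellTr is omitted because its survival fields are NA.
samples = {
    "S1_cellTr": (725, 39.26, "LIVING"),
    "S2_cellTr": (230, 22.54, "LIVING"),
    "S4_cellTr": (657, 1.64, "DECEASED"),
    "S5_cellTr": (243, 36.33, "DECEASED"),
    "S6_cellTr": (246, 0.72, "LIVING"),
    "S7_cellTr": (290, 32.69, "DECEASED"),
    "S8_cellTr": (868, 18.4, "LIVING"),
    "S9_cellTr": (1722, 22.63, "LIVING"),
    "S10_cellTr": (1534, 36.1, "DECEASED"),
    "S11_cellTr": (86, 23.03, "LIVING"),
    "S12_cellTr": (332, 77.27, "DECEASED"),
    "S13_cellT": (174, 27.14, "LIVING"),
    "S14_cellT": (812, 30.98, "DECEASED"),
    "S15_cellT": (101, 15.8, "LIVING"),
    "S16_cellT": (310, 48.75, "LIVING"),
    "S17_cellT": (530, 31.96, "LIVING"),
    "S18_cellT": (820, 45.73, "LIVING"),
    "S19_cellT": (1380, 15.6, "LIVING"),
    "S20_cellT": (200, 68.33, "LIVING"),
    "S21_cellT": (520, 2.37, "DECEASED"),
    "S22_cellT": (1416, 34.95, "LIVING"),
    "S23_cellT": (548, 18.07, "LIVING"),
    "S24_cellT": (196, 12.58, "LIVING"),
}


def logrank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi2, p_value) for 1 degree of freedom."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    obs_minus_exp = 0.0
    variance = 0.0
    # Walk over the distinct times at which at least one death occurred.
    for t in sorted({ti for ti, e, _ in data if e}):
        n = sum(1 for ti, _, _ in data if ti >= t)                # at risk, both groups
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)   # at risk, group A
        d = sum(1 for ti, e, _ in data if ti == t and e)          # deaths at t
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def os_columns(names):
    """Return (times, events) for the given samples; event is True for DECEASED."""
    times = [samples[s][1] for s in names]
    events = [samples[s][2] == "DECEASED" for s in names]
    return times, events


# Median split on gene1: "high" means strictly above the median count.
cutoff = median(count for count, _, _ in samples.values())
high = [s for s, (count, _, _) in samples.items() if count > cutoff]
low = [s for s in samples if s not in high]

chi2, p = logrank(*os_columns(high), *os_columns(low))
print(f"cutoff={cutoff}, high={len(high)}, low={len(low)}, chi2={chi2:.3f}, p={p:.3f}")
```

With only 23 usable samples and 7 deaths, a test like this has very little power, so it is better read as a template for the full dataset of 400+ samples. Repeating it for all 80 genes would also call for a multiple-testing correction on the resulting p-values.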