I want to cluster an RNA-seq time-series gene expression dataset from WT MCF10a cells, control vs. treated. A time-course stimulation was applied (0, 15, 40, 90, 180, 300 min), and all samples were prepared in triplicate. Dataset link: https://github.com/daniel-spies/rna-seq_tcComp/tree/master/data
I have looked at a clustering algorithm, DP_GP_cluster, which clusters genes by their expression over a time course using a Dirichlet process Gaussian process model. Its main drawback is that it takes data in the following format:
gene 1 2 3 ... time_t
gene_1 10 20 5 ... 8
gene_2 3 2 50 ... 8
gene_3 18 100 10 ... 22
...
gene_n 45 22 15 ... 60
where the first row is a header containing the time points and the first column is an index containing all gene names. Entries are delimited by tabs. However, my data has replicates at each time point, in the following format:
"TP1_1" "TP1_2" "TP1_3" "TP2_1" "TP2_2" "TP2_3" "TP3_1" "TP3_2" "TP3_3" "TP4_1" "TP4_2" "TP4_3"
"gene_1" 202 218 305 352 403 329 190 182 186 235 147 252
"gene_2" 15 14 15 2 3 4 0 0 0 10 47 15
"gene_3" 273 180 201 324 414 235 264 261 239 285 240 290
"gene_4" 868 875 986 944 946 898 1168 1020 731 740 834 917
"gene_5" 5 2 2 0 0 0 1 7 2 0 0 0
"gene_6" 683 662 789 1036 940 671 1004 731 660 1439 1102 1334
"gene_7" 19 20 23 38 34 44 63 42 48 40 37 54
Is there a technique for clustering time-series gene expression count data that takes all of its replicates into account?
You can either take the mean of each gene at each time point (across its replicates) and use that as input, or treat the three replicates as three different genes, run the clustering, and hope that all three replicates of a gene end up in the same cluster.
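For the first option, a minimal pandas sketch could look like the following. It assumes the column labels follow the "TP<k>_<replicate>" pattern shown in the question, and that the header row has one field fewer than the data rows (as in the excerpt), so pandas uses the gene names as the index. The names `mean_over_replicates` and `TIME_OF` are made up for illustration, and the mapping from TP1..TP4 to minutes is only a guess based on the listed time course; replace it with the real design.

```python
import io
import pandas as pd

# First two genes of the count table from the question (whitespace-delimited,
# header has one field fewer than the data rows).
RAW = '''"TP1_1" "TP1_2" "TP1_3" "TP2_1" "TP2_2" "TP2_3" "TP3_1" "TP3_2" "TP3_3" "TP4_1" "TP4_2" "TP4_3"
"gene_1" 202 218 305 352 403 329 190 182 186 235 147 252
"gene_2" 15 14 15 2 3 4 0 0 0 10 47 15
'''

# ASSUMPTION: hypothetical mapping from column prefix to time in minutes.
TIME_OF = {"TP1": 0, "TP2": 15, "TP3": 40, "TP4": 90}


def mean_over_replicates(counts: pd.DataFrame) -> pd.DataFrame:
    """Average the replicate columns of each time point.

    Columns are grouped by the part of the label before the last
    underscore, e.g. "TP1_2" -> "TP1".
    """
    by_timepoint = counts.T.groupby(
        lambda label: label.rsplit("_", 1)[0], sort=False
    )
    return by_timepoint.mean().T


# Because the header is one field short, the first column becomes the index.
counts = pd.read_csv(io.StringIO(RAW), sep=r"\s+")

means = mean_over_replicates(counts)

# Tab-delimited text with a numeric time header and gene names in column one,
# i.e. the layout described in the question.
tsv = means.rename(columns=TIME_OF).to_csv(sep="\t", index_label="gene")
print(tsv)
```

To write a file instead of printing, pass a path as the first argument of `to_csv`. Note that averaging discards the replicate-to-replicate variance, which is the trade-off of this option.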
But sir, wouldn't that be biologically inconsistent?