I am analyzing RNA-Seq (or other gene expression) data from approximately 1000 different samples, which gives a matrix of dimension 20000 (no. of genes) x 1000 (no. of samples), where entry (i, j) is the expression of gene i in sample j.
My aim is to predict the gene expression in each of these samples. For that purpose I would do a 10-fold cross-validation (CV) on the samples, i.e., I split the samples into 10 chunks of 10% each and predict the gene expression values for a particular sample with a model fitted on the 9 chunks that do not contain that sample.
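To make the splitting scheme concrete, here is a minimal sketch of the sample-wise 10-fold split I have in mind. It uses only NumPy; the function name `kfold_sample_splits` and the fixed seed are purely illustrative and not part of any actual pipeline.

```python
import numpy as np

def kfold_sample_splits(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs over sample (column) indices."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)        # shuffle the sample indices
    folds = np.array_split(perm, n_folds)    # 10 chunks of ~10% each
    for k in range(n_folds):
        test_idx = folds[k]
        # The model is fitted on the 9 chunks that do not contain the test samples.
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, test_idx

n_samples = 1000  # number of samples (columns of the expression matrix)
splits = list(kfold_sample_splits(n_samples))
```

Each sample ends up in exactly one test chunk, and the split ignores which cell line a sample belongs to.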
Now, several of the samples are replicates of the same cell line. Hence it might be that I predict the gene expression values of a sample (cell line) with a model fitted on other samples, some of which are replicates of that same cell line.
Is this conceptually correct or not? Additionally, it might be worth adding that the correlations between replicate samples (over all genes) are in a similar range as most other correlations between any two samples (0.4-0.9).
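To illustrate the concern, the sketch below measures how often a test sample has a replicate in the training set under a plain sample-wise split, and contrasts that with a split that keeps all replicates of a cell line in the same chunk. Everything here is an assumption for illustration: the labels (250 cell lines with 4 replicates each), the helper names `replicate_leakage` and `group_kfold_splits`, and the seed. I do not know whether the grouped split is the right answer; it is simply the alternative I am comparing against.

```python
import numpy as np

def replicate_leakage(train_idx, test_idx, groups):
    """Fraction of test samples whose cell line also appears in the training set."""
    train_groups = np.unique(groups[train_idx])
    return float(np.isin(groups[test_idx], train_groups).mean())

def group_kfold_splits(groups, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) with every cell line kept inside a single chunk."""
    rng = np.random.default_rng(seed)
    unique_groups = rng.permutation(np.unique(groups))
    for fold_groups in np.array_split(unique_groups, n_folds):
        test_mask = np.isin(groups, fold_groups)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical labels: 250 cell lines with 4 replicates each (1000 samples).
cell_line = np.repeat(np.arange(250), 4)

# One chunk of a plain sample-wise split: 10% test, 90% train.
rng = np.random.default_rng(0)
perm = rng.permutation(cell_line.size)
test_idx, train_idx = perm[:100], perm[100:]
naive_leak = replicate_leakage(train_idx, test_idx, cell_line)

# The same measure for every chunk of the cell-line-wise split.
grouped_leaks = [replicate_leakage(tr, te, cell_line)
                 for tr, te in group_kfold_splits(cell_line)]
```

With these made-up labels, almost every test sample in the sample-wise split has at least one replicate in the training set, whereas the cell-line-wise split has none by construction.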
Many thanks for your help in advance!