I have two datasets, one with ~250 samples and another with 7 samples. Both datasets consist of RPKM values computed from human RNA-Seq. I don't have access to the primary read files.
Is there a good way to batch-correct these datasets so that I can combine them and scan for expression signatures? To compare samples, I'm currently using an algorithm that takes the geometric mean of the RPKM values for the group of genes belonging to a specific signature. The problem is that the RPKM values in the ~250-sample dataset are on average much higher than those in the 7-sample dataset.
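To make the scoring step concrete, here is a minimal sketch of a geometric-mean signature score of the kind described above. It is an illustration under stated assumptions, not the exact algorithm in use: the function name `signature_score`, the genes-by-samples matrix layout, and the `pseudocount` parameter (added so that zero RPKM values do not collapse the geometric mean to zero) are all hypothetical.

```python
import numpy as np

def signature_score(rpkm, gene_names, signature_genes, pseudocount=1.0):
    """Per-sample geometric mean of RPKM over the genes in one signature.

    rpkm            : 2-D array-like, rows = genes, columns = samples (assumed layout)
    gene_names      : gene identifiers, one per row of rpkm
    signature_genes : gene identifiers that make up the signature
    pseudocount     : hypothetical offset so zero RPKM values can be log-transformed
    """
    row_of = {gene: i for i, gene in enumerate(gene_names)}
    # Keep only signature genes that are actually present in the matrix.
    rows = [row_of[g] for g in signature_genes if g in row_of]
    if not rows:
        raise ValueError("none of the signature genes are present in the matrix")
    values = np.asarray(rpkm, dtype=float)[rows, :] + pseudocount
    # Geometric mean = exp(mean(log(x))), computed down each sample column.
    return np.exp(np.log(values).mean(axis=0)) - pseudocount


# Toy example: three genes, two samples; the signature uses genes A and B.
toy_rpkm = [[1.0, 4.0],
            [4.0, 16.0],
            [100.0, 100.0]]
scores = signature_score(toy_rpkm, ["A", "B", "C"], ["A", "B"], pseudocount=0.0)
```

Because the score is a mean on the log scale, a dataset-wide multiplicative shift in RPKM moves every sample's score in that dataset by the same factor, which is why the difference in overall RPKM level between the two datasets carries straight through to the scores.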
I've used ComBat in the past for the same predicament but with microarray expression data, and it worked perfectly. I'm looking for something analogous for RPKM expression data.
What are you going to do with the combined data? If you are going to do differential expression analysis, what are the groups?