I am building a resource to estimate gene expression levels across many plant tissues using RNA-Seq data. I have collected datasets of different experimental samples from GEO and other sources. Using HTSeq, I estimate the counts for each sample (i.e., samples from different experiments). Finally, I merge all the datasets into a single source, so that the expression level of a gene can be viewed across all samples (using a heatmap of the count data). However, I am concerned about the validity of my method. Could anyone comment on my strategy?
I have two specific questions:
- Is it valid to merge the data, given that the different experiments may have a 'batch effect'?
- If it is OK to merge the samples, should I use the HTSeq count data or FPKM for the heatmap?
Thanks
What do you mean by merging samples?
Generally, it should be OK to take different GEO datasets and compare them, provided they have a similar type of experimental design and differ in conditions/cell lines.
How much does the number of reads per sample vary across the different samples?
You need to normalise the data before you plot any heatmaps.
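As a minimal sketch of what that could look like, the snippet below checks the total reads per sample and then scales raw counts to log2 counts per million (CPM). CPM is only one simple choice of normalisation, picked here for illustration; it adjusts for sequencing depth but not for gene length or batch effects. The gene names, the toy counts, and the helper names `library_sizes` and `log2_cpm` are all made up for this example.

```python
import math

def library_sizes(counts):
    """Total reads per sample; counts maps gene -> list of per-sample counts."""
    n_samples = len(next(iter(counts.values())))
    return [sum(row[j] for row in counts.values()) for j in range(n_samples)]

def log2_cpm(counts, pseudocount=1.0):
    """Scale each sample to counts per million, then log2-transform.

    The pseudocount (an assumed value of 1) avoids taking log2 of zero.
    """
    sizes = library_sizes(counts)
    return {
        gene: [math.log2(c / size * 1e6 + pseudocount) for c, size in zip(row, sizes)]
        for gene, row in counts.items()
    }

# Toy raw counts for two samples; sample 2 was sequenced twice as deeply.
counts = {
    "geneA": [100, 200],
    "geneB": [300, 600],
    "geneC": [600, 1200],
}

sizes = library_sizes(counts)  # per-sample read totals
norm = log2_cpm(counts)        # depth-normalised values, usable for a heatmap
```

In this toy case the raw counts differ two-fold between the samples, but the normalised values are the same, because the difference comes entirely from sequencing depth. Differences between experiments that remain after this step may still reflect batch effects rather than biology.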