Hi all,
I'm starting a project in which I'm going to gather several public human RNA-seq datasets to perform a differential expression meta-analysis. The objective is to analyse multiple studies together to detect a signal that wouldn't be found in any individual study because of its low number of samples.
I will end up with a lot of samples (>500), and since I'm not a statistician I'm wondering what issues I might face with this high number.
Should I expect to gain power with a large meta-dataset? Or will mixing several studies introduce too many confounding effects?
Is there some threshold on the number of samples I should gather? Maybe adding more and more will just bring noise and make the analysis more difficult?
In your opinion, is a tool like DESeq2 suited to analyze this kind of large dataset? Or should I use another type of approach to detect differentially expressed genes?
Thank you in advance for any input on this.
Just keep in mind that whenever you mix different datasets together, you will have batch effects and other confounding effects, which are sometimes impossible to resolve. So at least try to retrieve datasets that are very similar to each other in terms of sequencing technology, instrument, read length, etc., and of course biological condition.
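One quick way to see whether a batch effect can be resolved at all is to check, before any modelling, which studies contain both conditions. A study that contains only one condition offers no within-study comparison, so its study effect cannot be told apart from the condition effect using that study alone. Below is a minimal sketch in plain Python; the function name `confounding_report`, the study names, and the condition labels are all made up for illustration and are not taken from any real dataset or library.

```python
from collections import defaultdict


def confounding_report(samples):
    """Split studies by whether they contain more than one condition.

    `samples` is an iterable of (study, condition) pairs, one per sample.
    Returns two sorted lists: studies with several conditions, and
    studies with a single condition (no within-study contrast).
    """
    conditions_by_study = defaultdict(set)
    for study, condition in samples:
        conditions_by_study[study].add(condition)

    mixed = sorted(s for s, c in conditions_by_study.items() if len(c) > 1)
    single = sorted(s for s, c in conditions_by_study.items() if len(c) == 1)
    return mixed, single


# Hypothetical sample sheet: (study, condition) for each sample.
samples = [
    ("study_A", "control"), ("study_A", "disease"),
    ("study_B", "control"), ("study_B", "disease"),
    ("study_C", "disease"), ("study_C", "disease"),
]

mixed, single = confounding_report(samples)
print("studies with both conditions:", mixed)
print("studies with one condition only:", single)
```

In this made-up example, `study_C` has only disease samples, so it would be a candidate to drop or to treat with extra caution. The same cross-tabulation can be repeated for instrument, read length, or library type.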
OK, thanks. I will keep those constraints in mind.