5.6 years ago
hAjmal
Hi, I am using TCGAbiolinks to do DE analysis of genes in the breast cancer TCGA data. The numbers of samples in the normal and tumor groups are different. Is it possible to do DE analysis with a different number of samples in each group? Please guide.
Hi hAjmal, how many is "different"? Please give some numbers.
112 normal and 1100 tumor samples
I would guess (not being a statistician) that the dispersion estimates at these high sample numbers should be sufficiently stable regardless of the uneven group sizes. You can of course run several analyses with subsets of the tumor group and see whether the results are stable when subsampling.
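A minimal sketch of that subsampling check, in plain Python. It only draws the tumor subsets and scores how much the resulting significant-gene lists overlap; the DE step itself is left to whatever tool you use. The function names (`draw_tumor_subsets`, `jaccard`, `pairwise_stability`) and the subset size of 112 are my own choices for illustration, not anything from TCGAbiolinks.

```python
import random
from itertools import combinations


def draw_tumor_subsets(tumor_ids, subset_size, n_rounds, seed=0):
    """Draw n_rounds random tumor subsets (without replacement within a subset)."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    return [sorted(rng.sample(tumor_ids, subset_size)) for _ in range(n_rounds)]


def jaccard(a, b):
    """Jaccard index of two gene collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)


def pairwise_stability(gene_sets):
    """Jaccard index for every pair of significant-gene lists."""
    return [jaccard(a, b) for a, b in combinations(gene_sets, 2)]


# Hypothetical sample IDs: 1100 tumors, subsets sized to match 112 normals.
tumor_ids = [f"T{i}" for i in range(1100)]
subsets = draw_tumor_subsets(tumor_ids, subset_size=112, n_rounds=5)
# Run your DE analysis once per subset (against all normals), collect the
# significant genes from each run, then pass those lists to pairwise_stability.
```

Jaccard values close to 1 across the pairs would suggest the gene list does not depend much on which tumors were drawn.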
If these are a randomly selected set of normal and tumour samples then it's fine. The issue I suspect you may have here is that the "normal" samples may actually be non-tumour tissue from a tumour-proximal site in a subset of the cancer patients; if so, you will need to match the patient-derived samples, i.e. pair each normal with the tumour from the same patient.
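A rough sketch of how the matching could be done from TCGA barcodes alone. It relies on the barcode layout: the first 12 characters identify the patient, and the two-digit sample type code that follows is `01` for primary tumour and `11` for solid tissue normal. The function name `match_tumor_normal` and the barcodes in the example are made up for illustration, and taking the first sample per patient is a simplifying assumption.

```python
from collections import defaultdict


def match_tumor_normal(barcodes):
    """Return {patient_id: (tumor_barcode, normal_barcode)} for patients with both."""
    by_patient = defaultdict(lambda: {"tumor": [], "normal": []})
    for bc in barcodes:
        patient = bc[:12]        # e.g. "TCGA-XX-0001"
        sample_type = bc[13:15]  # "01" = primary tumour, "11" = solid tissue normal
        if sample_type == "01":
            by_patient[patient]["tumor"].append(bc)
        elif sample_type == "11":
            by_patient[patient]["normal"].append(bc)
    # Keep only patients with at least one sample of each kind.
    # Assumption: if a patient has several, the first one seen is used.
    return {
        patient: (s["tumor"][0], s["normal"][0])
        for patient, s in by_patient.items()
        if s["tumor"] and s["normal"]
    }


# Illustrative (not real) barcodes.
barcodes = [
    "TCGA-XX-0001-01A-11R-0000-07",
    "TCGA-XX-0001-11A-11R-0000-07",
    "TCGA-XX-0002-01A-11R-0000-07",  # tumour only, so this patient is dropped
]
pairs = match_tumor_normal(barcodes)
```

The resulting pairs can then be analysed with the patient as a blocking factor, so each tumour is compared against its own normal.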
As you have far fewer normal samples than tumor samples, you may get a biased result. Increase the number of normals or reduce the number of cases. Equal-sized groups are always preferred.
The number of samples is not a concern here; the batch effect is. Your best option is to match tumor and healthy samples. Another option is to add the batch as a covariate; let's hope the tumor and healthy samples are not grouped into separate batches.
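Before adding batch as a covariate, it is worth checking whether batch and condition are confounded. A small sketch in plain Python: it cross-tabulates batch against condition and lists the batches that contain both groups. What counts as the "batch" (for TCGA, the plate ID from the barcode is a common choice) is an assumption you need to make for your data, and the function names and labels below are invented for the example.

```python
from collections import Counter, defaultdict


def crosstab(batches, conditions):
    """Count samples per (batch, condition) as {batch: {condition: n}}."""
    table = defaultdict(Counter)
    for batch, condition in zip(batches, conditions):
        table[batch][condition] += 1
    return {batch: dict(counts) for batch, counts in table.items()}


def mixed_batches(batches, conditions):
    """Batches that contain more than one condition (usable to separate the two effects)."""
    seen = defaultdict(set)
    for batch, condition in zip(batches, conditions):
        seen[batch].add(condition)
    return sorted(batch for batch, conds in seen.items() if len(conds) > 1)


# Hypothetical batch labels, one per sample, in the same order as the conditions.
batches = ["P1", "P1", "P2", "P2", "P3"]
conditions = ["tumor", "normal", "tumor", "tumor", "normal"]

table = crosstab(batches, conditions)
shared = mixed_batches(batches, conditions)  # only "P1" holds both groups here
```

If no batch contains both tumor and normal samples, the batch effect cannot be separated from the condition effect and a covariate will not help. Otherwise, in R-based tools such as DESeq2 the covariate goes into the design, e.g. `~ batch + condition`.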