Question

Tactics for analysis of a large number of highly similar diseased samples from RNA-seq & microarray data

9.9 years ago

chris86 ▴ 400

Hi

This is quite a broad question, but I hope to get some interesting answers. I am going to have a large number of RNA-seq and microarray samples (approximately 150-300) for a disease phenotype. We expect gene expression to be very similar between these patients, but there should be certain subpopulations that respond well or poorly to treatment. I am concerned that, although there will certainly be clinically relevant gene expression differences between subpopulations, detecting them is going to be hard work. For example, detecting differential expression with tools like limma/DESeq may give few DE genes between subpopulations because of substantial noise. I intend to filter the genes on variance to counter this problem, but I am aware that once I have filtered on variance I cannot use limma/DESeq etc. and will need to use an ordinary t-test or something similar. This is one tactic to increase statistical power. Does anyone have any good tips for analysing this type of data, where we do not expect huge expression differences between subgroups?

Thanks,
Chris

sequencing genome RNA-Seq gene R • 2.3k views

updated 2.5 years ago by Ram 45k • written 9.9 years ago by chris86 ▴ 400

Analysing diseased tissues is pretty tricky: human samples are highly variable due to age, gender, lifestyle factors, medications, etc., and tissue admixture is a common confounder. Some advice:

-LOTS of Ns (samples per group); more is always better.

-Gender and age matching of groups

-Restrict other factors such as medications and lifestyle factors in sample selection

-Use a GLM to correct for confounders you can't avoid, like age and gender.

-Expect tissue admixture to be a problem; you may be able to omit outliers on this basis.

-In general, limma/voom is more conservative than DESeq and edgeR. edgeR is the least conservative and sometimes calls outliers "significant"; the prior df parameter (prior.df) influences this heavily.

-Filtering on variance is not a particularly good idea. For RNA-seq, you can instead filter out genes that aren't expressed above an arbitrary level (e.g. 10 reads per sample on average). This reduces the FDR penalty because you are left with 15,000 expressed genes instead of 60,000 genes in the case of Ensembl annotation.
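
To illustrate the last point, here is a minimal Python sketch of an expression filter followed by a Benjamini-Hochberg (BH) adjustment. The gene names, counts and p-values are made up, and the helper names (filter_expressed, benjamini_hochberg) are hypothetical; in a real analysis you would let limma, edgeR or DESeq do this for you (edgeR's filterByExpr serves a similar purpose). The only point is that testing fewer genes lowers the adjusted p-values of the genes that remain.

```python
def filter_expressed(counts, min_mean=10):
    """Keep genes whose mean read count across samples is at least min_mean."""
    return {gene: row for gene, row in counts.items()
            if sum(row) / len(row) >= min_mean}


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy data (assumed, not from any real experiment): gene -> counts per sample.
counts = {
    "geneA": [120, 95, 130, 110],
    "geneB": [0, 1, 0, 2],
    "geneC": [15, 9, 12, 14],
    "geneD": [3, 0, 5, 1],
}
# Made-up raw p-values for the same genes.
pvals = {"geneA": 0.012, "geneB": 0.60, "geneC": 0.03, "geneD": 0.90}

kept = filter_expressed(counts, min_mean=10)

# Adjust across all four genes, then across the expressed genes only.
adj_all = dict(zip(pvals, benjamini_hochberg(list(pvals.values()))))
adj_kept = dict(zip(kept, benjamini_hochberg([pvals[g] for g in kept])))

for gene in kept:
    print(gene, "all genes:", round(adj_all[gene], 3),
          "expressed only:", round(adj_kept[gene], 3))
```

The filter uses only the overall mean count, not the group labels, so it does not peek at the comparison being tested. Filtering on something tied to the group difference itself would bias the p-values.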

updated 5.6 years ago by Ram 45k • written 9.9 years ago by mark.ziemann ★ 2.0k