Hi all,
I'm trying to do a differential expression analysis on RNA-seq data. I have data from controls, patients with a mutation in gene A, and patients with a mutation in gene B. Within each patient group, every patient has a slightly different mutation in the respective gene, and these are known to cause slightly different phenotypes.
My overall aim is to see how genes are differentially expressed between the groups and within the mutation groups, given that each point mutation is different. Since this is real patient data, I am unable to obtain biological replicates for each specific mutation, so I've grouped the samples into controls, patients with mutations in gene A, and patients with mutations in gene B. I've accounted for condition and gender in my design.
However, MDS plots and PCA both show that the samples within each mutation group don't cluster together (which is to be expected, given that the mutations in each gene are slightly different). I wanted to run glmLRT() from edgeR to perform a DE analysis for Control vs Gene A, Control vs Gene B, and Gene B vs Gene A, but I'm not sure if this is the best way to find what I'm looking for.
I would really like some advice on which differential expression pipeline would be best for what I'm trying to do, or whether glmLRT() from edgeR would suffice. I've been looking through previous posts on Biostars as well and haven't found anything relevant. If I have missed anything, please do share the link!
TL;DR: I have 3 groups: controls (7 biological replicates), patients with mutations in gene A (3 samples, each with a different mutation), and patients with mutations in gene B (4 samples, each with a different mutation), so there are no biological replicates for any individual mutation. What is the best design and pipeline to perform DE analysis given that each mutation is different?
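For reference, the three comparisons described in the question can be sketched roughly as follows. This is only a minimal illustration, not a recommended pipeline: the sex labels and the count matrix are made-up placeholders, the contrasts are written by hand as numeric vectors, and the edgeR part only runs if edgeR is installed. Note that with this grouping the dispersion is estimated across patients carrying different mutations, so heterogeneity between mutations ends up in the within-group variability.

```r
# Sample layout from the post: 7 controls, 3 gene A patients, 4 gene B patients.
group <- factor(rep(c("Control", "GeneA", "GeneB"), times = c(7, 3, 4)),
                levels = c("Control", "GeneA", "GeneB"))
# Hypothetical sex labels; replace with the real sample annotation.
sex <- factor(c("F", "M", "F", "M", "F", "M", "F",
                "F", "M", "F",
                "M", "F", "M", "F"))

# No-intercept coding for group, so each group gets its own coefficient.
design <- model.matrix(~ 0 + group + sex)
colnames(design) <- sub("^group", "", colnames(design))
# Columns are now: Control, GeneA, GeneB, sexM

# One contrast vector per comparison, in the column order of the design.
contrasts <- list(
  AvsControl = c(-1,  1, 0, 0),
  BvsControl = c(-1,  0, 1, 0),
  BvsA       = c( 0, -1, 1, 0)
)

if (requireNamespace("edgeR", quietly = TRUE)) {
  set.seed(1)
  # Simulated placeholder counts (1000 genes x 14 samples), NOT real data.
  counts <- matrix(rnbinom(1000 * 14, mu = 50, size = 5), nrow = 1000,
                   dimnames = list(paste0("gene", 1:1000), paste0("s", 1:14)))

  y <- edgeR::DGEList(counts = counts, group = group)
  keep <- edgeR::filterByExpr(y, design)       # drop lowly expressed genes
  y <- y[keep, , keep.lib.sizes = FALSE]
  y <- edgeR::calcNormFactors(y)               # TMM normalization
  y <- edgeR::estimateDisp(y, design)
  fit <- edgeR::glmFit(y, design)

  # One likelihood ratio test per comparison.
  results <- lapply(contrasts, function(cv) edgeR::glmLRT(fit, contrast = cv))
  print(edgeR::topTags(results$BvsA, n = 5))
}
```

Each element of `results` holds the test for one comparison, and `topTags()` ranks the genes for that comparison.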
One important thing you can do in your analysis is to estimate the number and identity of covariates in your samples. The aim is to control for other variables that influence gene expression besides the specific mutation you're interested in (for example, as you mentioned, mutations in other genes). You can do this with the R package sva.
Another important point is that this kind of analysis works well with a large number of patients, say 50, but is typically extremely underpowered with only a handful.
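As a rough sketch of how sva could be used here: you give it a full model (with the variable of interest) and a null model (without it), and it returns surrogate variables that you can add to the design. The sex labels, the simulated counts, and the choice of a single surrogate variable below are all assumptions for illustration; the sva part only runs if the package is installed. With so few samples, every extra covariate costs residual degrees of freedom, so it is worth keeping the number of surrogate variables small.

```r
# Sample layout from the question: 7 controls, 3 gene A, 4 gene B patients.
group <- factor(rep(c("Control", "GeneA", "GeneB"), times = c(7, 3, 4)),
                levels = c("Control", "GeneA", "GeneB"))
# Hypothetical sex labels; replace with the real sample annotation.
sex <- factor(c("F", "M", "F", "M", "F", "M", "F",
                "F", "M", "F",
                "M", "F", "M", "F"))

mod  <- model.matrix(~ group + sex)  # full model: variable of interest + known covariate
mod0 <- model.matrix(~ sex)          # null model: known covariate only

if (requireNamespace("sva", quietly = TRUE)) {
  set.seed(1)
  # Simulated placeholder counts (1000 genes x 14 samples), NOT real data.
  counts <- matrix(rnbinom(1000 * 14, mu = 50, size = 5), nrow = 1000)

  # n.sv = 1 is an arbitrary choice for this sketch; sva::num.sv() can
  # estimate the number of surrogate variables from the data instead.
  sv <- sva::svaseq(counts, mod, mod0, n.sv = 1)

  # Append the surrogate variables to the design used for the DE analysis.
  svs <- as.matrix(sv$sv)
  colnames(svs) <- paste0("SV", seq_len(ncol(svs)))
  design_sv <- cbind(mod, svs)
  print(colnames(design_sv))
}
```

The extended design (`design_sv` in this sketch) would then replace the original design matrix in the edgeR model fit.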
Thank you! Will give sva a shot!
Unfortunately, we only have data from a handful of patients (3 with mutations in gene A and 4 with mutations in gene B).
You can check this paper: "Combining gene mutation with gene expression data improves outcome prediction in myelodysplastic syndromes", Nature Communications, 2015: https://www.nature.com/articles/ncomms6901