I haven't found any relevant posts, so here is my question (I believe it is more of a statistics question):
I have two samples to compare (control vs. condition), with RNA-seq data that I analysed using DESeq2 to find differentially expressed genes (DEGs). I also have epigenetic data, from which I derived three characteristic chromatin groups (group a, group b, group c). In other words, I have BED files of genomic regions that fall into these three groups. The full genome is divided among the three groups, but they differ in size: both the individual regions and the fraction of the genome covered by each group vary.
Now my question is: how can I check whether more DEGs are present in group a than in the other groups? I really don't know how to approach this from a statistical point of view. There are several factors I am confused about:
- the groups (a, b, c) are not equal in size and are not evenly distributed across the genome, so simply checking whether one group contains more DE genes than another wouldn't be accurate, because it doesn't take this into account;
- it is possible that one group simply contains more genes overall than another, and I don't know how to take that into account either.
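One possible way to frame both concerns is to count genes instead of genomic length: if every tested gene is assigned to exactly one group, the number of genes per group already reflects both the group's size and its gene density. Below is a minimal sketch of a hypergeometric over-representation test under that assumption. The function names and all counts in the example are hypothetical placeholders, not results from this dataset.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N
    background genes, K of which lie in the group of interest."""
    lo = max(k, 0, n - (N - K))   # smallest feasible count at or above k
    hi = min(K, n)                # largest feasible count
    log_denom = log_comb(N, n)
    total = sum(
        exp(log_comb(K, i) + log_comb(N - K, n - i) - log_denom)
        for i in range(lo, hi + 1)
    )
    return min(1.0, total)        # guard against rounding slightly above 1


# HYPOTHETICAL numbers, for illustration only.
n_background = 20000   # all genes that were tested for differential expression
n_in_group = 5000      # background genes assigned to the group of interest
n_degs = 1000          # significant genes
n_degs_in_group = 300  # significant genes assigned to the group

expected = n_degs * n_in_group / n_background
p_value = hypergeom_upper_tail(n_degs_in_group, n_background, n_in_group, n_degs)
print(f"expected by chance: {expected:.0f}, observed: {n_degs_in_group}, p = {p_value:.3g}")
```

The background here is the set of genes actually tested, not the whole genome, so a group that is large or gene-rich is expected to hold proportionally more DEGs by chance.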
I am considering 1000 DEGs for the analysis, and the data are from the human genome. I look at both protein-coding and non-coding genes. I would probably also be interested in whether up-regulated and down-regulated genes show different patterns.
Could anyone help me understand how I should approach this question? Should I perform some sort of statistical test here, and if so, what kind? Could anyone point me to some resources (books, tutorials, courses, etc.) that could give me an idea of how to approach this kind of question? Which area of statistics should I dive into?
I don't know if this could be considered multi-omics data integration, but if so, I would be glad to get recommendations for materials on that subject too.
Many thanks for all the replies and help, hope my question was clear enough!
Hey @LauferVA,
Thanks a ton for your prompt reply! That was indeed very helpful! If you don’t mind, I would still like to ask you a couple of questions, as I am learning how to approach the matter. Thanks in advance!
I have checked the gene set types as you proposed, and I see that there are indeed many ways to define a gene set. However, I am getting confused and a bit lost here. Let’s say I have the DESeq2 output for around 25,000 genes, of which 1000 are significantly differentially expressed (these are the ones I want to check). Now, I should group these 25,000 genes by the epigenetic states in my data (as I mentioned, I divide the full genome into 3 groups, and I also have a 4th group of regions whose group I couldn’t identify due to lack of coverage, etc.). So of my 25,000 genes, roughly 12,000 belong to group a, 8000 to the 4th (unidentified) group, 4500 to group b and 500 to group c. What I do now is define 4 gene sets from these 25,000 genes, one per group, so 4 terms in total. Then I take my 1000 DEGs, rank them, and use the prerank approach to get the enrichment results. I decided to go with the GSEA approach and implementation. Here are the points I am not so sure about:
Sorry for asking so many questions again, but as I am only now diving into this subject, I still feel pretty unsure and would greatly appreciate your help and tips. I deeply appreciate your willingness to help; unfortunately I work alone and don’t have colleagues to discuss this subject with. Thanks a lot for your help!
best regards, Al
Update, @LauferVA:
I have briefly tested the hypergeometric test first and then the prerank approach. In the first case, only one group had a significant adjusted p-value, while the rest were 1.000000e+00. However, it is also the biggest group, so I suspect this is a statistical artefact rather than a real result. With prerank, too, I get quite large FDR values in the output, and I think this is because my gene sets are either very large or very small, so maybe the GSEA implementation of prerank isn’t a good fit in my case. I am not sure what is going wrong here.
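A quick sanity check that might help separate a pure group-size effect from a real signal, sketched under stated assumptions: the group sizes are the approximate ones quoted above (12,000, 8000, 4500 and 500 genes out of 25,000), while the observed DEG counts per group are hypothetical placeholders, since the real ones are not given here. The sketch computes how many of the 1000 DEGs each group would receive by chance, the observed/expected ratio, and a global chi-square goodness-of-fit p-value (3 degrees of freedom for 4 groups). The chi-square test treats the DEGs as independent draws, so it is only an approximation.

```python
from math import erfc, exp, pi, sqrt

# Gene counts per chromatin group, as quoted in the post (approximate).
group_sizes = {"a": 12000, "unidentified": 8000, "b": 4500, "c": 500}
n_degs = 1000

# HYPOTHETICAL observed DEG counts per group (placeholders, not real results).
observed = {"a": 540, "unidentified": 290, "b": 155, "c": 15}


def expected_counts(sizes, n_selected):
    """DEGs expected in each group if DEGs were spread in proportion to gene count."""
    total = sum(sizes.values())
    return {group: n_selected * size / total for group, size in sizes.items()}


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with exactly 3 degrees of freedom."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)


expected = expected_counts(group_sizes, n_degs)
chi2_stat = sum((observed[g] - expected[g]) ** 2 / expected[g] for g in group_sizes)
p_global = chi2_sf_3df(chi2_stat)

for g in group_sizes:
    ratio = observed[g] / expected[g]
    print(f"{g:>12}: observed {observed[g]:4d}, expected {expected[g]:6.1f}, ratio {ratio:.2f}")
print(f"chi-square = {chi2_stat:.2f}, global p = {p_global:.3g}")
```

Because the per-group counts must add up to 1000, an excess in one group implies a deficit in others, so a one-sided enrichment test may well return p-values close to 1 for the depleted groups even when the test itself is working as intended.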