how to check if DE genes are enriched in custom genomic regions derived from epigenetic data
1
0
Entering edit mode
15 months ago
Al • 0

I haven't found any relevant posts, so here is my question (it's more a question on statistics I believe):

I have 2 samples to compare ( control vs condition), and I had RNA-seq data that I analysed using Deseq2 to find differentially expressed genes (DEGs) . Moreover, I have some epigenetic data, from which I derived characteristic groups of chromatin ( group a, group b, group c). In other words, I have bed files of the genomic regions of different sizes that fall into these 3 groups. The full genome is divided in these 3 groups, but they have different sizes ( both individual regions might vary in size and the fraction of genome per group).

Now my question is: How to check if there is more DEGs present in group a compared to other groups? I really don't know how to approach this question from the statistical point of you. There are multiple factors I am confused about:

  1. groups (a,b,c) are not equal in size and they are not evenly distributed across the genome, so just checking if there is more DE genes in one group over another wouldn't be accurate because it doesn't take it into the account;
  2. it is possible that there are in general more genes present in one group compared to another, and I don't know how to take it into the account for this neither.

I consider 1000 DEGs for analysis, and it is a human genome. I look at both protein coding and not coding genes. I would probably also be interested in seeing if there are different patterns for up-regulated and down-regulated genes.

Could anyone help me to understand how I should approach this question? Should I perform some sort of statistical testing here, if yes what kind? Could anyone point me to some resources (books, tutorials, courses etc) that could give me an idea how to approach this kind of questions. What section of statistics I should dive into?

I don't know if this could be considered multiomics data integration, but if yes I would be glad for recommendations of the materials on this subject too.

Many thanks for all the replies and help, hope my question was clear enough!

statistics multiomics epigenetics • 931 views
ADD COMMENT
2
Entering edit mode
15 months ago
LauferVA 4.5k

Al ,

first head to the msigdb website, and take a look at the annotations that belong to set C1: https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp#C1. Note that, while C1 contains gene sets defined based on position, the others are quite different (e.g. C2 is "curated", C8 is "cell type signatures") and so forth. the key here is to realize that what you are describing is just another way of defining a gene set...and as such is subject to the same types of methods others have used for gene sea enrichment analysis, or GSEA.

step 1, define gene sets: It looks like youve gotten this far. you need to first construct your gene sets of interest by dividing all genes into an in and outgroup. this is very similar to what was done in C1, except instead of lists of genes demarcated by cytobands, you are thinking of defining groups of genes based on genomic positions corresponding to epigenetic states from your data. as you have aptly put, the number (and identity) of genes not belonging to this group are important too, more on that below.

step 2, select a statistical methodology: you are describing enrichment testing; among the simplest instantiations of which is the hypergeometric test. this is a good place to start reading. over time, people began to further adapt and tailor these; for instance, the bioinformatics/biostatitical literatures detail weighted hypergeometric tests, directed hypergeometric tests, etc.; granted your treatment of the subject above, these may be of some interest to you. ultimately, the field converged on a small number of relatively robust gene set enrichment tests, for instance gsea, and others.

step 3, select an implementation of the method chosen in 2, or implement your own: im going on record here as saying you should not implement something like this on your own - there is just no reason to. this is a well-studied problem, and very well articulated software packages already implement the tests you'd likely choose. as a training exercise could be worthwhile, but if your goal is just to get the science done correctly, id recommend choosing a well-vetted implementation rather than re-inventing the wheel. others have stronger ideas on what the best implementations are than i do; for instance ATpoint has written very thorough posts on this in the past.

further, more sophisticated analyses are certainly possible, based on network theory (e.g. aracne) or based on other ideas from integrated omics. For the latter, although this link pertains specifically to single cell data, it should give you some ideas nonetheless.

that help?

ADD COMMENT
0
Entering edit mode

hey @LauferVA, LauferVA

Thanks a ton for your prompt reply! That indeed was very helpful! If you don’t mind, I still would like to ask you a couple of question as I am learning how to approach the matter. thanks in advance!!

I have checked the gene sets types as you proposed, and I see that indeed there are many ways to define a gene set. However, I am getting confused here and a bit lost. Let’s say, I have the deseq2 output of DE genes( around 25000 genes) , and among them 1000 are significantly differentiated(and I want to check them). Now, I should group these 25000 genes by the epigenetic states of my data( as I mentioned, i divide the full genome into 3 groups and I also have 4th group which are regions whose group I couldn’t identify due to lack of coverage etc ). So now I have my 25,000 genes where roughly 12000 belong to group a, 8000 belong to group 4 unidentified, 4500 belong to group b and 500 belong to group c. Now, what i do , from these 25000 genes I define my 4 gene sets where each group is one term so 4 in total, then i take my 1000 DEGs , i rank them and then I use prerank approach to get the enrichment results. I decided to go with GSEA approach and implementation. Here are my questions I am not so sure about:

  1. as you see, gene sets by epigenetic state that I use are really variable in terms of size, while in GO they seem to be more or less in the similar range, same for the case of C1 demarcated by cytobands. do you think it poses a problem, and if yes what would be the correct approach here?
  2. as i mentioned earlier, because there are many genes that i cannot classify into any of my groups, i use it as a separate group. what do you think of it? it’s pretty large so i can’t just ignore it, but i am not sure how exactly it would affect analysis. I’d like to know and i would greatly appreciate your opinion as an expert.
  3. as you can see, i’ve chosen all genes detected by deseq2 for gene sets, and significant ones as my ranked list. Do you think it’s a correct approach in my case?
    1. Once I ran the analysis , the output gave me only 2 lines with group None as a top line and group b as a second line, and there were no other groups. I am kinda worried if it’s reliable as I can clearly see that in classical usage cases the gene sets have much more terms and the output is bigger too. Do you think there is an issue here?
    2. One last question is concerning separation of upregulated/downregulated genes. When i rank my genes, I use ‘stat’ as a variable to rank by. so i have the most upregulated genes at the top and the most downregulated genes at the bottom. Now i wonder if it’s a common practice to rather run separate analyses for upregulated and down regulated genes? I suspect there might be different behaviour, but I don’t have any proofs on that. Or maybe if I rank genes rather by adjusted p value it would resolve the question, what do you think?

Sorry for lots of questions again, but as I am diving into this subject now, I still feel pretty unsure and i would greatly appreciate your help and tips. I deeply appreaciate your will to help, I unfortunately work alone and don’t have colleagues to discuss this subject with anyone. Thanks a lot for your help!!

best regards, Al

ADD REPLY
0
Entering edit mode

update: LauferVA

I have tested a bit hypergeometric test first and prerank approaches, but, in first case I had only one group having significant adjusted p-value while the rest were 1.000000e+00.. however, it’s also the biggest group so I suspect it’s just the problem of statistics rather than the actual result. In case of prerank too, I get quite big FDR values for the output, and I think it’s due to the fact that my gene sets are either very huge or very little, so maybe GSEA implementation of prerank isn’t really a good fit in my case. I am not sure what's wrong here now

ADD REPLY

Login before adding your answer.

Traffic: 1663 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6