I performed a GO enrichment analysis on a list of DE genes and ended up with ~20 enriched GO terms. When I check the DE gene names under each GO term, almost all of the genes have the same name or are isoforms of one another, so it is no surprise that they share the same 'GO labels'. What I expected from a GO enrichment test is that a spectrum of different genes belonging to a common category would cluster together, but in my case they do not.
I am totally lost at this point. Does it mean my result is not informative at all since it only reflects the DE pattern of a very limited number of genes? Is there a way to improve this?
Based on what you describe, it would seem that the outcome is biased by redundancy, i.e. the same gene is represented by multiple IDs. This could be the case if you're working with transcript IDs. Try working with gene IDs instead.
That's true. The problem is that I am working with a non-model organism. In the genome GFF file I created, isoforms are annotated as distinct genes with distinct IDs, so I don't know whether I could pool them under the same gene. It would be hard to do.
Yes, you can. You could try clustering sequences/isoforms into "genes".
I am not sure how to do so. They have been annotated as distinct 'genes' in the GFF file, with no isoform tags or anything else I can use to distinguish them. The only thing I know is that they usually have adjacent ID numbers and are annotated with the same or similar gene names. I can't simply cluster them by name, since that would also group distantly located paralogs.
Even if you don't have the sequences, the GFF file should contain start and end positions so you could cluster things that overlap. Hard to tell what's possible or not without knowing what kind of data you have access to.
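To make the overlap idea concrete, here is a minimal sketch in Python. It assumes a GFF3-style file with tab-separated columns, that the redundant entries have the feature type "gene", and that each carries an ID attribute in column 9; none of that is known from your description, so adjust as needed. The function names and the "locus_N" labels are made up for illustration. Entries on the same sequence and strand whose coordinates overlap get the same locus label, so a distant paralog with the same name stays separate.

```python
from collections import defaultdict


def parse_gff_genes(lines, feature_type="gene"):
    """Yield (seqid, strand, start, end, feature_id) for matching GFF3 rows.

    Assumes tab-separated GFF3 with key=value attributes in column 9.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        yield cols[0], cols[6], int(cols[3]), int(cols[4]), attrs.get("ID")


def cluster_overlapping(records):
    """Map each feature ID to a locus label (hypothetical naming scheme).

    Features on the same sequence and strand that overlap, directly or
    through a chain of overlaps, share one label.
    """
    by_group = defaultdict(list)
    for seqid, strand, start, end, fid in records:
        by_group[(seqid, strand)].append((start, end, fid))

    id_to_locus = {}
    locus_count = 0
    for key in sorted(by_group):
        current_end = None
        # Sweep left to right, extending the current locus while features overlap.
        for start, end, fid in sorted(by_group[key], key=lambda r: (r[0], r[1])):
            if current_end is None or start > current_end:
                locus_count += 1
                current_end = end
            else:
                current_end = max(current_end, end)
            id_to_locus[fid] = f"locus_{locus_count}"
    return id_to_locus


# Toy records, invented for illustration only.
example = [
    "chr1\tsrc\tgene\t100\t500\t.\t+\t.\tID=g1;Name=abc",
    "chr1\tsrc\tgene\t120\t480\t.\t+\t.\tID=g2;Name=abc",
    "chr1\tsrc\tgene\t450\t900\t.\t+\t.\tID=g3;Name=abc-like",
    "chr1\tsrc\tgene\t5000\t5600\t.\t+\t.\tID=g4;Name=abc",
    "chr1\tsrc\tgene\t150\t400\t.\t-\t.\tID=g5;Name=xyz",
]
loci = cluster_overlapping(parse_gff_genes(example))
for gene_id in sorted(loci):
    print(gene_id, loci[gene_id])
```

With a mapping like this you could replace each ID in both the DE list and the background set with its locus label, drop duplicates, and rerun the enrichment. Be aware that chaining overlaps can also merge genuinely distinct genes that happen to overlap, so it is worth spot-checking a few loci against the gene names.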