Question

Many paralogous genes in GO enrichment results

4.5 years ago

tianshenbio ▴ 180

I performed a GO enrichment analysis on a list of DE genes and ended up with ~20 enriched GO terms. When I checked the DE genes under each GO term, almost all of them have the same name or are isoforms of the same gene, so it makes sense that they share the same GO labels. What I expected from a GO enrichment test is that a range of different genes belonging to a common category would cluster together, but that is not what I see.

I am totally lost at this point. Does it mean my result is not informative at all since it only reflects the DE pattern of a very limited number of genes? Is there a way to improve this?

RNA-Seq GO enrichment gene • 877 views

ADD COMMENT • link 4.5 years ago by tianshenbio ▴ 180

Based on what you describe, it would seem that the outcome is biased by redundancy, i.e. the same gene is represented by multiple IDs. This could be the case if you're working with transcript IDs. Try working with gene IDs instead.

ADD REPLY • link 4.5 years ago by Jean-Karim Heriche 27k
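As a rough illustration of the suggestion above, here is a minimal sketch of collapsing a DE list to one entry per gene before running the enrichment. It assumes some ID-to-gene mapping is available; the function name `collapse_to_genes` and all IDs are hypothetical and not taken from the thread.

```python
def collapse_to_genes(de_ids, id_to_gene):
    """Map each DE feature ID to its gene and drop duplicates,
    keeping the order in which genes are first seen."""
    seen = set()
    genes = []
    for feature_id in de_ids:
        # Assumption: IDs missing from the mapping are kept unchanged.
        gene = id_to_gene.get(feature_id, feature_id)
        if gene not in seen:
            seen.add(gene)
            genes.append(gene)
    return genes


# Hypothetical example data: two isoforms of geneA, one of geneB,
# and one ID with no known mapping.
id_to_gene = {"tx1.1": "geneA", "tx1.2": "geneA", "tx2.1": "geneB"}
de_ids = ["tx1.1", "tx1.2", "tx2.1", "tx3.1"]
print(collapse_to_genes(de_ids, id_to_gene))
```

The background gene set used by the enrichment test would need to be collapsed the same way, otherwise the DE list and the background are counted in different units.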

That's true. The problem is that I am working with a non-model organism. In the genome GFF file I created, isoforms are annotated as distinct genes with distinct IDs, so I don't know if I could pool them under the same gene. It would be hard to do.

ADD REPLY • link 4.5 years ago by tianshenbio ▴ 180

"So I don't know if I could pool them under the same gene"

Yes, you can. You could try clustering sequences/isoforms into "genes".

ADD REPLY • link 4.5 years ago by Jean-Karim Heriche 27k

I am not sure how to do that. They have been annotated as distinct 'genes' in the GFF file, with no isoform tags or anything else I can use to distinguish them. The only thing I know is that they usually have adjacent ID numbers and are annotated with the same or similar gene names. I can't simply cluster them by name, since that would also group distantly located paralogs.

ADD REPLY • link 4.5 years ago by tianshenbio ▴ 180

Even if you don't have the sequences, the GFF file should contain start and end positions so you could cluster things that overlap. Hard to tell what's possible or not without knowing what kind of data you have access to.

ADD REPLY • link 4.5 years ago by Jean-Karim Heriche 27k
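For readers who want to try the overlap idea from the last reply, here is a minimal sketch, not the method of anyone in the thread. It assumes a tab-separated GFF3 file in which the isoforms appear as `gene` features carrying an `ID=` attribute, and it treats any features that overlap on the same sequence and strand as one locus. The function names and the example rows are hypothetical.

```python
from collections import defaultdict


def parse_gff_genes(lines, feature_type="gene"):
    """Yield (seqid, strand, start, end, feature_id) for matching GFF3 rows."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        if "ID" in attrs:
            yield cols[0], cols[6], int(cols[3]), int(cols[4]), attrs["ID"]


def cluster_overlapping(records):
    """Group features overlapping on the same sequence and strand into loci.

    Returns a dict mapping each feature ID to a locus label.
    """
    by_contig = defaultdict(list)
    for seqid, strand, start, end, feature_id in records:
        by_contig[(seqid, strand)].append((start, end, feature_id))

    feature_to_locus = {}
    locus_count = 0
    for key in sorted(by_contig):
        current_end = None
        # Sorting by start lets one pass merge chains of overlapping features.
        for start, end, feature_id in sorted(by_contig[key]):
            if current_end is None or start > current_end:
                locus_count += 1  # no overlap with the running locus: open a new one
                current_end = end
            else:
                current_end = max(current_end, end)
            feature_to_locus[feature_id] = f"locus{locus_count}"
    return feature_to_locus


# Hypothetical rows: g1 and g2 overlap; g3 has the same name but lies far away;
# g4 overlaps g1 and g2 but is on the opposite strand.
example_gff = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t100\t500\t.\t+\t.\tID=g1;Name=abc",
    "chr1\tsrc\tgene\t300\t800\t.\t+\t.\tID=g2;Name=abc",
    "chr1\tsrc\tgene\t5000\t6000\t.\t+\t.\tID=g3;Name=abc",
    "chr1\tsrc\tgene\t400\t700\t.\t-\t.\tID=g4;Name=xyz",
]
loci = cluster_overlapping(parse_gff_genes(example_gff))
print(loci)
```

Because grouping is by position and not by name, a distant paralog such as `g3` stays in its own locus even though it shares a name with `g1` and `g2`. The caveat is that genuinely distinct genes overlapping on the same strand would also be merged, so the resulting loci are worth spot-checking before rerunning the enrichment.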