I have a basic question about what test/reference sets can be used for GO enrichment analysis. All of the studies I come across ask whether certain gene subsets are enriched for a GO term. Is it appropriate to ask if a transcript subset is enriched? Or would that lead to some skewing of the statistics for/against genes with multiple isoforms?
I ask because I am working with a non-model organism (i.e. I need to do my own GO annotation) and would like to know if any of the genes/transcripts that are differentially expressed between two conditions are enriched for specific GO terms. I have a draft genome, a draft transcriptome (annotated using blast2go), and mRNA-Seq data. However, I find that there are several situations where a given gene with multiple isoforms has different GO-terms associated with each isoform.
My specific questions:
- Is it appropriate to do transcript-level GO enrichment analysis?
- Any references to studies that have done this successfully before?
- Alternatively, I could run a gene-level analysis if someone could suggest how to "collapse" different isoforms into a single sequence for use as input for blast2go :)
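For the gene-level route, one option that avoids building a consensus sequence is to collapse at the annotation level rather than the sequence level: annotate each isoform separately, then give the gene the union of its isoforms' GO terms. This swaps sequence collapsing for annotation collapsing, so it is not the same thing as feeding one sequence per gene to blast2go. A minimal sketch, assuming a transcript-to-gene mapping is available; the function name and all IDs below are hypothetical:

```python
from collections import defaultdict


def collapse_to_gene_level(transcript_go, transcript_to_gene):
    """Give each gene the union of the GO terms of all its isoforms."""
    gene_go = defaultdict(set)
    for transcript_id, go_terms in transcript_go.items():
        gene_id = transcript_to_gene[transcript_id]
        gene_go[gene_id].update(go_terms)
    return dict(gene_go)


# Hypothetical transcript-level annotation, for illustration only.
transcript_go = {
    "geneA.t1": {"GO:0005524", "GO:0004672"},
    "geneA.t2": {"GO:0005524"},
    "geneB.t1": {"GO:0003677"},
}
# Hypothetical transcript-to-gene mapping (e.g. derived from the genome annotation).
transcript_to_gene = {
    "geneA.t1": "geneA",
    "geneA.t2": "geneA",
    "geneB.t1": "geneB",
}

gene_go = collapse_to_gene_level(transcript_go, transcript_to_gene)
for gene_id in sorted(gene_go):
    print(gene_id, sorted(gene_go[gene_id]))
```

Taking the union means a gene carries a term even if only one of its isoforms does, which is exactly the isoform-specific information a transcript-level analysis would keep separate.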
Thanks for pointing out that the Gene Ontology Consortium emphasizes gene products throughout its website (though the original 2000 Nature Genetics paper was not as precise). For bookkeeping, here's some useful information about how the GOC has thought about dealing with gene-level vs gene-product-level information:
http://wiki.geneontology.org/index.php/Annotation_of_Alternate_Spliceforms
I understand that various programs allow one to "customize" the background reference set (e.g. blast2go). I still wonder, however, whether taking alternative splicing information into account during gene enrichment analysis results in better/worse biological insight. So if anyone is aware of a study...
Reading through this post, I couldn't work out how it is possible to associate a specific GO term with a specific splice isoform of a gene.
blast2go is on the verge of being commercialized (they have started selling PRO versions), and my previous experience with it was not so good. I preferred transcript-level GO enrichment, as it was more informative and meaningful, using domain-based GO terms predicted by InterPro. Using these custom annotations was tricky for visualisation, but thanks to BiNGO I was able to do it flawlessly. For future use I have documented it here: http://infoplatter.blogspot.in/2014/04/gene-ontology-go-enrichment-analysis-in.html
Hiya,
Sorry to post here, but I posted this question on another post and then saw this one:
I have a blast database for GO terms in blast2go which includes IDs for all the isoforms of the genes. When I make my gene lists for GO enrichment analysis, the list compiler pulls the IDs of all the isoforms associated with the genes of interest (DEGs). My question is: should I
(a) De-duplicate the list so just one ID per gene is input into the GO enrichment analysis
or
(b) Submit the full list containing the IDs of all the isoforms for each gene of interest?
I have run both, and the de-duplicated list, as I anticipated, yields fewer GO terms than the full list containing all the isoforms.
I feel it is correct to run the full list of IDs (option b). Otherwise the enrichment test could be negatively biased for terms that have many isoforms in the database but only one submitted in the list, making the GO term look less enriched than it actually is (I hope that makes sense). Having read the answer above, I think this is the correct way to run the enrichment rather than collapsing the list; I just wanted to check that I understood your reply to the question above correctly.
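To make the concern concrete, here is a rough sketch of the one-sided hypergeometric test that GO enrichment tools commonly use, applied to the same term three ways: a de-duplicated list against an isoform-level background, the full isoform list against that background, and genes against a gene-level background. All counts are made up for illustration, the function name is my own, and I am assuming (not asserting) that blast2go's test behaves like this one:

```python
from math import comb


def hypergeom_enrichment_p(N, K, n, k):
    """One-sided p-value P(X >= k) for a hypergeometric draw.

    N: IDs in the background, K: background IDs carrying the GO term,
    n: IDs in the test list, k: test-list IDs carrying the GO term.
    """
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


# Made-up counts for a single GO term (illustration only).
# Isoform-level background: 2000 transcripts, 100 of them annotated.
# Gene-level background: 1000 genes, 40 of them annotated.
# DE list: 100 genes (200 isoforms); 8 genes / 20 isoforms annotated.

# (a) de-duplicated list tested against the isoform-level background (mixed levels)
p_mixed = hypergeom_enrichment_p(N=2000, K=100, n=100, k=8)
# (b) full isoform list tested against the isoform-level background
p_transcript = hypergeom_enrichment_p(N=2000, K=100, n=200, k=20)
# gene-level list tested against a gene-level background
p_gene = hypergeom_enrichment_p(N=1000, K=40, n=100, k=8)

print(f"mixed levels:     {p_mixed:.3g}")
print(f"transcript level: {p_transcript:.3g}")
print(f"gene level:       {p_gene:.3g}")
```

With these toy numbers the mixed-level test comes out less significant than the matched transcript-level one, which is the bias I describe above. The general point is that the test list and the background should be counted at the same level, whichever level is chosen.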
Best wishes and any opinions/advice are greatly appreciated,
Rebekah