Transcript-Level Versus Gene-Level GO Enrichment Analysis (For Non-Model Organism)
12.2 years ago
dsbreak ▴ 170

I have a basic question about what test/reference sets can be used for GO enrichment analysis. All of the studies I come across ask whether certain gene subsets are enriched for a GO term. Is it appropriate to ask if a transcript subset is enriched? Or would that lead to some skewing of the statistics for/against genes with multiple isoforms?

I ask because I am working with a non-model organism (i.e. I need to do my own GO annotation) and would like to know if any of the genes/transcripts that are differentially expressed between two conditions are enriched for specific GO terms. I have a draft genome, a draft transcriptome (annotated using blast2go), and mRNA-Seq data. However, I find that there are several situations where a given gene with multiple isoforms has different GO-terms associated with each isoform.

My specific questions:

  1. Is it appropriate to do transcript-level GO enrichment analysis?
  2. Any references to studies that have done this successfully before?
  3. Alternatively, I could run a gene-level analysis if someone could suggest how to "collapse" different isoforms into a single sequence for use as input for blast2go :)
go enrichment transcript isoform • 12k views
12.2 years ago

I think you have answered your own question when you observe that there are many genes with multiple transcript/protein isoforms where each isoform has different GO annotations. This is because the Gene Ontology attaches terms from its three ontologies (molecular function, biological process and cellular component) to gene products, not genes. In other words, terms are associated with specific protein isoforms.

In many cases people have information only at the gene-locus level (e.g., their expression arrays don't do a good job of measuring specific transcripts), or, if they have transcript-level data, they map those transcripts to the gene-locus level rather than the protein isoform level. However, if you do have good transcript-level data I would argue that it is better to map those to the corresponding protein isoform (e.g., UniProt) and use that as input for your Gene Ontology analysis. Most GO over-representation software will allow you to upload your own "total/complete" list from which your protein subset was derived. This will prevent the skewing of statistics that you are quite wisely concerned about.

As an illustrative example, check out DAVID. Choose their 'Functional Annotation' (gene-annotation enrichment analysis) tool and you will see that you can upload many different types of transcript IDs or protein IDs for both your "gene list" and your "background" list of interest. Running their statistics will tell you which GO terms are over-represented in your subset of transcript/protein IDs relative to the total/background list. Most GO enrichment tools follow this pattern. You can explore a list maintained by GO here.

All of this was a really long way of answering your first question: YES - it is appropriate to do transcript-level GO enrichment analysis.

For your second question, there must be many references for this. Unfortunately, it is so common now that most people don't really explain what they have done in their publications.

For your third question, given the above, I would not "collapse" different isoforms.
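To make the "subset versus total/background list" idea concrete, here is a minimal sketch of a one-sided hypergeometric over-representation test in plain Python. It is not what DAVID or any particular tool runs internally; the isoform IDs and the term names `GO:A` and `GO:B` are made-up placeholders, and no multiple-testing correction is applied.

```python
from math import comb


def upper_tail(k, N, K, n):
    """P(X >= k) for a hypergeometric draw: n IDs sampled from N, of which K carry the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def enrich(subset, background, annotations):
    """Return {term: p-value} for every term seen in the subset.

    All three arguments must use the same kind of ID (here: isoform IDs).
    `annotations` maps ID -> set of GO terms; IDs without terms may be absent.
    """
    background = set(background)
    subset = set(subset) & background  # the subset must be drawn from the background
    N, n = len(background), len(subset)
    terms = set().union(*(annotations.get(i, set()) for i in subset))
    pvalues = {}
    for term in terms:
        K = sum(1 for i in background if term in annotations.get(i, ()))
        k = sum(1 for i in subset if term in annotations.get(i, ()))
        pvalues[term] = upper_tail(k, N, K, n)
    return pvalues


# Toy data: hypothetical isoform IDs and placeholder term names.
annotations = {
    "g1.t1": {"GO:A"},
    "g1.t2": {"GO:B"},
    "g2.t1": {"GO:A"},
    "g3.t1": {"GO:A", "GO:B"},
    "g3.t2": set(),
    "g4.t1": {"GO:B"},
}
background = list(annotations)          # every isoform that was tested
subset = ["g1.t1", "g2.t1", "g3.t1"]    # e.g. the differentially expressed isoforms

for term, p in sorted(enrich(subset, background, annotations).items()):
    print(term, p)
```

The point of the sketch is only that the subset and the background are counted in the same unit (isoforms), so a gene with many isoforms contributes to both sides of the comparison.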

Your situation of not having a model organism creates a lot more challenges. I've never worked with blast2go, but I suppose that if you have a complete set of transcripts and get functional annotations for many of them from blast2go, you should be able to build your own transcript-annotation database and use that for over-representation analysis of subsets of transcripts versus the total list. This will likely require custom analysis as opposed to tools like DAVID. I suggest you investigate Bioconductor packages like GOstats. They actually have a short vignette for your situation. This thread looks really helpful for someone trying to figure that vignette out for the first time.
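As a rough illustration of what "build your own transcript-annotation database" could look like, the sketch below reads a two-column, tab-separated table into a transcript-to-GO mapping and its inverse. The file layout (transcript ID, then `;`-separated GO terms) is an assumption made for this example, not the actual blast2go export format, and this is plain Python rather than the GOstats workflow.

```python
import csv
import io
from collections import defaultdict


def load_annotations(handle):
    """Read rows of `transcript_id<TAB>term;term;...` into {transcript_id: set_of_terms}."""
    annotations = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        transcript_id = row[0].strip()
        terms = row[1].split(";") if len(row) > 1 else []
        # Unannotated transcripts are kept with an empty set so that they
        # still count towards the background list.
        annotations[transcript_id].update(t.strip() for t in terms if t.strip())
    return dict(annotations)


def invert(annotations):
    """Turn {transcript: terms} into {term: transcripts}."""
    term_to_transcripts = defaultdict(set)
    for transcript_id, terms in annotations.items():
        for term in terms:
            term_to_transcripts[term].add(transcript_id)
    return dict(term_to_transcripts)


# Hypothetical table contents; a real file would be opened with open(path, newline="").
EXAMPLE = (
    "# transcript_id\tgo_terms\n"
    "geneA.t1\tGO:A;GO:B\n"
    "geneA.t2\tGO:B\n"
    "geneB.t1\t\n"
)

annotations = load_annotations(io.StringIO(EXAMPLE))
by_term = invert(annotations)
print(annotations)
print(by_term)
```

Keeping the unannotated transcripts in the mapping matters because they are still part of the "total" list that the subset is compared against.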

Thanks for pointing out that the Gene Ontology Consortium emphasizes gene products throughout its website (though the original 2000 Nature Genetics paper was not as precise). For bookkeeping, here's some useful information about how the GOC has thought about dealing with gene-level vs gene-product-level information:

http://wiki.geneontology.org/index.php/Annotation_of_Alternate_Spliceforms

I understand that various programs allow one to "customize" the background reference set (e.g. blast2go). I still wonder, however, whether taking alternative splicing information into account during gene enrichment analysis results in better/worse biological insight. So if anyone is aware of a study...

Reading through this post, I couldn't understand how it is possible to associate a specific GO term with a specific splice isoform of a gene.

blast2go is on the verge of being commercialized (they have started selling PRO versions), and my previous experience with it was not so good. I preferred transcript-level GO enrichment, as it was more informative and meaningful, using domain-based GO terms predicted by InterPro. Using these custom annotations for visualisation was tricky, but thanks to BiNGO I was able to do it flawlessly. For future use I have documented it here: http://infoplatter.blogspot.in/2014/04/gene-ontology-go-enrichment-analysis-in.html

Hiya,

Sorry to post here, but I posted this question on another post and then saw this one:

I have a blast database for GO terms in blast2go which includes IDs for all the isoforms of the genes. When I make my gene lists for GO enrichment analysis, the list compiler pulls the IDs of all the isoforms associated with the genes of interest (DEG). My question is: should I

(a) De-duplicate the list so just one ID per gene is input into the GO enrichment analysis

or

(b) Submit the full list containing the IDs of all the isoforms for each gene of interest?

I have run both, and the de-duplicated list, as I anticipated, yields fewer GO terms than the full list containing all the isoforms.

I feel like it is correct to run the full list of IDs (option b), because otherwise the enrichment test could be negatively biased for terms where lots of isoforms are present in the database but only one is submitted, making it look like the GO term is less enriched than it actually is (I hope that makes sense). On reading the above answer I feel like this is the correct way to run enrichment rather than collapsing the list; I just wanted to check that I understood your reply to the question above correctly.

Best wishes and any opinions/advice are greatly appreciated,

Rebekah
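
The difference between options (a) and (b) can be seen by simply counting how often a term occurs in the list and in the background under each choice. The sketch below is a toy illustration, not a recommendation: the `<gene>.<isoform>` ID scheme, the gene names, and the terms `GO:X` and `GO:Y` are all hypothetical, and gene-level annotations are formed here by taking the union of isoform terms, which is only one possible way to collapse.

```python
def gene_of(transcript_id):
    # Assumption: IDs look like "<gene>.<isoform>", e.g. "geneA.t2".
    return transcript_id.rsplit(".", 1)[0]


def collapse_to_genes(annotations):
    """Gene-level annotation as the union of the terms of all its isoforms."""
    gene_annotations = {}
    for transcript_id, terms in annotations.items():
        gene_annotations.setdefault(gene_of(transcript_id), set()).update(terms)
    return gene_annotations


def count_term(ids, annotations, term):
    """Return (number of IDs carrying the term, number of IDs)."""
    ids = set(ids)
    return sum(1 for i in ids if term in annotations.get(i, ())), len(ids)


# Hypothetical isoform-level annotations: geneA has four isoforms, the others one.
annotations = {
    "geneA.t1": {"GO:X"}, "geneA.t2": {"GO:X"},
    "geneA.t3": {"GO:X"}, "geneA.t4": {"GO:X"},
    "geneB.t1": {"GO:X"}, "geneC.t1": {"GO:Y"}, "geneD.t1": {"GO:Y"},
}
de_genes = {"geneA", "geneC"}

# Option (b): every isoform of every gene of interest.
all_isoforms = [t for t in annotations if gene_of(t) in de_genes]

# Option (a): one isoform ID per gene of interest.
first_isoform = {}
for t in sorted(annotations):
    if gene_of(t) in de_genes:
        first_isoform.setdefault(gene_of(t), t)
one_per_gene = list(first_isoform.values())

gene_annotations = collapse_to_genes(annotations)

print("background, isoform level:", count_term(annotations, annotations, "GO:X"))
print("(b) all isoforms in list: ", count_term(all_isoforms, annotations, "GO:X"))
print("(a) one isoform per gene: ", count_term(one_per_gene, annotations, "GO:X"))
print("background, gene level:   ", count_term(gene_annotations, gene_annotations, "GO:X"))
print("list, gene level:         ", count_term(de_genes, gene_annotations, "GO:X"))
```

In this toy case, a de-duplicated list tested against an isoform-level background compares counts in two different units, whereas using isoforms on both sides, or genes on both sides, keeps the list and the background consistent. Which unit is appropriate depends on whether differential expression was called per gene or per transcript.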
