I have a list of UniProt IDs with their associated Gene Ontology (biological process) terms, which I obtained from uniprot.org. I am showing only one ID with its associated biological processes here, because the term lists for the other genes are lengthy.
O95831 activation of cysteine-type endopeptidase activity involved in apoptotic process; apoptotic DNA fragmentation; apoptotic process; cell redox homeostasis; chromosome condensation; DNA catabolic process; intrinsic apoptotic signaling pathway in response to endoplasmic reticulum stress; mitochondrial respiratory chain complex I assembly; NAD(P)H oxidase activity; neuron apoptotic process; neuron differentiation; oxidoreductase activity, acting on NAD(P)H; positive regulation of apoptotic process; regulation of apoptotic DNA fragmentation.
Problem: I need a way to text-mine the biological processes that are related to mitochondria (that is, the terms in which mitochondria are mentioned). Would regex be useful for this? Or are there other approaches that might work?
Expected result: what I want to get is the following:
O95831 mitochondrial respiratory chain complex I assembly
Your help is appreciated
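A regex is enough for the example above. Here is a minimal sketch in Python, assuming each record is one line with the ID first and the terms separated by semicolons, as in your example. The function name `mito_terms` and the pattern `mitochondri` (which catches "mitochondria", "mitochondrial", "mitochondrion") are my own choices, not anything UniProt provides.

```python
import re

# Assumption: any term containing "mitochondri" counts as mitochondria-related.
MITO = re.compile(r"mitochondri", re.IGNORECASE)

def mito_terms(go_field):
    """Split a semicolon-separated GO field and keep the mitochondria-related terms."""
    # Split on ";" only, because some terms contain commas.
    terms = [t.strip().rstrip(".") for t in go_field.split(";")]
    return [t for t in terms if t and MITO.search(t)]

record = (
    "O95831 activation of cysteine-type endopeptidase activity involved in "
    "apoptotic process; apoptotic DNA fragmentation; apoptotic process; "
    "cell redox homeostasis; chromosome condensation; DNA catabolic process; "
    "intrinsic apoptotic signaling pathway in response to endoplasmic reticulum "
    "stress; mitochondrial respiratory chain complex I assembly; NAD(P)H oxidase "
    "activity; neuron apoptotic process; neuron differentiation; oxidoreductase "
    "activity, acting on NAD(P)H; positive regulation of apoptotic process; "
    "regulation of apoptotic DNA fragmentation."
)

# The ID is the first whitespace-delimited token; the rest is the GO field.
uniprot_id, go_field = record.split(maxsplit=1)
print(uniprot_id, "; ".join(mito_terms(go_field)))
# prints: O95831 mitochondrial respiratory chain complex I assembly
```

Note that a plain text match only finds terms that literally mention mitochondria. A term that is mitochondrial by meaning but not by name would be missed, so treat this as a first pass.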
This should be helpful! Since my goal is to annotate ~1000 genes, do you think it would be possible to query more than one UniProt ID at a time to find the corresponding annotations? Or should this be done programmatically?
Yes, you can do large-scale analysis. Actually, the query I linked in my previous comment will give you all the human genes that are annotated as mitochondrial by GO (>7,000 records) from Swiss-Prot and TrEMBL. You need to modify the query and output fields to narrow the results to exactly what you want. Then you parse the results, which are provided in several formats, and select the UniProt IDs you are interested in.
Or, you can upload your list of UniProt IDs to MitoMiner and save it as a list. Then you can use that list as a query.
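For the parsing step, something like the following might do as a rough sketch. It assumes you downloaded the results as a tab-separated file with one column of IDs and one column of semicolon-separated GO terms. The column names `Entry` and `Gene Ontology (biological process)` are assumptions here, so check them against the header of your own file. The function name `mito_annotations` is made up for this example, and the second row of the sample is a placeholder, not a real record.

```python
import csv
import io
import re

MITO = re.compile(r"mitochondri", re.IGNORECASE)

def mito_annotations(tsv_handle, wanted_ids,
                     id_col="Entry",
                     go_col="Gene Ontology (biological process)"):
    """Return {id: [mitochondria-related terms]} for the IDs of interest."""
    hits = {}
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        if row[id_col] not in wanted_ids:
            continue  # skip records that are not in your list
        terms = [t.strip().rstrip(".") for t in row[go_col].split(";")]
        matched = [t for t in terms if t and MITO.search(t)]
        if matched:
            hits[row[id_col]] = matched
    return hits

# Small in-memory stand-in for a downloaded file (second row is a placeholder).
sample = io.StringIO(
    "Entry\tGene Ontology (biological process)\n"
    "O95831\tapoptotic process; mitochondrial respiratory chain complex I assembly; "
    "neuron differentiation\n"
    "PLACEHOLDER1\tapoptotic process; neuron differentiation\n"
)

result = mito_annotations(sample, {"O95831", "PLACEHOLDER1"})
for uid, terms in result.items():
    print(uid, "; ".join(terms))
```

With a real file you would pass `open("your_file.tsv", newline="")` instead of the `io.StringIO` object, and build `wanted_ids` from your list of ~1000 IDs.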
Thank you a lot ... Your comment is so valuable!
You are welcome. Glad to help.