Question

Find GO terms from list of protein sequences

2.5 years ago

tpritsky • 0

Hi, I have a large list of human protein sequences (~1000) and am trying to get the gene ontology (GO) term names, accession numbers, and domains as a list for each protein. I'm looking for help setting up a pipeline to generate these outputs. I'm comfortable with Python and Bash scripting.

Here's my current thinking:

1. Use BLAST alignment, capture the best hit for each sequence, and find the associated protein identifier.
2. Use the protein identifiers to extract the associated GO terms for each sequence.
3. Export a table (e.g. CSV) of protein sequences with their associated GO term names (e.g. hydrolase activity), accession numbers (e.g. GO:0016787), and domains (e.g. molecular_function).
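
The first step could be sketched roughly as follows. This assumes BLAST+ was run locally with tabular output, for example `blastp -query proteins.fasta -db uniprot_human -outfmt 6 -evalue 1e-5 -num_threads 8 -out hits.tsv` (the file and database names are placeholders), and that subject IDs keep the UniProt header form `sp|P05067|A4_HUMAN`. The function name `best_hits` and the sample rows are made up for illustration.

```python
import csv
import io

def best_hits(blast_tsv):
    """Return {query_id: (uniprot_accession, bitscore)}, keeping the top-bitscore hit per query.

    Assumes the 12 default columns of BLAST+ `-outfmt 6`:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        # UniProt-style IDs look like "sp|P05067|A4_HUMAN"; keep the middle field.
        parts = subject.split("|")
        accession = parts[1] if len(parts) >= 3 else subject
        if query not in best or bitscore > best[query][1]:
            best[query] = (accession, bitscore)
    return best

# Hypothetical example rows (not real BLAST output).
example = io.StringIO(
    "seq1\tsp|P05067|A4_HUMAN\t99.0\t770\t1\t0\t1\t770\t1\t770\t0.0\t1500\n"
    "seq1\tsp|Q06481|APLP2_HUMAN\t52.0\t700\t300\t10\t1\t700\t1\t700\t1e-150\t600\n"
    "seq2\tsp|P04637|P53_HUMAN\t100.0\t393\t0\t0\t1\t393\t1\t393\t0.0\t800\n"
)
hits = best_hits(example)
print(hits)
```

Keeping the UniProt accession (the middle field) as the identifier means the later GO lookup can use UniProt-keyed annotation files directly, without an ID conversion step.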

My main questions are:

1. How do I set up an efficient BLAST search to find the primary protein identifier for ~1000 sequences (I will likely need to download a BLAST database locally for efficiency)?
2. Which protein identifier should I take from the BLAST results so that it can be used in later steps?
3. What tool do I use to extract GO terms from this protein identifier, and how can I do this efficiently? I saw that BioMart is a good option but am uncertain which protein identifier to use: lots of identifier options are available, but none seem to match the ones returned by a UniProt BLAST search.
4. How do I consolidate the outputs into an easy-to-use format for downstream applications (e.g. a CSV file)?
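
For the last two questions, one possible route that avoids BioMart is to join UniProt accessions against a GO annotation file in GAF format (such as the human GOA file) and to take term names and namespaces from the ontology file `go-basic.obo`. The sketch below is only an illustration under those assumptions: it relies on the GAF 2.x column order (column 2 = accession, column 4 = qualifier, column 5 = GO ID), uses inline toy data instead of the real files, and all function names and the `ACC1`..`ACC3` accessions are hypothetical.

```python
import csv
import io
from collections import defaultdict

def parse_obo(lines):
    """Map GO ID -> (name, namespace) from the [Term] stanzas of an OBO file."""
    terms = {}
    current = None  # dict for the [Term] stanza being read, else None
    for line in lines:
        line = line.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key in ("id", "name", "namespace"):
                current[key] = value
                if len(current) == 3:
                    terms[current["id"]] = (current["name"], current["namespace"])
    return terms

def parse_gaf(lines):
    """Map accession -> set of GO IDs, skipping comments and negated (NOT) annotations."""
    annotations = defaultdict(set)
    for line in lines:
        if line.startswith("!"):
            continue  # GAF header/comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5 or "NOT" in cols[3].split("|"):
            continue
        annotations[cols[1]].add(cols[4])
    return annotations

def write_table(accessions, annotations, terms, out):
    """Write one CSV row per (accession, GO term) pair."""
    writer = csv.writer(out)
    writer.writerow(["accession", "go_id", "go_name", "namespace"])
    for acc in accessions:
        for go_id in sorted(annotations.get(acc, ())):
            name, namespace = terms.get(go_id, ("", ""))
            writer.writerow([acc, go_id, name, namespace])

# Toy stand-ins for go-basic.obo and a GAF file (accessions are fake).
obo_text = """format-version: 1.2

[Term]
id: GO:0016787
name: hydrolase activity
namespace: molecular_function

[Term]
id: GO:0005515
name: protein binding
namespace: molecular_function
"""
gaf_text = (
    "!gaf-version: 2.2\n"
    "UniProtKB\tACC1\tGENE1\tenables\tGO:0016787\tPMID:0\tIDA\t\tF\n"
    "UniProtKB\tACC1\tGENE1\tNOT|enables\tGO:0005515\tPMID:0\tIPI\t\tF\n"
    "UniProtKB\tACC2\tGENE2\tenables\tGO:0005515\tPMID:0\tIPI\t\tF\n"
)

terms = parse_obo(io.StringIO(obo_text))
annotations = parse_gaf(io.StringIO(gaf_text))
out = io.StringIO()
write_table(["ACC1", "ACC2", "ACC3"], annotations, terms, out)
print(out.getvalue())
```

With real data, the `io.StringIO` objects would be replaced by open file handles, and the accession list would come from the BLAST best hits.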

Thanks again for your help. The end goal is to produce a meta-learning dataset for gene ontology.

protein ontology uniprot gene entrez • 1.9k views

score 0 · Answer 1 · 2022-11-14

I did something similar some time ago for the entire protein sequence set of a species using InterProScan. I downloaded the tool locally and ran it with a simple bash script (most of which was about parallelising the runs on an HPC cluster). From my limited understanding, InterProScan does precisely what you are describing: it takes a protein sequence as input, identifies conserved domains by matching it against signature databases (similar in spirit to a BLAST search), and reports the GO terms associated with each matched domain. Most hits come with a score or e-value, so you can filter to your preference.
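
For reference, a run along these lines might look like `interproscan.sh -i proteins.fasta -f tsv -goterms -o proteins.tsv` (file names are placeholders; check the flags against the installed version). The TSV output could then be condensed into GO terms per protein as sketched below. This is a sketch, not a tested pipeline: it assumes the documented column order, where column 1 is the protein ID, column 9 the score (an e-value for many analyses) and column 14 the GO terms, and the threshold, function name, and sample rows are invented.

```python
import io
import re
from collections import defaultdict

GO_ID = re.compile(r"GO:\d{7}")

def go_terms_per_protein(tsv_lines, max_evalue=1e-5):
    """Collect GO IDs per protein from InterProScan TSV rows, dropping weak hits."""
    result = defaultdict(set)
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 14:
            continue  # rows without the InterPro/GO columns
        try:
            if float(cols[8]) > max_evalue:
                continue  # hit is weaker than the chosen threshold
        except ValueError:
            pass  # some analyses report "-" instead of a number; keep those rows
        # The regex also copes with suffixed forms such as "GO:0005515(InterPro)".
        result[cols[0]].update(GO_ID.findall(cols[13]))
    return {protein: sorted(ids) for protein, ids in result.items() if ids}

# Invented sample rows in the assumed 15-column layout.
sample_rows = [
    ["prot1", "md5", "300", "Pfam", "PF00069", "Protein kinase domain", "10", "250",
     "1.2E-50", "T", "01-01-2022", "IPR000719", "Protein kinase domain",
     "GO:0004672|GO:0005524|GO:0006468", "-"],
    ["prot1", "md5", "300", "Pfam", "PF00000", "weak hit", "260", "290",
     "0.5", "T", "01-01-2022", "IPR000000", "weak hit", "GO:0005515", "-"],
    ["prot2", "md5", "120", "Coils", "Coil", "Coil", "5", "40",
     "-", "T", "01-01-2022", "IPR000001", "example", "GO:0005515(InterPro)", "-"],
    ["prot3", "md5", "90", "Pfam", "PF00001", "no GO", "1", "80",
     "1e-20", "T", "01-01-2022", "-", "-", "-", "-"],
]
sample = io.StringIO("".join("\t".join(r) + "\n" for r in sample_rows))
go_by_protein = go_terms_per_protein(sample)
print(go_by_protein)
```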

score 0 · Answer 2 · 2022-11-15

2.5 years ago

Leite ★ 1.3k

Dear tpritsky,

You can try the STRING database, using the "Multiple Proteins by Sequences" search mode:

https://string-db.org/cgi/input?sessionId=bgaq8Mi1BaWA&input_page_active_form=multiple_sequences

With this you can build a network and also find the associated biological processes.

Best, Leite
