Hi, I have a large list of human protein sequences (~1000) and am trying to get the Gene Ontology (GO) term names, accession numbers, and domains for each protein. I'm looking for help setting up a pipeline to generate these outputs. I'm comfortable with Python and Bash scripting.
Here's my current thinking:
- Run a BLAST search for each sequence, keep the best-scoring hit, and record the protein identifier of that hit.
- Use the protein identifiers to extract the associated GO terms for each sequence.
- Export a table (e.g. a CSV file) of protein sequences with their associated GO term names (e.g. hydrolase activity), accession numbers (e.g. GO:0016787), and domains (e.g. molecular_function).
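To make the first step concrete, here is a rough sketch of how I imagine picking the best hit per sequence. It assumes BLAST was run with tabular output (`-outfmt 6`, default columns) against a UniProt-style database whose subject IDs look like `sp|ACCESSION|NAME`. The function names and the demo rows are made up for illustration, and the IDs are not real accessions.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def extract_accession(sseqid):
    """Pull the accession out of an ID like 'sp|ACCESSION|NAME'.

    Assumption: UniProt-style FASTA headers. Any other ID is returned unchanged.
    """
    parts = sseqid.split("|")
    return parts[1] if len(parts) >= 3 else sseqid


def best_hits(blast_tsv):
    """Return {query_id: (accession, bitscore)}, keeping the highest bit score per query."""
    best = {}
    reader = csv.DictReader(blast_tsv, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        query = row["qseqid"]
        score = float(row["bitscore"])
        if query not in best or score > best[query][1]:
            best[query] = (extract_accession(row["sseqid"]), score)
    return best


# Made-up rows purely for illustration.
demo = io.StringIO(
    "seq1\tsp|ACC0001|DEMO1_HUMAN\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-100\t400\n"
    "seq1\tsp|ACC0002|DEMO2_HUMAN\t60.0\t180\t72\t3\t5\t184\t10\t189\t1e-40\t150\n"
    "seq2\tACC0003\t75.0\t150\t37\t1\t1\t150\t1\t150\t1e-60\t220.5\n"
)
print(best_hits(demo))
```

I picked bit score as the ranking criterion here, but I'm not sure whether e-value or percent identity would be a better choice for this purpose.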
My main questions are:
- How do I set up an efficient BLAST search to find the primary protein identifier for ~1000 sequences? I will likely need to download a BLAST database and run the search locally for efficiency.
- Which protein identifier do I need to capture from the BLAST results so that it can be used in the later steps?
- Which tool should I use to retrieve GO annotations from this protein identifier, and how can I do this efficiently? I saw that BioMart is a good option, but I'm uncertain which protein identifier to use. There are lots of identifier options available, but none seem to match the ones returned by a UniProt BLAST search.
- How do I consolidate the outputs into an easy-to-use format for downstream applications (e.g. a CSV file)?
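For reference, this is roughly the output layout I'm aiming for, one row per (sequence, GO term) pair. The sketch assumes I already have a mapping from each hit's accession to its GO IDs (however that ends up being retrieved) and a GO file in OBO format, whose `[Term]` stanzas carry `id`, `name`, and `namespace` fields. The function names, column names, and demo identifiers are my own placeholders.

```python
import csv
import io


def parse_obo_terms(obo_lines):
    """Map GO ID -> (name, namespace) using the [Term] stanzas of an OBO file."""
    terms = {}
    stanza = None  # dict while inside a [Term] stanza, otherwise None
    for raw in obo_lines:
        line = raw.strip()
        if line.startswith("["):
            stanza = {} if line == "[Term]" else None
        elif stanza is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key in ("id", "name", "namespace"):
                stanza[key] = value
            if len(stanza) == 3:
                terms[stanza["id"]] = (stanza["name"], stanza["namespace"])
    return terms


def write_go_table(hits, annotations, terms, out):
    """Write one CSV row per (query sequence, GO term) pair.

    hits:        {query_id: accession}
    annotations: {accession: [GO IDs]}  (hypothetical input, source still to be decided)
    terms:       {GO ID: (name, namespace)} from parse_obo_terms
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["query_id", "accession", "go_id", "go_name", "go_domain"])
    for query, accession in hits.items():
        for go_id in annotations.get(accession, []):
            name, namespace = terms.get(go_id, ("", ""))
            writer.writerow([query, accession, go_id, name, namespace])


# Tiny demo: the GO term is the example from my plan, the sequence and accession IDs are made up.
obo = io.StringIO(
    "format-version: 1.2\n\n"
    "[Term]\n"
    "id: GO:0016787\n"
    "name: hydrolase activity\n"
    "namespace: molecular_function\n"
)
out = io.StringIO()
write_go_table({"seq1": "ACC0001"}, {"ACC0001": ["GO:0016787"]}, parse_obo_terms(obo), out)
print(out.getvalue())
```

A long format like this seemed easiest to filter and group later, but I'm open to a different layout if it works better for downstream use.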
Thanks in advance for your help. The end goal is to produce a meta-learning dataset for Gene Ontology.