7.8 years ago
keepclam
I just started working on two already-annotated transcriptomes; they have been annotated with BLASTP, InterProScan, HMMER and GO terms. I keep running into problems retrieving functional information (e.g., retrieving all the endonucleases) from the annotation results, and I always end up doing a keyword search in the terminal with grep and similar tools, which gives me only partial results.
Is there any smarter and more biologically correct way to browse annotation results?
What format are the annotations in? Genbank/GFF or just text without format?
No, simple tab-separated text. One line of InterPro output looks like this:
while one line of HMMER output looks like this:
When you say you want to "browse" what are you expecting out of that? Do you need a summary of all different types of domains identified? Are you interested in knowing how many loci have no identifiable function?
Potentially you could use awk to cut columns out of these files, followed by some sort of sorting to classify the results.
I'd like to retrieve all proteins belonging to a given group of interest. Let's suppose I want to retrieve all nucleases. If I grep "nuclease", I automatically exclude from my results all those nucleases that don't have "nuclease" in their annotation. Is there any way to get around this problem? Note that InterProScan and HMMER give the database IDs of their results (Pfam for HMMER and various databases for InterProScan).
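Since the tools already report database IDs, one option is to match on those IDs instead of on free-text keywords. Below is a minimal Python sketch of that idea, not a definitive implementation. The function name `proteins_with_ids`, the toy rows, and the column positions are all assumptions made for illustration, because the real column layout of the files is not shown in this thread; adjust `id_columns` and `protein_column` to your own output. The accessions in the example are only illustrative, so look up the ones that actually define your group of interest.

```python
import csv
import io


def proteins_with_ids(handle, wanted_ids, id_columns, protein_column=0):
    """Return the set of protein names whose row carries any wanted ID.

    handle         -- open tab-separated annotation file
    wanted_ids     -- accessions to look for (e.g. Pfam or GO IDs)
    id_columns     -- 0-based columns that hold accessions (assumed layout)
    protein_column -- 0-based column holding the protein name (assumed)
    """
    wanted = set(wanted_ids)
    hits = set()
    # QUOTE_NONE: treat quote characters in descriptions as plain text.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        for col in id_columns:
            if col >= len(row):
                continue
            # A cell may hold several IDs separated by "|" or ",".
            for part in row[col].replace(",", "|").split("|"):
                base = part.split("(")[0].strip()  # drop a "(source)" suffix
                unversioned = base.split(".")[0]   # "PF01844.18" -> "PF01844"
                if base in wanted or unversioned in wanted:
                    hits.add(row[protein_column])
    return hits


# Toy data in a made-up layout: name, database, accession, description, GO IDs.
demo = (
    "prot1\tPfam\tPF01844.18\tHNH endonuclease\tGO:0004519|GO:0003676\n"
    "prot2\tPfam\tPF00069.25\tProtein kinase domain\tGO:0004672\n"
    "prot3\tPfam\t-\tuncharacterized protein\tGO:0004518\n"
)
wanted = {"PF01844", "GO:0004518", "GO:0004519"}

print(sorted(proteins_with_ids(io.StringIO(demo), wanted, id_columns=(2, 4))))
```

In the toy data, prot3 has no "nuclease" keyword in its description, so a grep would miss it, but it is still picked up through its GO ID. Note that this only matches the exact IDs you list: more specific child terms in the GO hierarchy have to be added to `wanted` yourself (or expanded with an ontology tool).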