Dear colleagues,
I've done an extensive search on Google as well as the forums on Seqanswers and Biostars and don't believe this question has been covered (in this depth) before.
I've used WebMGA (http://weizhong-lab.ucsd.edu/metagenomic-analysis/server/kog/) for functional annotation (KOG) of my transcriptome (FASTA) file. However, I need some help interpreting/integrating these results.
Briefly, I did a de novo assembly and annotated it using dammit. I used this as input for WebMGA and it outputs a number of files. With the help of a previous Biostars post (which links to this stackoverflow post) I was able to use one of the output files (output.2.class: counts by class) to make a histogram of KOG classes from my assembly.
However, I would also like to append the KOG results (i think from output.2: long table of rpsblast hits?) onto my transcriptome. So in addition to gene names that were extracted from UniRef90 (and appended) to my transcriptome from dammit I want KOG information as well.
I understand I will likely need to write a custom script to this; however, I have very little scripting experience. For example, I was fairly easily able to make changes to the stackoverflow R code because I've done most of my programming using R. There must be a simple way to do this on the linux command line? If someone has experience attaching functional annotations (KOG or even GO) to assemblies I would really appreciate your support and tutelage. Similarly, if a code exists for a similar (but not identical task) I would be willing to try and cannibalizing/re-purpose the script.
Regards,
A frustrated but motivated student
P.S. To show you that I have been thinking about how to go about this let me describe what I want (albeit in pseudo-code)
Script should:
1) Look at FASTA file and KOG output (in the first column $1) and when sequences match
2) extract subset of data from KOG output (description column $11) with awk
and append it to the end of that header in the FASTA file
Thank you very much for the quick response Philipp!
Unfortunately, to date my experience has been in the shell and R programming. So I'm currently unfamiliar with Python. Although shell is very easy and quick, once the pattern gets complex it appears you should use Python rather than bashing your head against the wall to come up with a solution using pure shell.
Biopython looks like an amazing resource for bioinformatics, so thank you for the suggestion - I will be taking this opportunity to learn it.
I tried to get my feet-wet by opening the python interpreter on AWS EC2 Ubuntu 14 and running this script (both interactively and as a .py script). There were no errors (I'm using Python2) but the output didn't have the KOG terms attached to the FASTA.
I'm not sure where the problem lies. However, one potential future issue I see is
assert gene_name not in kogs_dict # sanity check: we should have only one kog result per gene, I guess?
From a cursory view of my KOG results I noticed that there may be some genes with multiple KOG results (see Transcript_506 in KOG file)
I've attached test-files for anyone willing to tackle this: 1) Fasta: http://www.filedropper.com/fasta 2) KOG: http://www.filedropper.com/kog
I see! I slightly changed the code - I now use a defaultdict, which by default stores an empty list for any possible key, so I can just append to that.
I run it like this:
Code is:
Prints stuff like:
If there are several KOG terms I separated them using a semicolon, so we have two terms here.