Dear Biostars, hi (I'm not a native English speaker, so be ready for some possible language flaws).
The COG histogram (regardless of how useful it is) is shown in many papers (e.g. here, from this).
To draw this histogram, I used the NCBI COG data link and downloaded the "prot2003-2014.fa.gz" file.
Then I uncompressed it and, using makeblastdb, converted it into a BLAST-searchable protein database.
Then I ran blastx with my de novo transcriptome assembly FASTA file as -query.
But the blastx result, two lines of which I show at the end of this post, was not something that could be used (easily) to create such an eye-catching histogram!
Please review my strategy for COG annotation of my transcriptome and, if it is correct, guide me on converting the BLAST output into a COG histogram (if there is any fast standalone software for this, I would appreciate that, too!).
Below are two example lines from my blastx output against the COG protein database:
TRINITY_DN212758_c0_g1_i1....... gi|383763379|ref|YP_005442361.1| 52.830 106 49 1 3 320 415 519 2.82e-23 97.4
TRINITY_DN212791_c0_g1_i1........gi|111021329|ref|YP_704301.1| 81.081 74 14 0 2 223 53 126 3.40e-24 98.6
You could code it: you need to parse a few files to generate the data for the histogram. Map the GI IDs from the BLAST output to cog2003-2014.csv to get the COG ID (e.g. COG0001). Map this COG ID in cognames2003-2014.tab to get the major COG class. Calculate the frequency of each class for the plot.
Thank you Prasad, but unfortunately I am not very good at coding and have no idea about the "mapping" procedure you have kindly mentioned.
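For anyone in the same position, the mapping steps described above could be sketched roughly like this in Python. This is only a sketch under some assumptions that are not stated in the thread: the BLAST output is the 12-column tabular format (whitespace or tab separated) with the subject written as gi|NUMBER|..., the GI number is in the third column and the COG ID in the seventh column of cog2003-2014.csv, and cognames2003-2014.tab has the COG ID and the class letter(s) in its first two tab-separated columns. Please check these against the README that comes with the COG files. The function names are made up for illustration, and keeping only the best hit per transcript (by bit score) is my own choice, not something required by COG.

```python
import csv
from collections import Counter


def best_hits(blast_lines):
    """Return {query: GI of its highest-bitscore hit} from 12-column tabular BLAST lines."""
    best = {}
    for line in blast_lines:
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        parts = subject.split("|")
        if len(parts) < 2 or parts[0] != "gi":
            continue  # assumption: subject IDs look like gi|383763379|ref|...|
        if query not in best or bitscore > best[query][1]:
            best[query] = (parts[1], bitscore)
    return {query: gi for query, (gi, _) in best.items()}


def load_gi_to_cogs(csv_lines):
    """Return {GI: set of COG IDs}; assumes GI in column 3 and COG ID in column 7."""
    gi_to_cogs = {}
    for row in csv.reader(csv_lines):
        if len(row) < 7:
            continue
        gi_to_cogs.setdefault(row[2].strip(), set()).add(row[6].strip())
    return gi_to_cogs


def load_cog_classes(tab_lines):
    """Return {COG ID: class letters}; assumes columns COG ID, class, name."""
    cog_to_class = {}
    for line in tab_lines:
        if line.startswith("#"):
            continue  # header/comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            cog_to_class[fields[0]] = fields[1]
    return cog_to_class


def count_classes(query_to_gi, gi_to_cogs, cog_to_class):
    """Count each class letter at most once per transcript."""
    counts = Counter()
    for gi in query_to_gi.values():
        letters = set()
        for cog_id in gi_to_cogs.get(gi, ()):
            # A COG may carry several class letters (e.g. "KT"); count each one.
            letters.update(cog_to_class.get(cog_id, ""))
        counts.update(letters)
    return counts


def summarize(blast_path, csv_path, names_path):
    """Read the three files and return the per-class counts."""
    with open(blast_path) as handle:
        hits = best_hits(handle)
    with open(csv_path, newline="") as handle:
        gi_to_cogs = load_gi_to_cogs(handle)
    # latin-1 is an assumption; it avoids decode errors on non-ASCII names
    with open(names_path, encoding="latin-1") as handle:
        cog_to_class = load_cog_classes(handle)
    return count_classes(hits, gi_to_cogs, cog_to_class)
```

Calling summarize("blastx.out", "cog2003-2014.csv", "cognames2003-2014.tab") (the first file name is a placeholder for your own BLAST output) should give a Counter of class letters, and sorted(counts.items()) can then be handed to any plotting tool or spreadsheet to draw the bars.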
As you mentioned, predict the longest ORF for each transcript; you can then use the protein sequences, which is much easier.
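To give an idea of what "predict the longest ORF" means, here is a small pure-Python sketch. It assumes the standard genetic code, searches all six reading frames, and takes an ORF to run from the first ATG after a stop codon to the next stop codon (or to the end of the transcript, so partial ORFs at the 3' end are kept). The names translate and longest_orf are made up for this example; a dedicated ORF prediction tool will do a more careful job on a real assembly.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def longest_orf(nucleotides):
    """Return the longest M-initiated protein found in any of the six frames."""
    seq = nucleotides.upper()
    reverse_complement = seq.translate(COMPLEMENT)[::-1]
    best = ""
    for strand in (seq, reverse_complement):
        for frame in range(3):
            protein = translate(strand[frame:])
            # Each piece between stop codons is a candidate open reading frame.
            for segment in protein.split("*"):
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best):
                    best = segment[start:]
    return best
```

The returned protein sequences could then be written to a FASTA file and searched against the COG protein database directly.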