Question

How to draw a COG histogram?

1

8.5 years ago

Farbod ★ 3.4k

Dear Biostars, hi (I am not a native English speaker, so please excuse any language flaws).

COG histograms (whether or not they are useful) are shown in many papers (e.g. here, from this).

To draw this histogram I used the NCBI COG data link and downloaded the "prot2003-2014.fa.gz" file.

Then I uncompressed it and, using makeblastdb, turned it into a BLAST-searchable protein database.

Then I ran blastx with my de novo transcriptome assembly FASTA file as -query.

However, the blastx output (two lines of which I show at the end of this post) was not something that could (easily) be used to create such an eye-catching histogram.

Please review my strategy for COG annotation of my transcriptome and, if it is correct, guide me on converting the BLAST output into a COG histogram (if there is any related fast standalone software, I would appreciate that too!).

Below are two example lines from my blastx output against the COG protein database:

TRINITY_DN212758_c0_g1_i1....... gi|383763379|ref|YP_005442361.1| 52.830 106 49 1 3 320 415 519 2.82e-23 97.4

TRINITY_DN212791_c0_g1_i1........gi|111021329|ref|YP_704301.1| 81.081 74 14 0 2 223 53 126 3.40e-24 98.6

cog clusters of orthologous groups histogram blast • 4.4k views

ADD COMMENT • link updated 8.5 years ago by GenoMax 149k • written 8.5 years ago by Farbod ★ 3.4k

3

You could write a short script: you need to parse a few files to generate the data for the histogram. Map the GI IDs from the BLAST output to cog2003-2014.csv to get the COG ID (e.g. COG0001). Look up this COG ID in cognames2003-2014.tab to get the major COG class. Then calculate the frequency of each class for the plot.
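The mapping steps above can be sketched in Python roughly as follows. This is only a minimal sketch under stated assumptions, not a tested pipeline: it assumes the BLAST output is tab-separated (-outfmt 6), that cog2003-2014.csv holds the protein GI in its third column and the COG ID in its seventh, and that cognames2003-2014.tab has the columns COG ID, class letters, name. Please check those layouts against the readme that ships with the COG files. All function names are made up for illustration.

```python
import csv
from collections import Counter


def gi_from_subject(subject_id):
    """Extract the GI number from an ID such as 'gi|383763379|ref|YP_005442361.1|'."""
    parts = subject_id.split("|")
    if len(parts) >= 2 and parts[0] == "gi":
        return parts[1]
    return None


def best_hits(blast_lines):
    """Keep the highest-bitscore subject per query from outfmt 6 lines."""
    best = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def load_gi_to_cog(csv_lines):
    """Map GI -> COG ID. ASSUMPTION: GI is column 3 and COG ID is column 7."""
    gi_to_cog = {}
    for row in csv.reader(csv_lines):
        if len(row) >= 7:
            # A protein may appear in several COGs; this sketch keeps the first.
            gi_to_cog.setdefault(row[2].strip(), row[6].strip())
    return gi_to_cog


def load_cog_to_class(tab_lines):
    """Map COG ID -> class letters. ASSUMPTION: columns are COG ID, class, name."""
    cog_to_class = {}
    for line in tab_lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            cog_to_class[fields[0]] = fields[1]
    return cog_to_class


def count_cog_classes(blast_lines, csv_lines, tab_lines):
    """Count how many transcripts fall into each COG functional class."""
    gi_to_cog = load_gi_to_cog(csv_lines)
    cog_to_class = load_cog_to_class(tab_lines)
    counts = Counter()
    for subject in best_hits(blast_lines).values():
        cog = gi_to_cog.get(gi_from_subject(subject))
        # A COG can carry several class letters (e.g. "EH"); count each one.
        for letter in cog_to_class.get(cog, ""):
            counts[letter] += 1
    return counts


# Possible usage (file names as mentioned in this thread):
# with open("blastx.outfmt6") as b, open("cog2003-2014.csv") as c, \
#         open("cognames2003-2014.tab") as n:
#     print(count_cog_classes(b, c, n))
```

The resulting counts (one number per class letter) are what the histogram bars would show.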

ADD REPLY • link 8.5 years ago by Prasad ★ 1.6k

1

Thank you Prasad, but unfortunately I am not very good at coding and have no idea how to do the "map"ping procedure you kindly mentioned.

ADD REPLY • link 8.5 years ago by Farbod ★ 3.4k

0

As you mentioned, predict the longest ORF for each transcript; you can then use the protein sequences, which is much easier.

ADD REPLY • link 8.5 years ago by Prasad ★ 1.6k

score 1 · Answer 1 · 2016-09-17

1

8.5 years ago

Whoknows ▴ 960

Hi Farbod,

You could do it with WebMGA, or with EggNOG if you would like a newer version of KOG/COG for your data. For these databases you need the protein sequences of your blastx hits, i.e. the known protein sequences.

Hope it helps.

ADD COMMENT • link 8.5 years ago by Whoknows ▴ 960

1

Dear Whoknows, hi and thank you.

As you said, both of them need protein sequences (not IDs), which I do not have right now. Of course, I have run TransDecoder on my assembly FASTA file, and it has translated the longest ORF of each gene into a protein sequence, but I do not know whether that is useful or not.

On the other hand, running blastx with the transcriptome FASTA file against the COG proteins is an easy task (if it is the right approach), but assigning the COG names to the BLAST results and drawing the related histogram is a little difficult for me.

ADD REPLY • link 8.5 years ago by Farbod ★ 3.4k

1

You can blast them against the EggNOG database using the nucleotide sequences. The easiest way is:

1. Run blastx with the transcriptome sequences against the Swiss-Prot proteins (download from UniProt)
2. Get the sequences of the blastx output subject IDs from UniProt
3. Put that file into WebMGA

The procedure for working with the EggNOG database is almost the same.

ADD REPLY • link 8.5 years ago by Whoknows ▴ 960

1

Dear Whoknows, I really appreciate the time you are spending on answering me.

Would you please help me a little more with step (2)?

I guess this is the step where I can collect the protein sequences related to my transcripts, but I do not know how to "Get blastx output subject IDs' sequences from UniProt".

Imagine that this is my BLAST command (is it good?):

blastx -query Trinity.fasta -db uniprot_sprot.fasta -out blastx.outfmt6 \
    -evalue 1e-6 -num_threads 20 -max_target_seqs 1 -outfmt 6

and these are the first two lines of the BLAST output:

TRINITY_DN212758_c0_g1_i1 sp|P28723|FTHS_SPIOL 84.112 107 17 0 1 107 421 527 3.15e-59 194

TRINITY_DN212713_c0_g1_i1 sp|Q86Y33|CD20B_HUMAN 62.963 216 80 0 1 216 301 516 1.50e-96 294

What must I do then?

ADD REPLY • link 8.5 years ago by Farbod ★ 3.4k

2

Just look up the IDs (P28723, Q86Y33, etc.) in UniProt and get the protein sequences. Use these sequences as input.

ADD REPLY • link 8.5 years ago by Prasad ★ 1.6k

1

Hi, thanks a lot!

So, I think I must first create a text file containing one ID (e.g. P28723, Q86Y33) per line.

Then, is there any way to automatically collect from UniProt the protein sequences for the IDs in this text file?
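One way to do both steps without the website could look like the sketch below. It assumes the hits come from the same uniprot_sprot.fasta file that was used to build the BLAST database, so the sequences can be pulled from that local file instead of being downloaded again. It also assumes tab-separated BLAST output and Swiss-Prot style IDs (sp|ACCESSION|NAME); the function names are only illustrative.

```python
def accession_from_subject(subject_id):
    """Return 'P28723' from an ID such as 'sp|P28723|FTHS_SPIOL'."""
    parts = subject_id.split("|")
    return parts[1] if len(parts) >= 3 else subject_id


def hit_accessions(blast_lines):
    """Collect unique subject accessions, in first-seen order, from outfmt 6 lines."""
    seen = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            seen.setdefault(accession_from_subject(fields[1]), None)
    return list(seen)


def extract_fasta(fasta_lines, wanted):
    """Yield the lines of every FASTA record whose accession is in `wanted`."""
    wanted = set(wanted)
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            header_id = words[0] if words else ""
            keep = accession_from_subject(header_id) in wanted
        if keep:
            yield line


# Possible usage (file names as used in this thread; output names are made up):
# with open("blastx.outfmt6") as b:
#     ids = hit_accessions(b)
# with open("hit_ids.txt", "w") as out:
#     out.write("\n".join(ids) + "\n")          # one ID per line
# with open("uniprot_sprot.fasta") as f, open("hits.fasta", "w") as out:
#     out.writelines(extract_fasta(f, ids))
```

The one-ID-per-line text file is still written, so it can also be pasted into the UniProt mapping page if preferred.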

ADD REPLY • link 8.5 years ago by Farbod ★ 3.4k

1

Map the IDs here using the default settings, then download the FASTA sequences.

ADD REPLY • link 8.5 years ago by popayekid55 ▴ 110

0

I have entered the IDs in several arrangements, but it could not map them.

The default is from UniParc to UniProtKB.

ADD REPLY • link 8.5 years ago by Farbod ★ 3.4k

1

I have used the IDs in several arrangement

What do you mean? Put one ID per line and map.

ADD REPLY • link 8.5 years ago by Prasad ★ 1.6k

1

I entered the IDs in the "1-Provide your Identifier" section of that link, as below:

P28723

Q86Y33

Then I clicked "Go" in the "2-select options" section,

and the result was:

Sorry, no results were found.

But when I type this ID into the search bar at the top of the main site, it shows the result.

Maybe the error comes from the "from Uniparc to Uniprotkb" part?

ADD REPLY • link 8.5 years ago by Farbod ★ 3.4k

2

In the select options, set the mapping from UniProtKB AC/ID to UniProtKB. Here is the mapped data using those settings.

ADD REPLY • link 8.5 years ago by Prasad ★ 1.6k

1

Thank you, it works!

So then I must use the "Download" section, yes?

It offers two forms of FASTA output; which one is better for WebMGA (canonical, or canonical & isoform)?

ADD REPLY • link 8.5 years ago by Farbod ★ 3.4k

1

I have entered the results in Excel, but the histogram I linked above is very colorful.

Did they use R (ggplot2) for their COG histogram ?
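Which tool the paper used is not known from this thread. As one possible alternative to Excel or ggplot2, a colored bar chart can be drawn with Python and matplotlib, roughly as in this sketch. The input is assumed to be a dictionary of COG class letter to count; the function names, output file name and axis labels are made up for illustration.

```python
def prepare_bars(counts):
    """Return (labels, heights) sorted alphabetically by COG class letter."""
    labels = sorted(counts)
    return labels, [counts[label] for label in labels]


def plot_cog_histogram(counts, out_png="cog_histogram.png"):
    """Draw one colored bar per COG class and save the figure as a PNG."""
    # matplotlib is imported here so prepare_bars() works without it installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    labels, heights = prepare_bars(counts)
    cmap = plt.get_cmap("tab20")  # a qualitative palette with 20 colors
    colors = [cmap(i % 20) for i in range(len(labels))]

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(labels, heights, color=colors)
    ax.set_xlabel("COG functional class")
    ax.set_ylabel("Number of sequences")
    fig.tight_layout()
    fig.savefig(out_png, dpi=300)
    plt.close(fig)


# Possible usage with made-up numbers:
# plot_cog_histogram({"E": 120, "H": 45, "J": 210, "S": 300})
```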

ADD REPLY • link 8.5 years ago by Farbod ★ 3.4k

1

Dear Prasad, hi.

Is there any way to collect just the "fish"-related results from this mapping process, or even before that, when the Swiss-Prot blastx IDs are being collected?

ADD REPLY • link 8.5 years ago by Farbod ★ 3.4k

1

You can download the fish-related proteins from UniProt and proceed as you did before.

ADD REPLY • link 8.5 years ago by Prasad ★ 1.6k

1

Dear Prasad, hi.

1- Do you mean I can download a kind of Swiss-Prot database containing only the fish proteins? (Could you please help me with a link?)

2- Or do you mean that after running blastx with my transcriptome against the whole of Swiss-Prot, I can collect the so-called "fishy" results? (How?)

Thank you in advance.

ADD REPLY • link 8.5 years ago by Farbod ★ 3.4k

2

Select the taxonomy ID at the level you want the sequences for (e.g. here), and then download the corresponding data.
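For the other option raised above (filtering after blastx against the whole of Swiss-Prot), a rough sketch is to keep only hits whose Swiss-Prot entry name ends in a fish species code, e.g. the DANRE in NAME_DANRE. The set of codes below is a short illustrative list, not a complete list of fish, and the function names are made up. Note also that with -max_target_seqs 1 any transcript whose single reported hit is not a fish protein is simply dropped, so searching a fish-only database as described above may recover more hits.

```python
# Illustrative, NOT exhaustive: UniProt species codes for a few fish
# (zebrafish, medaka, fugu, Atlantic salmon, rainbow trout).
FISH_MNEMONICS = {"DANRE", "ORYLA", "TAKRU", "SALSA", "ONCMY"}


def species_mnemonic(subject_id):
    """Return 'HUMAN' from 'sp|Q86Y33|CD20B_HUMAN', or None if not in that form."""
    entry_name = subject_id.split("|")[-1]
    if "_" not in entry_name:
        return None
    return entry_name.rsplit("_", 1)[1]


def filter_hits_by_species(blast_lines, mnemonics=FISH_MNEMONICS):
    """Yield only the outfmt 6 lines whose subject belongs to the given species."""
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and species_mnemonic(fields[1]) in mnemonics:
            yield line


# Possible usage (output file name is made up):
# with open("blastx.outfmt6") as b, open("blastx.fish.outfmt6", "w") as out:
#     out.writelines(filter_hits_by_species(b))
```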

ADD REPLY • link 8.5 years ago by Prasad ★ 1.6k