I am trying to understand CEGMA's report file, .completeness_report.
I am focusing on the column #Prots (Prots = number of 248 ultra-conserved CEGs present in genome). For example, I obtained 237 (in the partial row), which means that 237 of the 248 ultra-conserved CEGs are predicted in my genome.
In the other output files the number of proteins is much larger because they contain all of the KOGs (not 237 but 458 proteins). I am only interested in the ultra-conserved CEGs, so I filtered all of my files (the 248 reference IDs come from the file completeness_cutoff.tbl in cegma/data). I expected to generate an output with 237 IDs, but I obtained 234 proteins! How is that possible?
My second question is: what is the .number suffix (e.g. KOG0002.2) after the KOG IDs in the CEGMA output?
I had a lot of trouble understanding CEGMA's output and, as in your case, reading the documentation and various pages has not helped. From the completeness report, I wanted to find out exactly which KOGs were found in my dataset and where they were found. However, the number of KOGs in the output files did not match the completeness report. I ended up giving up, but if someone knows a little more about this I am still interested.
Question: What is the .number (e.g. KOG0002.2) after the KOG IDs in the CEGMA output?
Answer: The number after the KOG ID (here .2) indicates the second region considered by BLAST for KOG0002. You can find a detailed description in the CEGMA FAQ [http://korflab.ucdavis.edu/Datasets/cegma/faq.html#link6], question 5, "What do the numerical suffix on KOG IDs represent?"
However, for the first question: could you please describe how you filtered?
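To make the suffix concrete, here is a minimal Python sketch that separates a suffixed ID such as KOG0002.2 into the KOG ID and the region number. The function name `split_kog_id` and the ID pattern (the letters KOG, digits, then an optional `.N`) are assumptions made for illustration and are not part of CEGMA.

```python
import re

# Assumed pattern: "KOG" + digits, optionally followed by ".<number>".
KOG_ID = re.compile(r"^(KOG\d+)(?:\.(\d+))?$")


def split_kog_id(identifier):
    """Return (base_id, region) for an ID like 'KOG0002.2'.

    region is None when the ID carries no numerical suffix.
    """
    match = KOG_ID.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a KOG identifier: {identifier!r}")
    base, region = match.groups()
    return base, int(region) if region is not None else None


print(split_kog_id("KOG0002.2"))
```

Comparing IDs on the base part only avoids treating KOG0002.1 and KOG0002.2 as two different KOGs.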
In my completeness report it is written that 237 partial proteins are predicted in my genome. But in all of my output I have 438 proteins (wc -l myFile.cegma.id). So I took the 248 reference IDs in cegma/data/completeness_cutoff.tbl and kept only those in my output. I was surprised because I did not find 237 proteins in common between my output and completeness_cutoff.tbl, as predicted, but 234.
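For anyone who wants to reproduce this kind of filtering, the comparison can be sketched in Python as below. It is only a sketch under assumptions: it takes the first KOG ID found on each line of both files, skips lines starting with `#`, and ignores any `.N` suffix. The actual layout of completeness_cutoff.tbl and of the .cegma.id file should be checked before relying on it, and the function names are made up for this example.

```python
import re

# Assumed: each relevant line contains a KOG ID such as KOG0002 or KOG0002.2.
KOG_PATTERN = re.compile(r"KOG\d+")


def base_kog_ids(lines):
    """Collect the unique base KOG IDs (suffix dropped) from an iterable of lines."""
    ids = set()
    for line in lines:
        if line.lstrip().startswith("#"):
            continue  # skip comment/header lines (assumed to start with '#')
        match = KOG_PATTERN.search(line)
        if match:
            ids.add(match.group(0))
    return ids


def compare_to_reference(reference_lines, output_lines):
    """Return (shared, missing): reference KOGs found / not found in the output."""
    reference = base_kog_ids(reference_lines)
    found = base_kog_ids(output_lines)
    return sorted(reference & found), sorted(reference - found)


# Hypothetical usage with the files mentioned in this thread:
# with open("cegma/data/completeness_cutoff.tbl") as ref, open("myFile.cegma.id") as out:
#     shared, missing = compare_to_reference(ref, out)
# print(len(shared), "reference KOGs found;", len(missing), "not found")
```

Listing the `missing` IDs shows which reference KOGs are absent from the output file, which is more informative than comparing the two totals.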
There are two sets of core eukaryotic genes (CEGs). The larger set (458 CEGs) is designed to help train a gene finder in novel genomes. All of the CEGMA output except the completeness report file refers to this larger set of core genes.
A subset of the 458 CEGs can be used to assess the completeness of the gene-space of your target genome. These 248 CEGs are taken from the larger set but CEGMA uses slightly different filtering criteria to determine whether these are present. So it is possible for CEGMA to report a CEG being present in the set of 458 CEGs but NOT in the subset of 248 CEGs.
Your original question refers to partial CEGs. These are candidate core genes that exceed a score threshold but do not exceed the length threshold needed to be considered 'complete' (this is a somewhat arbitrary threshold... is 95% of a gene complete... how about 85%?). Your genome may contain many partial core genes, none of which are complete and so none of which will be present in the other CEGMA output files.
Written 9.6 years ago by keith; updated 2.4 years ago by Ram.
Is there any way to get the sequences of the 248 CEGs reported in the completeness report, both partial and full-length ones?
Not directly from CEGMA. You can turn on verbose mode to get extra output from the last step, which would give you the information needed for a custom script that goes back and extracts the sequences.