Dear All,
I would like to ask you about the CEGMA output report. For example, what is the range of Average? I sometimes get 1.66 or 1.22. Which values matter for assessing a genome assembly or the genome itself?
Thanks
You should be looking at the output.completeness_report file for interpreting CEGMA results. A sample output is pasted below:
#    Statistics of the completeness of the genome based on 248 CEGs    #

            #Prots  %Completeness  -  #Total  Average  %Ortho
Complete       217          87.50  -     308     1.42   30.41
  Group 1       55          83.33  -      75     1.36   25.45
  Group 2       51          91.07  -      63     1.24   19.61
  Group 3       51          83.61  -      80     1.57   41.18
  Group 4       60          92.31  -      90     1.50   35.00
Partial        243          97.98  -     427     1.76   49.38
  Group 1       64          96.97  -      99     1.55   35.94
  Group 2       56         100.00  -      89     1.59   41.07
  Group 3       60          98.36  -     118     1.97   65.00
  Group 4       63          96.92  -     121     1.92   55.56

# These results are based on the set of genes selected by Genis Parra #
# Key: #
# Prots = number of 248 ultra-conserved CEGs present in genome #
# %Completeness = percentage of 248 ultra-conserved CEGs present #
# Total = total number of CEGs present including putative orthologs #
# Average = average number of orthologs per CEG #
# %Ortho = percentage of detected CEGS that have more than 1 ortholog #
Here, there are 217 complete and 26 partial (i.e., 243 - 217 = 26) core eukaryotic genes (out of 248 genes in total) present in your assembly. The groups just categorize the core genes based on functional annotation (I guess). Normally, you don't have to worry about %Ortho or Average (i.e., the number of orthologs per gene). It might matter if your genome is polyploid or something like that.
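To make the arithmetic above concrete, here is a minimal Python sketch that pulls the Complete and Partial summary rows out of the report text. The function name and dictionary keys are my own (hypothetical), not part of CEGMA, and the parser assumes the whitespace-separated layout shown in the sample. It also shows that, at least for the sample numbers, Average is consistent with #Total divided by #Prots, which would mean it cannot drop below 1.

```python
def parse_completeness_report(text):
    """Return the Complete and Partial summary rows of a CEGMA report as dicts."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Summary rows look like: Complete 217 87.50 - 308 1.42 30.41
        # (Group rows have 8 fields and start with "Group", so they are skipped.)
        if len(fields) == 7 and fields[0] in ("Complete", "Partial"):
            rows[fields[0]] = {
                "prots": int(fields[1]),
                "completeness": float(fields[2]),
                "total": int(fields[4]),
                "average": float(fields[5]),
                "ortho": float(fields[6]),
            }
    return rows


# The two summary rows from the sample report above
sample = """\
Complete 217 87.50 - 308 1.42 30.41
Partial 243 97.98 - 427 1.76 49.38
"""

rows = parse_completeness_report(sample)
complete = rows["Complete"]["prots"]
partial_only = rows["Partial"]["prots"] - complete
print(complete, "complete,", partial_only, "partial only")

# In the sample, Average matches #Total / #Prots (308 / 217, rounded to 2 places)
recomputed_average = round(rows["Complete"]["total"] / complete, 2)
print(recomputed_average, rows["Complete"]["average"])
```

With a real run you would read output.completeness_report into a string and pass that in instead of the sample text.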
Another key number you might need is the number of sequences in the output.cegma.dna file (just do grep -c ">" output.cegma.dna). This number will tell you how many of the total CEGMA genes (the larger set of 458 genes) are present. This set includes the 248 set as well. In my case it was 453. So, all my report needs is:
243 out of 248, 443 out of 458, CEGMA genes were predicted in the genome
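If you would rather do the count from a script than with grep, a rough Python equivalent is sketched below. The function name is hypothetical and the record names in the demo are made up; the only assumption about output.cegma.dna is that it is ordinary FASTA with one ">" header line per sequence.

```python
import io


def count_fasta_records(handle):
    """Count sequences by counting header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


# Tiny in-memory stand-in for output.cegma.dna (made-up records)
demo = io.StringIO(">gene_A\nATGC\n>gene_B\nGGCC\nTTAA\n")
n = count_fasta_records(demo)
print(f"{n} sequences")

# With a real file:
# with open("output.cegma.dna") as fh:
#     print(count_fasta_records(fh))
```

Note that grep -c ">" counts every line containing ">", while this only counts lines that begin with it; for a well-formed FASTA file the two should agree.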
I hope this helps.
Is there any ideal cut-off completeness value for a transcriptome? Can we combine the complete and partial detected genes and report them together? I have a CEGMA result for a transcriptome, but I don't know whether the following result is acceptable. I would be very grateful if you could comment on my problem.
COMPLETENESS ASSESSMENT RESULTS:
Total number of core genes queried 248
Number of core genes detected
Complete 187 (75.40%)
Complete + Partial 235 (94.76%)
Number of missing core genes 13 (5.24%)
Average number of orthologs per core genes 3.13
% of detected core genes that have more than 1 ortholog 94.12
Regards, Rahul
Thanks for your suggestion. I have run BUSCO and got the following result. Is there any cut-off completeness value for BUSCO?
Completeness Assessment Results:
Total # of core genes queried: 429
# of core genes detected
Complete: 223 (51.98%)
Complete + Partial: 327 (76.22%)
# of missing core genes: 102 (23.78%)
Average # of orthologs per core genes: 1.78
% of detected core genes that have more than 1 ortholog: 69.06
Regards, Rahul
Please give more details about your transcriptome. Did you assemble RNA reads, and if so, how? I mean, did you sequence the RNA yourself, or download SRA data and then assemble it? Which k-mer value did you use for the assembly, and which tool (Trinity, etc.)? Which BUSCO command and which BUSCO version (v2?) did you use? Which database did you use in the BUSCO command? Eukaryote? Also, did you check for contamination (bacteria, host, etc.) using a tool (Kraken, etc.)? Then we can discuss this in more detail. Sorry for asking so many questions. In my opinion, you need > 90% completeness to be able to say that the transcriptome is fine for downstream analyses.
Actually, I downloaded paired-end RNA-seq reads (SRA349650) for the assembly. The reads were cleaned with Trimmomatic: adapter cleaning, Q20, max read length 30 bp. They were assembled with Trinity using group pairs distance 500 bp, path reinforcement 50 bp, and min length 200 bp. The assembled sequences were then run through CAP3 (overlap 40 bp, 90% identity) and used for CEGMA and BUSCO (BUSCO eukaryotes) (https://gvolante.riken.jp/index.html).
Please see http://www.acgt.me/blog/2014/9/15/understanding-cegma-output-complete-vs-partial.
I hope this helps.
Thank you all so much. The core genes are separated into four groups according to their degree of conservation.
There is no good or bad answer to this. The average number of orthologs per predicted CEG might be expected to be much higher in polyploid genomes that have undergone several whole genome duplications. However, it might also be higher because the genome assembly has fully resolved heterozygous regions into two contigs. That is, if gene X is sufficiently different in the two parental genomes (assuming a diploid organism), then a genome assembler might assemble it into two separate sequences. This will artificially inflate many of the CEGMA output statistics.
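As a toy illustration of that inflation (made-up numbers, not real CEGMA output), the sketch below computes Total, Average and %Ortho from a list of copy counts, following the definitions in the report key. It compares an assembly where every detected CEG is found once with one where 60 of 200 CEGs have been split into two haplotype copies. The function name and both scenarios are hypothetical.

```python
def cegma_stats(copies_per_ceg):
    """copies_per_ceg: predicted copy count for each detected CEG (each >= 1).

    Returns (total, average, pct_ortho) following the report key:
    total = all copies, average = copies per detected CEG,
    pct_ortho = % of detected CEGs with more than one copy.
    """
    prots = len(copies_per_ceg)
    total = sum(copies_per_ceg)
    average = total / prots
    pct_ortho = 100 * sum(1 for c in copies_per_ceg if c > 1) / prots
    return total, average, pct_ortho


# Hypothetical: 200 detected CEGs, each found exactly once
collapsed = [1] * 200
# Same hypothetical genome, but 60 CEGs assembled as two haplotype copies
split = [2] * 60 + [1] * 140

print(cegma_stats(collapsed))
print(cegma_stats(split))
```

In the second case Average rises from 1.0 to 1.3 and %Ortho from 0 to 30 even though the gene content is identical, which is why these two columns say more about ploidy or assembly redundancy than about completeness.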
The CEGMA statistics are most useful when you can do one of two things: