Question

How To Extract The Core Genes From The OrthoMCL Output File?

6

11.5 years ago

Lisa ▴ 330

Hi. I was wondering if anybody could help me figure out how to use OrthoMCL to identify the core genome of a set of E. coli genomes. I have 52 E. coli genomes that I ran through OrthoMCL to produce ortholog groups. I followed all the steps in the user guide through to the end. Now I'm left with a massive file of ortholog groups, and I'm unsure how to proceed.

This is a snippet from the middle of my output file; the head command gives too much output because the first group is my biggest ortholog group. The part before the colon is the ortholog group ID, and the entries after it are the genome|gene pairs that were clustered into that group.

ecoli6370: col125|YP_006311412.1 col139|YP_007556103.1 col23|YP_001729413.1 col3|NP_286258.1 col4|NP_308598.1 col53|YP_002998320.1 col55|YP_003043686.1 col56|YP_003053130.1 col7|YP_488800.1 col73|YP_003498239.1 col92|YP_006127895.1
ecoli6371: col125|YP_006312035.1 col127|YP_006770839.1 col131|YP_006779890.1 col134|YP_006785029.1 col3|NP_286985.1 col31|YP_002271784.1 col4|NP_309246.1 col45|YP_002397150.1 col57|YP_003079099.1 col59|YP_003222735.1 col64|YP_003233659.1
ecoli6372: col125|YP_006312040.1 col127|YP_006770834.1 col131|YP_006779885.1 col134|YP_006785024.1 col3|NP_286990.1 col31|YP_002271776.1 col4|NP_309251.1 col45|YP_002397155.1 col57|YP_003079092.1 col59|YP_003222730.1 col64|YP_003233664.1

I tried converting this file to a binary matrix, following the instructions from here (http://smokeandumami.com/2010/01/21/gene-accumulation-curves-in-r/), but I'm still stuck on how to proceed.

Thanks, I appreciate any help you can give me. Please let me know if I should provide any more information.

Lisa

Sorry for the delay, here's an example of what my binary matrix looks like. I just took a few lines as it's so large.

"ecoli1000" "ecoli1001" "ecoli1002" "ecoli1003" "ecoli1004" "ecoli1005" 
"col0"   1   1   0   0   1   0
"col1"   0   1   0   0   0   1
"col2"   0   0   1   0   1   1
"col3"   0   1   0   0   0   0
"col4"   1   0   0   1   1   1
"col5"   1   0   0   1   0   0

orthomcl • 8.0k views

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 11.5 years ago by Lisa ▴ 330

0

Could you show us the binary matrix? I believe it'll be easier to explain it from that.

ADD REPLY • link 11.4 years ago by sentausa ▴ 650

score 4 · Answer 1 · 2014-03-07

4

11.4 years ago

sentausa ▴ 650

Anyway, I'll try to explain it without the binary matrix.

Since you want to find the core genes, all you have to do is find the ortholog groups in the OrthoMCL results that contain all 52 strains. If a strain has no gene/protein in an ortholog group, that gene/protein is absent from the strain and is therefore not part of the core genome, since a species' core genome is defined as the set of genes present in all strains of the species.
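For anyone who prefers to work directly from the groups file rather than a matrix, here is a minimal Python sketch of that idea. It assumes each line looks like the snippet in the question (`group_id: genome|gene genome|gene ...`). The function names and the toy identifiers (`grp1`, `gA`, ...) are made up for illustration, and the optional `single_copy` filter is an extra assumption on my part for cases where paralogs should be excluded.

```python
from collections import Counter

def parse_groups(lines):
    """Parse lines of the form 'group_id: genome|gene genome|gene ...'."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        group_id, _, members = line.partition(":")
        groups[group_id.strip()] = members.split()
    return groups

def core_groups(groups, genomes=None, single_copy=False):
    """Return IDs of groups that contain at least one gene from every genome.

    If `genomes` is None, the genome set is inferred from the file itself,
    which only works if every genome appears in at least one group.
    With single_copy=True, each genome must contribute exactly one gene.
    """
    if genomes is None:
        genomes = {m.split("|", 1)[0] for ms in groups.values() for m in ms}
    core = []
    for group_id, members in groups.items():
        # Count how many genes each genome contributes to this group.
        counts = Counter(m.split("|", 1)[0] for m in members)
        if any(counts[g] == 0 for g in genomes):
            continue  # at least one genome is missing: not core
        if single_copy and any(counts[g] != 1 for g in genomes):
            continue  # present everywhere, but with paralogs
        core.append(group_id)
    return core

# Toy input (hypothetical IDs, three genomes instead of 52).
toy_lines = [
    "grp1: gA|a1 gB|b1 gC|c1",
    "grp2: gA|a2 gB|b2",
    "grp3: gA|a3 gA|a4 gB|b3 gC|c3",
]
toy_groups = parse_groups(toy_lines)
print(core_groups(toy_groups))
print(core_groups(toy_groups, single_copy=True))
```

With a real file you would pass an open file handle to `parse_groups`, and ideally an explicit list of all 52 genome names to `core_groups`, so that a genome missing from every group cannot go unnoticed.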

So, in a binary matrix like the one shown on the blog, you'd be interested only in the columns that contain no 0.
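As a rough sketch of that column filter in Python: this assumes the matrix is laid out like the excerpt in the question, i.e. a header line of quoted ortholog group names, followed by one row per genome that starts with the quoted genome name. The function name `core_columns` and the toy matrix are hypothetical, chosen only to illustrate the idea.

```python
def core_columns(matrix_text):
    """Return the column (ortholog group) names that contain no 0.

    Expects a header of group names, then rows of
    '<genome name> <value> <value> ...' separated by whitespace.
    """
    rows = [line.split() for line in matrix_text.splitlines() if line.strip()]
    header = [name.strip('"') for name in rows[0]]
    data = rows[1:]
    core = []
    for j, group in enumerate(header):
        # Data rows carry the genome name first, so values start at index 1.
        if all(int(row[j + 1]) > 0 for row in data):
            core.append(group)
    return core

# Toy matrix (made-up names): only grpA is present in every genome.
toy_matrix = '''
"grpA" "grpB" "grpC"
"g1" 1 1 0
"g2" 1 0 1
"g3" 1 1 1
'''
print(core_columns(toy_matrix))
```

The same check applies unchanged to the full 52-row matrix: a group is core only if its column is non-zero in every row.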

ADD COMMENT • link 11.4 years ago by sentausa ▴ 650

0

Thanks, that makes a bit more sense. It seems really simple when you put it like that, so I think I was just having a temporary brain melt or something.

ADD REPLY • link 11.4 years ago by Lisa ▴ 330

score 1 · Answer 2 · 2019-07-06

1

6.1 years ago

Dattatray Mongad ▴ 390

Use this script: parseOrthoMCLOutput.py. It generates FASTA files for the core, accessory, and unique genes.

ADD COMMENT • link 6.1 years ago by Dattatray Mongad ▴ 390

score 0 · Answer 3 · 2014-09-07

0

10.9 years ago

amanjain • 0

I have a very simple way to find core gene clusters using Excel. Let me know if anyone needs help.

If anyone needs help with Venn diagrams, try http://bioinformatics.psb.ugent.be/webtools/Venn/; it will do the work in seconds.

ADD COMMENT • link updated 3.6 years ago by Ram 45k • written 10.9 years ago by amanjain • 0

0

Hi, I need help with this very simple way to find core gene clusters using Excel. Could you explain how it works?

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.8 years ago by marcelokuchar • 0