Question

What is the best way to find the core genome of many bacteria?

12.0 years ago

Naren ▴ 1000

There are two ways of finding the core genome in bacteria.
Suppose we have five genomes (in practice I need this for around 30 genomes).

Method 1:
BLAST genome 1 against genome 2, then BLAST the resulting core set against the next genome, i.e. 3,
then BLAST that core set against the next genome, i.e. 4, and so on...
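
For illustration, the iterative filtering in Method 1 could be sketched as below. This is only a minimal sketch: `iterative_core`, `has_hit` and the toy hit table are hypothetical names and made-up data, and in a real run `has_hit` would wrap an actual BLAST search plus whatever cutoffs you choose.

```python
from typing import Callable, Iterable, List, Set


def iterative_core(reference_genes: Iterable[str],
                   other_genomes: List[str],
                   has_hit: Callable[[str, str], bool]) -> Set[str]:
    """Shrink the candidate core set one genome at a time (Method 1)."""
    core = set(reference_genes)
    for genome in other_genomes:
        # Only genes that survived all previous genomes are searched again.
        core = {gene for gene in core if has_hit(gene, genome)}
        if not core:
            break  # nothing left to search in the remaining genomes
    return core


# Toy stand-in for BLAST results (assumption): genes of genome 1 that
# have an acceptable hit in each of the other genomes.
toy_hits = {
    "genome2": {"geneA", "geneB", "geneC"},
    "genome3": {"geneA", "geneC"},
    "genome4": {"geneA", "geneC", "geneD"},
}


def toy_has_hit(gene: str, genome: str) -> bool:
    return gene in toy_hits[genome]


core = iterative_core(["geneA", "geneB", "geneC", "geneD"],
                      ["genome2", "genome3", "genome4"],
                      toy_has_hit)
print(sorted(core))  # ['geneA', 'geneC']
```

The query set can only shrink from one genome to the next, so later searches involve fewer sequences.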

Method 2:
BLAST genome 1 against each of genomes 2, 3, 4 and 5.
Then find which genes are common to the outputs 1-2, 1-3, 1-4 and 1-5 (using a GI-based comparison program in Perl that I already have).
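
A rough sketch of the comparison step in Method 2, written in Python rather than Perl. It assumes each pairwise search was saved in BLAST+ tabular format (`-outfmt 6`, with the query ID in column 1, percent identity in column 3 and E-value in column 11). The function names, the toy rows and the cutoffs (`max_evalue`, `min_identity`) are illustrative assumptions, not recommended values.

```python
from typing import Iterable, List, Set


def queries_with_hits(lines: Iterable[str],
                      max_evalue: float = 1e-5,
                      min_identity: float = 70.0) -> Set[str]:
    """Collect query IDs with at least one hit passing the cutoffs."""
    hits = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id = fields[0]
        identity = float(fields[2])
        evalue = float(fields[10])
        if evalue <= max_evalue and identity >= min_identity:
            hits.add(query_id)
    return hits


def core_from_pairwise(outputs: List[Iterable[str]]) -> Set[str]:
    """Intersect the genome 1 genes that have hits in every other genome."""
    hit_sets = [queries_with_hits(output) for output in outputs]
    return set.intersection(*hit_sets) if hit_sets else set()


# Toy tabular rows (made-up data) for genome 1 vs 2 and genome 1 vs 3.
out_1_vs_2 = [
    "geneA\tg2_001\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t550",
    "geneB\tg2_002\t91.0\t250\t22\t0\t1\t250\t1\t250\t1e-90\t400",
]
out_1_vs_3 = [
    "geneA\tg3_010\t97.0\t300\t9\t0\t1\t300\t1\t300\t1e-140\t530",
    "geneB\tg3_011\t45.0\t100\t55\t0\t1\t100\t1\t100\t0.5\t30",
]

core = core_from_pairwise([out_1_vs_2, out_1_vs_3])
print(sorted(core))  # ['geneA']
```

With real files you would pass open file handles instead of the toy lists, one per pairwise output.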

*Which of the above methods is likely to be more accurate and less time-consuming when run on 30 genomes?*

genome • 3.5k views

ADD COMMENT • link updated 12.0 years ago by Neilfws 49k • written 12.0 years ago by Naren ▴ 1000

score 3 · Answer 1 · 2012-11-26

12.0 years ago

Neilfws 49k

Can I suggest - again, as for your previous question - that CD-HIT might be a good tool for this task? A quick Google search for "core genome" + CD-HIT suggests that I'm not the first to have this idea.

ADD COMMENT • link 12.0 years ago by Neilfws 49k

+1 for CD-HIT, with the caveat that I know CD-HIT as a tool for clustering sequencing reads into operational taxonomic units. But if we think in terms of predicted genes (ORFs in this case? I'm not sure what a "core set" means in this example: genes? syntenic regions?), you could cluster the "hits" into those commonly shared and those unique to each genome. I'm not sure how long that would take computationally, as I've only used CD-HIT on amplicon data from a single gene family.

ADD REPLY • link 12.0 years ago by Josh Herr 5.8k

I guess I would just take all predicted ORFs in FASTA format, concatenate them, run the combined file through CD-HIT, then parse the output and link the members of each cluster back to their organism.
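
The parsing step might look something like the sketch below. It assumes the usual CD-HIT `.clstr` layout (a `>Cluster N` header followed by member lines containing `>name...`) and, as a purely hypothetical naming convention, that every sequence ID was prefixed with its genome name and a `|` before concatenation. The function names and toy cluster lines are invented for illustration. As far as I know, CD-HIT truncates sequence names in the `.clstr` file by default, so check that the prefix survives in your output.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Member lines look like: "0\t300aa, >genome1|gene0001... *"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each cluster name to the set of genomes with a member in it."""
    clusters = defaultdict(set)
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
        elif line and current is not None:
            match = MEMBER_RE.search(line)
            if match:
                # Assumed ID convention: "<genome>|<gene>"
                genome = match.group(1).split("|", 1)[0]
                clusters[current].add(genome)
    return dict(clusters)


def core_clusters(clusters: Dict[str, Set[str]],
                  all_genomes: Set[str]) -> List[str]:
    """Return clusters that contain at least one member from every genome."""
    return sorted(name for name, genomes in clusters.items()
                  if genomes >= all_genomes)


# Toy .clstr content (made-up data).
toy_clstr = [
    ">Cluster 0",
    "0\t300aa, >genome1|gene0001... *",
    "1\t298aa, >genome2|gene0417... at 95.30%",
    "2\t301aa, >genome3|gene0090... at 93.10%",
    ">Cluster 1",
    "0\t150aa, >genome1|gene0002... *",
    "1\t150aa, >genome3|gene0500... at 99.00%",
]

clusters = parse_clstr(toy_clstr)
core = core_clusters(clusters, {"genome1", "genome2", "genome3"})
print(core)  # ['Cluster 0']
```

Clusters with several members from the same genome (paralogs) still count as core here; whether that is what you want depends on your definition of a core gene.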

ADD REPLY • link 12.0 years ago by Neilfws 49k

+1 for CD-HIT....

ADD REPLY • link 11.6 years ago by Naren ▴ 1000

score 2 · Answer 2 · 2012-11-26

I don't see any difference between the two proposed methods in the resulting list of core genes, because in both cases you keep the search space constant, and the search space is what determines the score and E-value in BLAST. Of course, others may disagree. In any case, Method 1 will be faster, because each iteration removes elements from the set you search against the next genome.