There are two ways of finding the core genome in a set of bacterial genomes.
Suppose we have five genomes (in practice I need this for around 30).
Method 1:
BLAST genome 1 against genome 2.
Then BLAST the resulting core set against the next genome, i.e. genome 3.
Then BLAST the reduced core set against the next genome, i.e. genome 4, and so on.
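To make the bookkeeping of Method 1 concrete, here is a minimal Python sketch. It assumes the core set is always carried forward as genome 1 gene IDs and that each BLAST round has already been reduced to "the set of genome 1 genes with an acceptable hit". The function name `iterative_core` and the toy gene IDs are made up for illustration; no BLAST is actually run here.

```python
def iterative_core(reference_genes, hits_per_genome):
    """Method 1 sketch: shrink the core set one genome at a time.

    reference_genes: iterable of gene IDs from genome 1.
    hits_per_genome: list of sets; each set holds the genome 1 gene IDs
        that had an acceptable BLAST hit in one of the other genomes.
    """
    core = set(reference_genes)
    for hits in hits_per_genome:
        # In a real run, only the genes still in `core` would be used as
        # queries against the next genome, so later searches get smaller.
        core &= hits
    return core


# Toy data (hypothetical): genome 1 has four genes.
genome1 = ["g1", "g2", "g3", "g4"]
hits = [
    {"g1", "g2", "g3"},        # genome 1 vs genome 2
    {"g1", "g3", "g4"},        # core set vs genome 3
    {"g1", "g2", "g3", "g4"},  # core set vs genome 4
]
core = iterative_core(genome1, hits)
print(sorted(core))
```

Under these assumptions the only thing that changes from round to round is the number of query sequences, which is where any time saving would come from.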
Method 2:
BLAST genome 1 against each of genomes 2, 3, 4 and 5.
Then find which genes are common to the outputs 1-2, 1-3, 1-4 and 1-5 (using a GI ID comparison program in Perl that I already have).
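The comparison step of Method 2 amounts to a set intersection over the query IDs of each pairwise result. A rough Python sketch follows, assuming the BLAST results were saved in the standard 12-column tabular format (`-outfmt 6`: query ID in column 1, percent identity in column 3, e-value in column 11). The thresholds, function names and toy rows are assumptions chosen for illustration, not recommended values.

```python
import io


def queries_with_hits(handle, min_identity=70.0, max_evalue=1e-5):
    """Return the query IDs that have at least one hit passing the cutoffs."""
    found = set()
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        qseqid, pident, evalue = fields[0], float(fields[2]), float(fields[10])
        if pident >= min_identity and evalue <= max_evalue:
            found.add(qseqid)
    return found


def core_from_pairwise(handles, **thresholds):
    """Intersect the passing query IDs across all pairwise BLAST tables."""
    hit_sets = [queries_with_hits(h, **thresholds) for h in handles]
    return set.intersection(*hit_sets) if hit_sets else set()


# Toy tables (hypothetical rows) standing in for the 1-2 and 1-3 outputs.
table_1_vs_2 = (
    "geneA\tx1\t95.0\t300\t15\t0\t1\t300\t1\t300\t1e-50\t500\n"
    "geneB\tx2\t88.0\t280\t30\t1\t1\t280\t5\t284\t1e-30\t350\n"
    "geneC\tx3\t55.0\t100\t45\t2\t1\t100\t1\t100\t1e-3\t40\n"
)
table_1_vs_3 = (
    "geneA\ty1\t91.0\t300\t27\t0\t1\t300\t1\t300\t1e-40\t450\n"
    "geneC\ty2\t90.0\t290\t29\t0\t1\t290\t1\t290\t1e-45\t470\n"
)
core = core_from_pairwise([io.StringIO(table_1_vs_2), io.StringIO(table_1_vs_3)])
print(sorted(core))
```

With real data the `io.StringIO` objects would be replaced by open file handles, one per pairwise output.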
*Which of the above methods is likely to be more accurate and time-saving when done for 30 genomes?*
+1 for CD-HIT, with the caveat that I know CD-HIT mainly as a tool for clustering sequencing reads into operational taxonomic units. If we think in terms of predicted genes (ORFs in this case? I'm not sure what a "core set" means in this example: genes? syntenic regions?), you could cluster the "hits" into those commonly shared and those unique to each genome. I'm not sure how long that would take computationally, as I've only used CD-HIT on amplicon data of a single gene family.
I guess I would just take all predicted ORFs in FASTA format, concatenate them, run the combined file through CD-HIT, then parse the output and link the members of each cluster back to their organism.
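A rough sketch of that last parsing step in Python, in case it helps. It reads the `.clstr` text that CD-HIT writes alongside its output and keeps the clusters that contain at least one member from every genome. It assumes each sequence ID was prefixed with its genome name before concatenation (a hypothetical `genome|gene` convention); the function names and toy clusters are made up for illustration.

```python
import io
import re
from collections import defaultdict

# Member lines look like: "0\t300aa, >gA|gene1... *"
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(handle):
    """Map each cluster name to the list of sequence IDs it contains."""
    clusters = defaultdict(list)
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
        elif line and current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


def core_clusters(clusters, genomes):
    """Keep clusters with at least one member from every genome."""
    wanted = set(genomes)
    core = {}
    for name, members in clusters.items():
        # Assumed ID convention: "<genome>|<gene>".
        present = {member.split("|", 1)[0] for member in members}
        if present >= wanted:
            core[name] = members
    return core


# Toy .clstr content (hypothetical) for three genomes gA, gB and gC.
clstr_text = (
    ">Cluster 0\n"
    "0\t300aa, >gA|gene1... *\n"
    "1\t298aa, >gB|gene7... at 96.00%\n"
    "2\t301aa, >gC|gene3... at 94.50%\n"
    ">Cluster 1\n"
    "0\t150aa, >gA|gene2... *\n"
    "1\t150aa, >gB|gene9... at 99.00%\n"
    ">Cluster 2\n"
    "0\t210aa, >gC|gene5... *\n"
)
clusters = parse_clstr(io.StringIO(clstr_text))
core = core_clusters(clusters, ["gA", "gB", "gC"])
print(core)
```

Note that this simple test also keeps clusters in which one genome contributes several members (possible paralogs), so those may need separate handling.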
+1 for CD-HIT....