Can we do comparative genomics using draft genomes from shotgun sequencing? Since some genes or proteins may go uncharacterized because they fall in the gaps between contigs, will the analysis still be "valid"? Do you think we will get much less information from contigs or scaffolds than from a complete bacterial genome?
If contigs or scaffolds are still OK, what kind of analyses can we perform?
To answer your question directly: yes, you can do comparative genome studies with draft genomes. I would argue that all genomes are draft genomes. Some are more complete than others, but no sequenced "genome" is an exact representation of what exists in a cell. There's a recent discussion of this topic at Titus Brown's blog. The bottom line is that we, as genomicists, should expect a certain margin of error in any genome assembly built from sequencing data.
All genome assemblies carry some level of uncertainty and can be considered drafts, with some in better shape than others, so "valid" is a vague concept for comparative genomics. I don't quite understand what you mean by "shotgun sequencing compared to the whole genome for bacteria". There are lots of different methods for generating sequencing data, and perhaps some are better than others, but if you are able to assemble reads of any length into larger contigs or scaffolds with confidence, then you can start to run comparative analyses on them. There are lots of methods to compare sequences; one could argue that sequence comparison is at the core of bioinformatics.
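As a rough way to gauge how fragmented a draft is before comparing it to others, here is a minimal Python sketch that computes contig count, total length, and N50 from FASTA text. The function names are my own and not from any particular toolkit, and N50 is only one of several possible indicators of assembly contiguity.

```python
def contig_lengths_from_fasta(lines):
    """Return the length of each record in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(contig_lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together cover at least half of the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


def assembly_summary(contig_lengths):
    """Collect a few basic contiguity statistics in one dictionary."""
    return {
        "contigs": len(contig_lengths),
        "total_bp": sum(contig_lengths),
        "longest": max(contig_lengths),
        "n50": n50(contig_lengths),
    }
```

For a real assembly you would pass an open file handle, for example `assembly_summary(contig_lengths_from_fasta(open("draft.fasta")))`, where `draft.fasta` is a placeholder file name. A draft with many short contigs and a low N50 relative to typical gene length is more likely to have genes split or lost at contig ends.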
Lastly, you did not give us any context for what you want to do in comparing sequences. I read your post as purely comparative. If you are interested in metagenomics, then you will compare short reads (typically not assembled contigs or scaffolds) to a database. Glimmer-MG, as Larry mentions, is a tool to do this, but it does not take assembled reads. It uses an HMM algorithm to quickly "match" short reads to a database. You can do the same thing with BLAST or BLAT. As Glimmer-MG hasn't been maintained in many years, I would suggest other methods of matching short reads against a database if metagenomics is your goal. If you are strictly interested in comparative genomics, there are lots of tools discussed here on this forum, but you'll have to be a little more specific about your research questions for us to help you.
Thank you for your reply. I'm mostly interested in taking contigs or scaffolds of known bacteria and comparing them to those of other bacteria, so it's not metagenomics. I used OrthoMCL to check which proteins are present in or absent from which bacteria, but then I became unsure whether that is really a good idea, since proteins missing because of gaps are not counted. I believe I will need to complement that analysis with something else.
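One way a presence/absence analysis is sometimes made more tolerant of assembly gaps is to relax the definition of "core", so that a cluster missing from one or two draft genomes is not immediately treated as accessory. The Python sketch below illustrates the idea only. The input layout (a dict mapping a cluster ID to the set of genomes with a member), the function name, and the 0.95 threshold are all assumptions for illustration, not OrthoMCL's output format or a recommended cutoff.

```python
def classify_clusters(clusters, genomes, soft_core_fraction=0.95):
    """Label each ortholog cluster by how many genomes contain it.

    clusters: dict of cluster_id -> iterable of genome names (hypothetical layout)
    genomes: iterable of all genome names in the comparison
    soft_core_fraction: assumed cutoff below 1.0 that tolerates a few
        absences, which in drafts may reflect gaps rather than true loss
    """
    genome_set = set(genomes)
    if not genome_set:
        raise ValueError("no genomes given")
    n = len(genome_set)
    labels = {}
    for cluster_id, members in clusters.items():
        present = set(members) & genome_set
        if len(present) == n:
            labels[cluster_id] = "core"
        elif len(present) / n >= soft_core_fraction:
            labels[cluster_id] = "soft-core"
        elif len(present) == 1:
            labels[cluster_id] = "unique"
        else:
            labels[cluster_id] = "accessory"
    return labels
```

Clusters labeled "soft-core" are the ones worth checking by hand, for example by searching the raw contigs of the genome where the protein appears to be absent, since a gene truncated at a contig end may simply have been missed by gene prediction.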
A tool that you may find helpful in this regard is Glimmer-MG. It is a metagenomics gene prediction system capable of "significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences to model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error."