What is the easiest way to take 2 genomes (bacterial proteins) and compare all genes from one vs all genes from another one?
I know this question has been asked here a million times, but I still don't get why most advice is about aligning sequences first. I don't need to align genes; I just want a quick statistical summary of how many genes are 100% similar between the two genomes, how many are 90% similar, and so on for all genes.
So far I've tried proteinortho, mafft, oma, cd-hit, get_homologues and some others, but still no luck.
What alternative metric do you propose for this comparison, especially since you don't want to do alignments (which would be needed to get % sequence similarity)?
OK, I actually wasn't sure about that, thank you for clarifying! I am quite new to this, so I'm learning as I go.
So my best option is to align all vs all, and pick a metric (like the number of BLAST identities divided by the length of the alignment)?
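For illustration, the metric mentioned here (identities divided by alignment length) could be sketched roughly like this. It assumes the two sequences are already aligned, are the same length, and use `-` for gaps; the function name `percent_identity` is made up for this example and is not part of BLAST or any other tool.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity = identical columns / alignment length * 100.

    Assumes both strings come from the same pairwise alignment,
    so they have equal length and use '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Ignore columns where both sequences have a gap (they carry no information).
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        return 0.0

    # A column counts as an identity only if both residues match and are not gaps.
    identities = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * identities / len(columns)


if __name__ == "__main__":
    # Toy alignment: 5 identical columns out of 6.
    print(round(percent_identity("MKT-LV", "MKTALV"), 1))
```

Note that other denominators are possible (for example the length of the shorter sequence), and they can give noticeably different percentages, so it is worth stating which one you used.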
What exactly are you trying to achieve?
Doing an all vs all comparison is feasible, but then you would have to be judicious about where to set the similarity cut-offs and which BLAST/BLAT parameters to use. Besides all the common domains, you will find orthologs/paralogs that give significant hits.
If you are more interested in finding out how these two genomes relate to each other (i.e. not at the gene level but at the genome level), then you may want to look at the Mauve genome alignment tool. It is designed for this type of analysis.
I am trying to get a numerical distribution, or a chart like this
(I don't need an actual graph - only the distribution)
I'll look into Mauve, thanks
So the idea would be to do an all vs all comparison maximizing the Query/Hit coverage (and perhaps only choosing the top hit, so you don't get bogged down with smaller domains etc).
An absolute answer would need lots of careful analysis but if you only want a gross overview this may work.
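As a rough sketch of the top-hit idea above: if the all vs all search is run with BLAST tabular output, the hits can be filtered by coverage, reduced to the best hit per query, and binned by percent identity. Everything specific here is an assumption for illustration: the column order (as would be produced by something like `-outfmt "6 qseqid sseqid pident length qlen slen bitscore"`), the 0.8 coverage cut-off, the 10% bin width, and the helper names `best_hits` and `identity_histogram`.

```python
import csv
import io
from collections import Counter

# Assumed column layout of the tab-separated hit table (not the BLAST default).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qlen", "slen", "bitscore"]


def best_hits(handle, min_coverage=0.8):
    """Return {query_id: percent_identity} for the top-scoring hit per query.

    Coverage is approximated as alignment length / longer sequence length.
    Alignment length includes gaps, so this is only a rough filter.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(COLUMNS, row))
        coverage = int(hit["length"]) / max(int(hit["qlen"]), int(hit["slen"]))
        if coverage < min_coverage:
            continue  # skip short, domain-only matches
        score = float(hit["bitscore"])
        query = hit["qseqid"]
        if query not in best or score > best[query][1]:
            best[query] = (float(hit["pident"]), score)
    return {query: pident for query, (pident, _) in best.items()}


def identity_histogram(identities, bin_width=10):
    """Count identities per bin, keyed by the lower edge of each bin.

    A value of exactly 100% is folded into the top bin (e.g. 90-100).
    """
    counts = Counter()
    for pident in identities:
        low = min(int(pident // bin_width) * bin_width, 100 - bin_width)
        counts[low] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up hits, purely to show the input shape.
    demo = io.StringIO(
        "g1\tp1\t100.0\t200\t200\t200\t400\n"
        "g1\tp2\t45.0\t190\t200\t210\t120\n"
        "g2\tp3\t92.5\t150\t160\t155\t280\n"
        "g3\tp4\t60.0\t40\t300\t310\t50\n"
    )
    hits = best_hits(demo)
    print(hits)
    print(identity_histogram(hits.values()))
```

Queries with no hit passing the coverage filter simply drop out of the result, so it is worth reporting how many genes had no match alongside the distribution.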
The OrthoMCL solution noted below would be another option, but it may complicate things since multiple sequences may be lumped together in a cluster.
OK, thanks for your answers - I will definitely try all suggestions!