I would like to study conserved synteny around the members of a protein family across 200 bacterial genomes. This family has, for instance, 13 paralogs in one of the strains.
I've designed a strategy to do this, but maybe someone knows a better or faster solution. Perhaps the problem can be tackled with a genomic island detection approach [1].
We have some clues that mobile genetic elements are present near these regions and that the family genes may lie inside genomic islands, so these elements could be playing a role in the evolution of the family.
I've thought of a couple of approaches to this problem:
- Whole-genome comparison of the 200 genomes. However, this would tell me whether the genes are present in each genome, but not where they are located.
- Automate the classic synteny analysis. For each gene flanking a family member on the 5' and 3' sides, run a BLAST search against the genomes/predicted proteomes of each of the 200 strains. If the gene is found, extract the 8 kb on the 5' side and the 8 kb on the 3' side of the hit, and store the sequences from each strain in one file, so I will end up with 200 files. Once this step is done, compare the proteins of each strain and report a presence/absence matrix.
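To make the second approach more concrete, here is a rough sketch of the two bookkeeping steps (taking the 8 kb window around a family member and building the presence/absence matrix). It does not run BLAST. It assumes gene coordinates and ortholog-group labels are already available, for example from an annotation file and a prior clustering step. All gene names, group labels, coordinates and function names below are made up for illustration, and each strain is treated as a single linear contig (circular chromosomes and multi-contig assemblies would need extra handling).

```python
FLANK = 8000  # 8 kb on each side of the family member


def flank_window(start, end, contig_len, flank=FLANK):
    """Return 1-based inclusive (left, right) bounds, clipped to the contig."""
    return max(1, start - flank), min(contig_len, end + flank)


def neighbour_groups(genes, focal_id, contig_len, flank=FLANK):
    """Ortholog groups of genes lying fully inside the window around focal_id."""
    focal = next(g for g in genes if g["id"] == focal_id)
    left, right = flank_window(focal["start"], focal["end"], contig_len, flank)
    return {
        g["group"]
        for g in genes
        if g["id"] != focal_id and g["start"] >= left and g["end"] <= right
    }


def presence_absence(neighbours_by_strain):
    """Build a strain x group 0/1 matrix from {strain: set_of_groups}."""
    groups = sorted(set().union(*neighbours_by_strain.values()))
    matrix = {
        strain: [1 if grp in found else 0 for grp in groups]
        for strain, found in neighbours_by_strain.items()
    }
    return groups, matrix


# Toy annotations (hypothetical): one family member per strain.
strain_a = [
    {"id": "tnpA", "start": 2000, "end": 3000, "group": "transposase"},
    {"id": "famX_1", "start": 10000, "end": 11000, "group": "family"},
    {"id": "abcT", "start": 12000, "end": 13500, "group": "ABC_transporter"},
    {"id": "kinA", "start": 30000, "end": 31000, "group": "kinase"},
]
strain_b = [
    {"id": "famX_1", "start": 1000, "end": 2000, "group": "family"},
    {"id": "abcT", "start": 3000, "end": 4500, "group": "ABC_transporter"},
    {"id": "intA", "start": 9500, "end": 10500, "group": "integrase"},
]

neighbours = {
    "strain_A": neighbour_groups(strain_a, "famX_1", contig_len=50000),
    "strain_B": neighbour_groups(strain_b, "famX_1", contig_len=15000),
}
groups, matrix = presence_absence(neighbours)

print("strain", *groups, sep="\t")
for strain, row in matrix.items():
    print(strain, *row, sep="\t")
```

With 13 paralogs in a strain, the same function would be called once per family member, which gives one row per (strain, paralog) pair instead of one row per strain.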
Thanks
[1] Langille MG, Hsiao WW, Brinkman FS. Detecting genomic islands using bioinformatics approaches. Nat Rev Microbiol. 2010 May;8(5):373-82. doi: 10.1038/nrmicro2350.
http://www.ncbi.nlm.nih.gov/pubmed/20395967
UPDATE: I'm now reading about tools for large-scale synteny analysis, found with this search: ("Synteny/genetics"[Mesh]) AND "Software"[Mesh]