I have a set of ~1000 bacterial secondary metabolite gene clusters that all encode biosynthesis of the same class of compound, based on a couple of marker genes common to all clusters. However, they also have variable regions that encode a wide variety of modifying enzymes. I have nucleotide FASTA files and GenBank files for each cluster, and they are all about 40 kb in length. Does anybody know of a good way to group these gene clusters into gene cluster families (GCFs)?
So far I've been able to get decent results by:
- Grouping based on similarity of the marker gene sequences and using MAUVE to visualize conservation of gene context
- Clustering the context genes into orthologous groups and finding which organisms have orthologs in common
My next idea is to compute a pairwise tblastx distance matrix among all 1000 gene clusters and group based on distance scores using a clustering algorithm (e.g. CLANS).
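To make the distance-matrix idea concrete, here is a minimal sketch of the grouping step only. It does not reproduce CLANS; it swaps in plain single-linkage clustering at a fixed cutoff, which is equivalent to taking connected components of the graph of pairs closer than the cutoff. The cluster names, the distance values, and the 0.3 cutoff are all made up for illustration, and how a tblastx score gets turned into a distance is left open.

```python
from collections import defaultdict


def single_linkage_families(names, distances, cutoff):
    """Group items so that any pair with distance <= cutoff shares a family.

    names     -- list of gene cluster identifiers
    distances -- dict mapping (name_a, name_b) to a pairwise distance
    cutoff    -- maximum distance at which two clusters are linked
    """
    parent = {name: name for name in names}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        if dist <= cutoff:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    families = defaultdict(list)
    for name in names:
        families[find(name)].append(name)
    return sorted(sorted(members) for members in families.values())


# Hypothetical toy input: four clusters with invented distances.
names = ["bgc1", "bgc2", "bgc3", "bgc4"]
distances = {
    ("bgc1", "bgc2"): 0.10,
    ("bgc1", "bgc3"): 0.80,
    ("bgc1", "bgc4"): 0.90,
    ("bgc2", "bgc3"): 0.85,
    ("bgc2", "bgc4"): 0.95,
    ("bgc3", "bgc4"): 0.20,
}
families = single_linkage_families(names, distances, cutoff=0.3)
print(families)
```

Single linkage tends to chain loosely related clusters together, so the cutoff would need tuning against families I already trust from the marker-gene grouping.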
Has anyone attempted a similar task and found a more robust way of grouping into GCFs? These methods require quite a bit of manual fiddling and fail to take into account important features such as synteny. Thanks!
CD-HIT is made for clustering sequences; give that a try.
I think your dataset is going to be too large for a standard multiple sequence alignment followed by hierarchical clustering, but you might be able to cluster on Mash/MinHash distances computed between all pairs of your sequences instead.
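For a feel of what a MinHash distance is, here is a rough pure-Python sketch of the idea, not the Mash tool itself. It hashes every k-mer, keeps the smallest hashes as a sketch, estimates the Jaccard index from two sketches, and converts it to a distance with the Mash formula D = -(1/k) ln(2j / (1 + j)). The hash function (BLAKE2b from `hashlib`), the function names, and the toy sequences are my own choices, and unlike Mash this ignores reverse complements.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Return the set of 64-bit hashes of all k-mers in seq."""
    seq = seq.upper()
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def sketch(seq, k=21, size=1000):
    """Bottom-s sketch: the `size` smallest k-mer hashes, sorted."""
    return sorted(kmer_hashes(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=1000):
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(jaccard, k):
    """Convert a Jaccard estimate to a Mash-style distance in [0, 1]."""
    if jaccard == 0:
        return 1.0
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)


# Hypothetical toy sequences; real clusters would be ~40 kb with k around 21.
k = 5
seq_a = "ACGTTGCA" * 10
seq_b = "G" * 80
same = mash_distance(jaccard_estimate(sketch(seq_a, k), sketch(seq_a, k)), k)
different = mash_distance(jaccard_estimate(sketch(seq_a, k), sketch(seq_b, k)), k)
print(same, different)
```

The resulting pairwise distances can be fed into whatever clustering method you prefer. Because k-mer matching works at the nucleotide level, it may miss clusters that are only similar at the protein level, so treat it as a fast first pass.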