12 months ago
bioinformagician
Hi!
I have over 100 .faa files, each representing a different bacterium. I want to cluster the bacteria according to their proteins. What sort of distance matrix can I generate between .faa files?
Any suggestions for this approach?
Thank you.
I agree with Joe, it's hard to tell what you want. But here are some general suggestions:
Thank you @Joe and dthorbur. I had one FASTA file per bacterium, then I ran Prodigal to extract only the coding sequences (CDS) of each one, obtaining a .faa file. Now I want to cluster the .faa files, obtain N clusters, and then extract a sequence that identifies each cluster.
My goal is not to obtain phylogeny but rather to obtain a sequence that generally represents the cluster. Therefore the first step would be to identify the clusters based on similarity between sequences of each .faa.
I already thought of a way to do this, which would be:
However, this would be computationally difficult given the number of possible proteins. The columns would be very large. So I would like to cluster the sequences based on similarity distances, create clusters, and then get a general sequence representing each cluster.
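For reference, one cheap way to get a distance between whole .faa files without building a protein-per-column table is to compare the sets of short peptide k-mers found in each file. The sketch below is only an illustration under stated assumptions: the function names, the choice of k=5, and the use of Jaccard distance are not from this thread, and the approach is a rough, alignment-free proxy for similarity rather than a replacement for a dedicated pan-genome tool.

```python
from itertools import combinations


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Drop a trailing '*' stop marker if the translation carries one.
            chunks.append(line.rstrip("*"))
    if header is not None:
        yield header, "".join(chunks)


def kmer_set(records, k=5):
    """Collect every peptide k-mer seen in any protein of one genome."""
    kmers = set()
    for _, seq in records:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers


def jaccard_distance(a, b):
    """1 - |A and B| / |A or B|; defined as 0.0 when both sets are empty."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def distance_matrix(genomes, k=5):
    """genomes maps a name to an iterable of FASTA lines (e.g. an open file).

    Returns (names, matrix) where matrix is a symmetric list of lists.
    """
    names = sorted(genomes)
    sets = {name: kmer_set(read_fasta(genomes[name]), k) for name in names}
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = jaccard_distance(sets[names[i]], sets[names[j]])
        matrix[i][j] = matrix[j][i] = d
    return names, matrix
```

With real data, each value in `genomes` could be an open file handle for one .faa file. The resulting square matrix can then be handed to any hierarchical clustering routine, for example SciPy's `linkage` after converting it with `squareform`.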
I think you're trying to reinvent the wheel a little bit. dthorbur is on the right track - what you're essentially describing is something like wgMLST.
I would start with a tool like Roary, which will cluster proteins as part of its process. From those clusters you can later decide how you want to pick a representative example.
Roary will ingest annotated genomes in GFF format, so the first step will actually be to start over and generate new input files. Prodigal is good at what it does, but it's a little bit crude to treat its output as the total protein content of the bacteria. Much better to use a proper annotation pipeline like Prokka.
I think you need to clarify the question a bit.
Is each .faa the proteome of a single bacterium?
What are you aiming to cluster by? Average sequence identity?
Tools like CD-HIT exist specifically for this, but they don't infer any phylogeny etc.
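If you go the CD-HIT route, it writes a `.clstr` text file next to the FASTA of representatives, and in that file the representative of each cluster is the member line ending in `*`. The following is a minimal sketch of reading that file to recover one representative per cluster. The function name `parse_clstr` and the returned dictionary layout are made up for illustration, and the parser assumes the usual `>Cluster N` header and `>name...` member layout.

```python
import re

# Member lines look like: "0\t300aa, >protein_id... *"
# or:                     "1\t298aa, >protein_id... at 97.32%"
MEMBER = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Map cluster number -> {'representative': id, 'members': [ids]}."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = {"representative": None, "members": []}
            continue
        match = MEMBER.search(line)
        if match is None or current is None:
            continue  # skip anything that does not look like a member line
        name = match.group(1)
        clusters[current]["members"].append(name)
        if line.endswith("*"):
            # CD-HIT flags the cluster representative with a trailing '*'.
            clusters[current]["representative"] = name
    return clusters
```

If the protein IDs encode which genome they came from, the `members` lists can also be turned into a genome-by-cluster presence/absence table, which is one possible basis for a distance between genomes.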