I am trying to apply different clustering algorithms to sequence data with lengths of about 240-260. The sequences mainly come from next-generation sequencing technology. Any clustering (OTU) method needs some notion of distance between objects to assess how close two sequences are to each other. So far I have been using the edit distance for this purpose, but now I want to use a more biologically relevant distance.
I have found some measures such as the Kimura distance, the Gamma distance, ... but I don't know which of these distances would fit the type of data that I have. Is the Kimura distance applicable to 16S rRNA sequences? Do you have any other suggestions, or papers that review the application of these evolutionary distance measures to RNA/DNA fragments (not the whole genome)?
Below, I have provided some detailed information about the dataset that I am studying:
One example sequence from the dataset:
>M02127_29_000000000-A9TRU_1_1106_3986_10247
UAC--GG-AA-GGU---CCG-G-G-C-G-U--U--AU-C-CGG-AU----UU-A--U-U--GG-GU---UU-A----AA-GG-GA-GC--G-UA-G-G-C-C-G--G-UC-U-U-U---AA-G-C-G-U--G-C-C-G--UG--A-AA-UU-U-U-GU-G-G--CU-C-AA-C-C-A-U-G-A-G-A-G--U-G-C-G-G-C-G--CGA-A-CU-G-G--AG-A-C-C-U-U-G-A-G-U--G-C-GC--GG-A-A-G-G-C-A--GG-C--GG-A--AUU--CG-U-G-GU--GU-A-G-CG-GU-G-A-A-A-UG-C-UU-AG--AU-A-UC-A-C-G-A-A-G-A-AC-C-CC--GA-U-U-GC-GAA-GG-C-A-G--C-C-U-G--CCG-C--AG-C-G-U-U-----A-C-U--GA--CG-C-U-G-A-AG-C-U-CG-A--AA-G-C-G-CG--GG-U--AU-C-G-AA-CAGG
Here are the taxonomy levels for 10 of the sequences that I have:
Bacteria Bacteroidetes Bacteroidia Bacteroidales Prevotellaceae unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Bacteroidaceae Bacteroides
Bacteria Firmicutes Clostridia Clostridiales unclassified unclassified
Bacteria Firmicutes Clostridia Clostridiales Lachnospiraceae unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Porphyromonadaceae unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Rikenellaceae Alistipes
Bacteria unclassified unclassified unclassified unclassified unclassified
Bacteria Bacteroidetes Bacteroidia Bacteroidales Porphyromonadaceae unclassified
Bacteria Firmicutes Clostridia Clostridiales Ruminococcaceae unclassified
Bacteria unclassified unclassified unclassified unclassified unclassified
I am confused about what your actual question is. You mention you want to cluster 16S sequences, but then you also refer to Kimura and Gamma distances, which are typically used as evolutionary measures of presumed sequence change. These are different analyses that address different research questions.
Can you tell us something about your research question and where your data comes from (what type it is: mixed sample, single bacterial genome)? This will help us guide you towards what you actually want to do.
Hi Josh, thanks for your comment! I have updated my question and provided an example sequence and some taxonomy levels from my data.
I want to use a proper distance measure for clustering biological sequences into OTUs. I want to know which of these distance metrics is suitable for the kind of data that I have.
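For reference, here is a minimal sketch of how the Kimura distance mentioned in the question could be computed for two sequences taken from the same alignment, such as the gapped example above. It uses the standard Kimura two-parameter formula d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions. The function name `k2p_distance`, the choice to skip every column that contains a gap or ambiguous base, and returning infinity for saturated pairs are assumptions made for illustration, not something taken from a particular tool.

```python
import math

PURINES = {"A", "G"}
VALID = {"A", "C", "G", "U"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences.

    Assumption: columns where either sequence has a gap ('-') or a
    non-ACGU symbol are ignored (pairwise deletion).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")

    # Treat DNA and RNA alphabets alike by mapping T to U.
    seq_a = seq_a.upper().replace("T", "U")
    seq_b = seq_b.upper().replace("T", "U")

    compared = transitions = transversions = 0
    for a, b in zip(seq_a, seq_b):
        if a not in VALID or b not in VALID:
            continue  # gap or ambiguous base: skip this column
        compared += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine

    if compared == 0:
        raise ValueError("no comparable columns")

    p = transitions / compared
    q = transversions / compared
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        return float("inf")  # assumption: report saturated pairs as infinite
    return -0.5 * math.log(term1) - 0.25 * math.log(term2)
```

Calling `k2p_distance(a, b)` on every pair of aligned reads would give a distance matrix that can be fed to the same clustering code that currently uses the edit distance. Whether this correction is appropriate for OTU clustering of short 16S fragments is exactly the open question here, so the sketch only shows the mechanics.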