6.0 years ago
ATCG
How can I
- Compare long genomic sequences (e.g. 1-15 kb) and group them into families
- Look for a specific k-mer within these sequences
- Find the most frequently shared k-mers
Thank you!
You can use CD-HIT to cluster related sequences by sequence identity. Identify the clusters, extract the sequences that belong to each cluster, and then run motif-finding tools on each cluster in turn.
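For the k-mer parts of the question, here is a minimal sketch of what you could run on each cluster once you have the sequences in hand. It is plain Python rather than a dedicated motif finder, and the function names (`kmers`, `shared_kmers`, `find_kmer`) and the toy sequences are made up for illustration. It counts how many sequences in a cluster contain each k-mer and reports the positions of a specific k-mer.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct k-mers in seq (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(seqs, k, top=5):
    """Count, for each k-mer, how many sequences contain it at least once,
    and return the `top` most widely shared ones."""
    counts = Counter()
    for s in seqs:
        counts.update(kmers(s, k))  # a set, so each sequence counts once
    return counts.most_common(top)


def find_kmer(seq, kmer):
    """Return 0-based start positions of kmer in seq, overlaps included."""
    seq, kmer = seq.upper(), kmer.upper()
    positions = []
    start = seq.find(kmer)
    while start != -1:
        positions.append(start)
        start = seq.find(kmer, start + 1)
    return positions


# Toy cluster (hypothetical sequences, far shorter than real 1-15 kb input).
cluster = ["ACGTACGT", "TTACGTAA", "GGACGTCC"]
print(shared_kmers(cluster, k=4, top=3))
print(find_kmer(cluster[0], "ACGT"))
```

This only looks at the forward strand; for real genomic data you would probably also want to consider reverse complements, which this sketch does not do.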
You might consider using Mash distances and defining a sequence-similarity cutoff. Mash distances are computed from the k-mer content of each sequence (via MinHash sketches), I believe, so you'd go a long way toward addressing all these points at once with that approach.
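To make the idea of a distance cutoff concrete, here is a rough sketch in plain Python. It is not Mash itself: it computes the exact Jaccard index over full k-mer sets instead of a MinHash sketch, then applies the Mash distance formula D = -(1/k) ln(2j / (1 + j)) and links any pair of sequences below the cutoff (single linkage). The function names, the toy sequences, and the cutoff of 0.1 are all assumptions for illustration.

```python
import math


def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard index of two k-mer sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_like_distance(j, k):
    """Mash distance formula applied to an exact Jaccard index j."""
    if j == 0:
        return 1.0  # no shared k-mers: treat as maximally distant
    return -math.log(2 * j / (1 + j)) / k


def group_by_cutoff(seqs, k, cutoff):
    """Single-linkage grouping: sequences whose pairwise distance is
    <= cutoff end up in the same family (union-find over all pairs)."""
    names = list(seqs)
    sets = {n: kmer_set(seqs[n], k) for n in names}
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = mash_like_distance(jaccard(sets[a], sets[b]), k)
            if d <= cutoff:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Toy input (hypothetical names and sequences).
seqs = {"a": "ACGTACGTTGCA", "b": "ACGTACGTTGCC", "c": "GGGGGGCCCCCC"}
print(group_by_cutoff(seqs, k=4, cutoff=0.1))
```

The all-against-all loop is quadratic in the number of sequences, which is exactly where the real tool's sketching pays off; the right k and cutoff depend on your data and would need tuning.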