Extracting related sequences from a FASTA file
6.0 years ago
ATCG ▴ 400

How can I:

  1. Compare long genomic sequences (e.g. 1-15 kb) and group them into families
  2. Look for a specific k-mer within these sequences
  3. Find the most frequently shared k-mers

Thank you!

Sequence comparison Data mining kmer • 1.2k views

You can use CD-HIT to cluster related sequences by sequence identity. Identify the clusters and the sequences that belong to each one, then run a motif-finding tool on each cluster in turn.
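To get from the clustering output to "the sequences for each cluster", something like the sketch below might help. It assumes the usual `.clstr` text layout that CD-HIT writes next to its output FASTA (a `>Cluster N` header followed by one line per member, with the sequence name between `>` and `...`); the function name `parse_clstr` and the toy records are made up for illustration, so check the layout against your own file before relying on it. For nucleotide input the relevant program in the CD-HIT package is `cd-hit-est`.

```python
import re
from collections import defaultdict

# Assumed shape of member lines in a .clstr file, e.g.
#   0<TAB>1500nt, >seq1... *
#   1<TAB>1432nt, >seq2... at +/98.50%
# The sequence name is taken as the text between ">" and "...".
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return {cluster_id: [sequence names]} from .clstr-style lines."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line: ">Cluster 12"
            current = int(line.split()[1])
        elif current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


# Toy input (hypothetical names and lengths), not real CD-HIT output.
example = [
    ">Cluster 0",
    "0\t1500nt, >seq1... *",
    "1\t1432nt, >seq2... at +/98.50%",
    ">Cluster 1",
    "0\t9800nt, >seq3... *",
]

clusters = parse_clstr(example)
for cluster_id, members in clusters.items():
    print(cluster_id, members)
```

With a real file you would pass an open file handle instead of the list, then pull the member sequences of each cluster out of the original FASTA and feed them to the motif finder one cluster at a time.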

You might consider computing Mash distances between the sequences and defining a distance cutoff below which two sequences count as the same family.
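As a rough illustration of the idea, the sketch below computes the Mash distance formula, D = -(1/k) ln(2j / (1 + j)), from a Jaccard index j and groups sequences whose distance falls under a cutoff. Two things here are assumptions rather than what Mash itself does: the Jaccard index is computed on the full k-mer sets (Mash estimates it from MinHash sketches, and also handles reverse complements), and the grouping is a simple single-linkage pass. The function names, the toy sequences, and the values of `k` and `cutoff` are placeholders.

```python
import math


def kmer_set(seq, k):
    """All distinct k-mers of a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_distance(j, k):
    """Mash distance for Jaccard index j and k-mer size k, clamped to [0, 1]."""
    if j <= 0:
        return 1.0
    d = -math.log(2 * j / (1 + j)) / k
    return max(0.0, min(1.0, d))


def group_by_distance(seqs, k, cutoff):
    """Single-linkage grouping: link any pair with distance <= cutoff."""
    names = list(seqs)
    sets = {name: kmer_set(seqs[name], k) for name in names}
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if mash_distance(jaccard(sets[a], sets[b]), k) <= cutoff:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Toy sequences; real 1-15 kb inputs would call for a much larger k.
toy = {"a": "ACGTACGTAC", "b": "ACGTACGTAC", "c": "TTTTGGGGCC"}
families = group_by_distance(toy, k=4, cutoff=0.1)
print(families)
```

The all-against-all loop is quadratic in the number of sequences, which is the main reason to use the real tool with sketches once the dataset grows.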

I believe Mash distances are inherently based on the k-mer content of the sequences, so this approach would go a long way toward addressing all three points at once.
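For points 2 and 3 specifically, plain k-mer counting may already be enough. The following is a minimal sketch, assuming the sequences are already loaded into a dict of name to sequence and that "most frequently shared" means "present in the largest number of sequences"; `find_kmer` and `most_shared_kmers` are hypothetical helper names, and reverse complements are not considered.

```python
from collections import Counter


def find_kmer(seqs, kmer):
    """Return {name: [0-based start positions]} for sequences containing kmer."""
    kmer = kmer.upper()
    hits = {}
    for name, seq in seqs.items():
        seq = seq.upper()
        positions = []
        start = seq.find(kmer)
        while start != -1:
            positions.append(start)
            # Step by one so overlapping occurrences are reported too.
            start = seq.find(kmer, start + 1)
        if positions:
            hits[name] = positions
    return hits


def most_shared_kmers(seqs, k, top=5):
    """k-mers ranked by the number of sequences they occur in."""
    counts = Counter()
    for seq in seqs.values():
        seq = seq.upper()
        # A set, so each sequence contributes at most 1 per k-mer.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts.most_common(top)


# Toy sequences for illustration only.
toy = {"s1": "ACGTACGT", "s2": "TTACGTAA", "s3": "GGGGACGA"}
hits = find_kmer(toy, "ACGT")
shared = most_shared_kmers(toy, k=3, top=1)
print(hits)
print(shared)
```

Run per family (for example per cluster from a previous clustering step), this gives the k-mers that characterise each group rather than the dataset as a whole.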
