10.3 years ago
Owen S.
Can anyone recommend a software solution to do this:
- Input: about 100,000 short peptide sequences -- unaligned -- of varying lengths, but mostly under 20 residues.
- Output: amino-acid profiles (e.g. sequence logo map) describing similar over-represented kmers (say, 3-, 4-, or 5-mers).
I can think of ways to tackle this myself*, but why re-invent the wheel? Hoping that my question and any discussion that follows may also help others.
Thanks!
PS. My approach would be something like this:
- count all unique kmers
- calculate pairwise distances
- select clusters (clades) of similar kmers
- use these kmers (and their counts) to build sequence logo maps
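For anyone who wants to try the steps above, here is a rough sketch in plain Python. It is only one way to read the outline: the choice of Hamming distance, the single-linkage grouping, and all names (`count_kmers`, `hamming`, `cluster_kmers`, `profile`, `max_dist`, `min_count`) are my own assumptions, not an established tool or a tested pipeline.

```python
from collections import Counter
from itertools import combinations


def count_kmers(peptides, k):
    """Count every k-mer (overlapping windows) across all peptides."""
    counts = Counter()
    for seq in peptides:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def hamming(a, b):
    """Number of mismatching positions between two equal-length k-mers."""
    return sum(x != y for x, y in zip(a, b))


def cluster_kmers(kmers, max_dist=1):
    """Single-linkage grouping (assumed): join k-mers within max_dist mismatches."""
    parent = {km: km for km in kmers}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(kmers, 2):
        if hamming(a, b) <= max_dist:
            parent[find(a)] = find(b)

    clusters = {}
    for km in kmers:
        clusters.setdefault(find(km), []).append(km)
    return list(clusters.values())


def profile(cluster, counts, k):
    """Per-position amino-acid frequencies, weighted by each k-mer's count."""
    total = sum(counts[km] for km in cluster)
    columns = [Counter() for _ in range(k)]
    for km in cluster:
        for pos, aa in enumerate(km):
            columns[pos][aa] += counts[km]
    return [{aa: n / total for aa, n in col.items()} for col in columns]


if __name__ == "__main__":
    # Toy input; real data would be the ~100,000 peptides.
    peptides = ["ACDEF", "ACDEG", "KACDEH", "WWWWW"]
    k, min_count = 3, 2  # min_count is an arbitrary cutoff for "over-represented"
    counts = count_kmers(peptides, k)
    frequent = [km for km, n in counts.items() if n >= min_count]
    for cluster in cluster_kmers(frequent, max_dist=1):
        print(cluster, profile(cluster, counts, k))
```

The per-position frequencies from `profile` are what a logo tool would take as input. Note that the pairwise step is quadratic in the number of distinct kmers, so filtering to frequent kmers first (as in the usage above) matters at 4- or 5-mers. A raw count cutoff also ignores background amino-acid frequencies, so a proper enrichment test would be needed to call something over-represented.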
Thanks, but my question relates to peptide, not nucleotide, sequences. (The Biostrings function you suggested only works with nucleotide sequences.)