Previous title was "sequence similarity analysis between groups of closely related DNA sequences". It was edited for clarity.
Hello,
I'll start by trying to explain my problem briefly. The data: we have a list of a few hundred short sequences (pre-miRNA sequences, ~100 nt long). We then expanded the list by modifying those sequences (sometimes changing the whole sequence, sometimes only a few nucleotides). So now we have a few thousand short sequences, some of them very similar to each other. We ran some experiments with those sequences and separated them into groups.
The goal: we want to find out how the groups of sequences differ from one another.
I thought tools for motif analysis might help, but from a quick search it looks like most of them are geared towards genome analysis and not towards individual sequences.
Any ideas or directions for me to search?
Thanks,
Artem.
EDIT: additional information
I'll try to add more details to explain the problem more clearly. Here is an example from our fasta:
>Reversed_hsa-mir-6791
GACGGCCTCTGGTTCCTCCGTCTCCGTCAAGTCTGGAGAAAGGCGGACGGGTCGGGGTCCCCAGACC
>AllPreMir_hsa-mir-6791
CCAGACCCCTGGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCCTCCTTGGTCTCCGGCAG
>SeedChanged_hsa-mir-6791_5p_seed_changed
CCAGACCCGACGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCCTCCTTGGTCTCCGGCAG
>SeedChanged_hsa-mir-6791_3p_seed_changed
CCAGACCCCTGGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCGAGCTTGGTCTCCGGCAG
>Scrambled_hsa-mir-6791
TCCGGCTGCCCCAGACCCCTGGGGCTGGGCAGGCGGAAAGAGGTCTGAATCTGCCTCCTTGGTCCAG
>Reversed_hsa-mir-18a
ACGGTCTTCCTCGTGAATCCCGTCATCTACGATTAGATGAAGTGATAGACGTGATCTACGTGGAATCTTGT
>AllPreMir_hsa-mir-18a
TGTTCTAAGGTGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCA
>SeedChanged_hsa-mir-18a_5p_seed_changed
TGTTCTAACCAGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCA
>SeedChanged_hsa-mir-18a_3p_seed_changed
TGTTCTAAGGTGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTCGGCTAAGTGCTCCTTCTGGCA
>Scrambled_hsa-mir-18a
TGTTCTAAGGTGCATCTAGTAAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCATAGTGGCAG
So we have the original sequence, tagged "AllPreMir", and we designed a few modifications to create 4 additional sequences from it: one in reversed order, one in random (scrambled) order, and two with only 3 nucleotides changed.
After designing the sequences we ran a few experiments and obtained different groups of sequences (based on the experimental results). For example, group one would be:
>SeedChanged_hsa-mir-6791_5p_seed_changed
CCAGACCCGACGGGCTGGGCAGGCGGAAAGAGGTCTGAACTGCCTCTGCCTCCTTGGTCTCCGGCAG
>AllPreMir_hsa-mir-18a
TGTTCTAAGGTGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCA
>SeedChanged_hsa-mir-18a_5p_seed_changed
TGTTCTAACCAGCATCTAGTGCAGATAGTGAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCA
And the second group would have:
>Scrambled_hsa-mir-6791
TCCGGCTGCCCCAGACCCCTGGGGCTGGGCAGGCGGAAAGAGGTCTGAATCTGCCTCCTTGGTCCAG
>Reversed_hsa-mir-18a
ACGGTCTTCCTCGTGAATCCCGTCATCTACGATTAGATGAAGTGATAGACGTGATCTACGTGGAATCTTGT
>Scrambled_hsa-mir-18a
TGTTCTAAGGTGCATCTAGTAAAGTAGATTAGCATCTACTGCCCTAAGTGCTCCTTCTGGCATAGTGGCAG
Now my goal is to find meaningful differences between the two groups (the actual files would have hundreds of sequences, some of which could be very similar). Such differences could be different k-mers, different "consensus" sequences, different motifs, or something else. In a perfect world there would be a tool (or several tools) that performs those analyses on a fasta file and writes the metrics to a file, so that I could compare them between the different groups of sequences.
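To make the k-mer idea concrete, here is a minimal sketch in plain Python (no external libraries). It is not an established tool, just an illustration: the function names (parse_fasta, kmer_frequencies, kmer_differences) and the default k=4 are my own assumptions. It counts k-mers per group, normalises them to relative frequencies, and ranks k-mers by how much their frequency differs between two groups.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # Sequences may be wrapped over several lines; join them.
            records[name] += line.upper()
    return records


def kmer_frequencies(seqs, k=4):
    """Return the relative frequency of every k-mer across a group of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def kmer_differences(group_a, group_b, k=4):
    """Rank k-mers by absolute frequency difference (group_a minus group_b)."""
    freq_a = kmer_frequencies(group_a, k)
    freq_b = kmer_frequencies(group_b, k)
    diffs = {
        kmer: freq_a.get(kmer, 0.0) - freq_b.get(kmer, 0.0)
        for kmer in set(freq_a) | set(freq_b)
    }
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)
```

You would read each group's fasta file, pass the text to parse_fasta, and call kmer_differences on the two lists of sequences. Because many of the sequences are near-duplicates of each other, a large difference for one k-mer may just reflect one over-represented pre-miRNA family, so it is probably worth checking which sequences contribute to the top hits.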
Have you tried different metrics that would let you quantify the degree of variance between the groups of sequences? For example: pairwise nucleotide differences within a group and between different groups? Or maybe build a "consensus" sequence for each group and then check its pairwise nucleotide difference against the "consensus" sequences of the other groups?
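A rough sketch of the within-group versus between-group comparison suggested above, in plain Python. Since the sequences in the example have different lengths, this uses a length-normalised edit (Levenshtein) distance instead of a simple position-by-position count; that choice, and the function names, are assumptions of mine rather than a standard recipe. With a few thousand sequences of ~100 nt the all-against-all comparison will be slow in pure Python, so treat this as an illustration of the metric.

```python
from itertools import combinations, product


def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer sequence (0 to 1)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def mean_within(group):
    """Mean normalised distance over all unordered pairs inside one group."""
    pairs = list(combinations(group, 2))
    if not pairs:
        return 0.0
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)


def mean_between(group_a, group_b):
    """Mean normalised distance over all pairs drawn from two different groups."""
    pairs = list(product(group_a, group_b))
    if not pairs:
        return 0.0
    return sum(normalized_distance(a, b) for a, b in pairs) / len(pairs)
```

If mean_between for two groups is clearly larger than mean_within for each of them, the groups are separated at the sequence level; if the values are similar, the experimental grouping is probably driven by something subtler than overall similarity (for example a few positions or the secondary structure).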
Hello, thanks for the comment. I don't have any particular metrics in mind, and I hoped to get suggestions from this question. I thought about checking for differences in k-mer abundance, GC content and consensus sequence, though maybe users of the forum can think of something different or more specific. Even more helpful would be some kind of tool or script that can check those things and report the statistics to a file.
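For the "report those statistics to a file" part, something along these lines might be a starting point. It is only a sketch in plain Python: the column names, the tab-separated layout, and the helper names (gc_content, summarize_group, write_report) are assumptions made for illustration. It computes the number of sequences, mean length and mean GC content per group and writes one row per group.

```python
import csv
import io
from statistics import mean

FIELDS = ["group", "n_sequences", "mean_length", "mean_gc"]


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def summarize_group(name, seqs):
    """Summary statistics for one group of sequences."""
    if not seqs:
        return {"group": name, "n_sequences": 0, "mean_length": 0, "mean_gc": 0.0}
    return {
        "group": name,
        "n_sequences": len(seqs),
        "mean_length": mean(len(s) for s in seqs),
        "mean_gc": mean(gc_content(s) for s in seqs),
    }


def write_report(rows, handle):
    """Write the per-group summaries as a tab-separated table to an open file handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Toy sequences for demonstration only; replace with the real groups.
    rows = [
        summarize_group("group1", ["ATGC", "GGCC"]),
        summarize_group("group2", ["ATAT"]),
    ]
    buffer = io.StringIO()
    write_report(rows, buffer)
    print(buffer.getvalue())
```

In real use you would pass an open file (for example open("report.tsv", "w", newline="")) instead of the StringIO buffer, and further columns such as top k-mers could be added to FIELDS in the same way.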
You could use clumpify.sh from the BBMap suite to clump the sequences based on their sequence similarity (A: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files). This will work with both fastq and fasta data. You could then take representative sequences from the group to build "phylogenetic trees" that can display their potential relationships.
Thanks for the suggestion. I briefly read the description of the tool and it looks like it could be useful for my problem, though I'm not sure exactly how to use it. Let's say I have two fasta files with 200 sequences in each, and I want to know whether there are some metrics by which the sequences in those two files differ. I run the tool on both files and receive a file where similar sequences are clumped together. How would you suggest proceeding from there?