I have a multi-FASTA file containing hundreds of gene clusters. I would like to align them and calculate a percent identity. I tried MUMmer, using the same multi-FASTA as both query and reference. At first I thought it worked really well, but then I noticed something odd.
There are two clusters in the multi-FASTA that I know are quite similar. If I take them individually (cluster1 and cluster2), MUMmer can align them. However, when I extract the information from the .delta file, I get 3 alignments covering different regions of the sequences. For me this was the first odd thing. Why don't I get one overall alignment rather than 3 split ones?
EDIT
Just to clarify this first point:
[S1]   [E1]   [S2]   [E2]   [LEN 1]  [LEN 2]  [% IDY]  [LEN R]  [LEN Q]  [COV R]  [COV Q]  [FRM]  [TAGS]
2207   15379  1      13169  13173    13169    94.28    63473    63289    20.75    20.81    1 1    cluster1 cluster2
15379  45622  15378  45438  30244    30061    93.97    63473    63289    47.65    47.50    1 1    cluster1 cluster2
49149  63473  47597  61917  14325    14321    93.71    63473    63289    22.57    22.63    1 1    cluster1 cluster2
As you can see from the coordinates ([S1/2], [E1/2]), the first and second alignments are quite close. Even if I allow a gap of over 5000 (-b 5000), the result does not change.
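To make "quite close" concrete, here is a minimal Python sketch that measures the unaligned stretch between consecutive alignments in the rows above and computes a length-weighted identity over the three pieces. It assumes all alignments are on the forward strand (as the 1 1 frame columns indicate) and that rows are whitespace-separated without a header; the function names are made up for illustration and are not part of MUMmer.

```python
# Rows copied from the coordinate table above (header line removed).
ROWS = """\
2207 15379 1 13169 13173 13169 94.28 63473 63289 20.75 20.81 1 1 cluster1 cluster2
15379 45622 15378 45438 30244 30061 93.97 63473 63289 47.65 47.50 1 1 cluster1 cluster2
49149 63473 47597 61917 14325 14321 93.71 63473 63289 22.57 22.63 1 1 cluster1 cluster2
"""


def parse_rows(text):
    """Turn whitespace-separated coordinate rows into a list of dicts."""
    alignments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        alignments.append({
            "s1": int(fields[0]), "e1": int(fields[1]),
            "s2": int(fields[2]), "e2": int(fields[3]),
            "len1": int(fields[4]), "len2": int(fields[5]),
            "idy": float(fields[6]),
        })
    return alignments


def gaps_between(alignments):
    """Unaligned bases between consecutive alignments as (reference, query).

    A negative value means the two alignments overlap by that many bases.
    Assumes forward-strand alignments, sorted here by reference start.
    """
    ordered = sorted(alignments, key=lambda a: a["s1"])
    return [
        (nxt["s1"] - prev["e1"] - 1, nxt["s2"] - prev["e2"] - 1)
        for prev, nxt in zip(ordered, ordered[1:])
    ]


def weighted_identity(alignments):
    """Percent identity averaged over alignments, weighted by reference length."""
    total = sum(a["len1"] for a in alignments)
    return sum(a["idy"] * a["len1"] for a in alignments) / total


if __name__ == "__main__":
    alns = parse_rows(ROWS)
    print(gaps_between(alns))
    print(round(weighted_identity(alns), 2))
```

For these rows, the first two alignments touch on cluster1 (they overlap by one base) but are separated by 2208 unaligned bases on cluster2, and the second and third are separated by 3526 bases on cluster1 and 2158 on cluster2. The length-weighted identity over the three pieces comes out at roughly 93.98%.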
The second odd thing is that the above-mentioned clusters (1 and 2) are not aligned when they are in the multi-FASTA containing hundreds of sequences. If I run MUMmer on the multi-FASTA and then look at the statistics, the alignment between these 2 clusters is simply missing. So they can be aligned when I use each of them as a single FASTA, but not when they are together with other sequences in the multi-FASTA. What am I missing here?
I also posted this question on the MUMmer SourceForge help forum, but I am not sure how often it is updated. So, sorry in advance for double posting!
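One way to confirm which pairs are actually missing is to list every reference/query pair that appears in the coordinate table from the all-vs-all run. The sketch below assumes the table is plain text with the two sequence tags as the last two fields of each data row, as in the table above; `aligned_pairs` and the `cluster3` row are hypothetical and only there for illustration.

```python
def aligned_pairs(coords_text):
    """Collect (reference, query) tag pairs from coordinate-table rows."""
    pairs = set()
    for line in coords_text.splitlines():
        fields = line.split()
        # Data rows start with an integer coordinate; skip headers and rules.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        pairs.add((fields[-2], fields[-1]))
    return pairs


# Illustrative input: one real row from above plus a made-up cluster3 row.
SAMPLE = """\
[S1] [E1] [S2] [E2] [LEN 1] [LEN 2] [% IDY] [LEN R] [LEN Q] [COV R] [COV Q] [FRM] [TAGS]
2207 15379 1 13169 13173 13169 94.28 63473 63289 20.75 20.81 1 1 cluster1 cluster2
10 500 20 510 491 491 99.00 1000 1000 49.10 49.10 1 1 cluster3 cluster3
"""

if __name__ == "__main__":
    pairs = aligned_pairs(SAMPLE)
    for wanted in [("cluster1", "cluster2"), ("cluster1", "cluster3")]:
        status = "present" if wanted in pairs else "missing"
        print(wanted, status)
```

Running this on the table from the multi-FASTA run and on the table from the two-sequence run shows exactly which pairs drop out between the two.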
Wrong tool for the job. CD-HIT, for example, would suit your needs far better.
Thanks for the answer. Why do you think so?
MUMmer is designed to find maximal matches between genomic sequences. Your input data sounds like something completely different.
Well, my input data are indeed portions of genomes, and I want to find maximal matches among them. So to me it sounds like a good fit.
MUMmer assumes that your query and reference files each represent a single, similar(ish) genome. If you read through the manual, you'll find, for example, a section titled "Use cases and walk-throughs". It's all about complete or near-complete genomic sequences, not fractions of multiple genomes. MUMmer is great for one thing: whole-genome alignments. For everything else, choose another program.