Question

Align multiple gene clusters: mummer skips random sequences in multi fasta files

7.8 years ago

dago ★ 2.8k

I have a multi-FASTA file containing hundreds of gene clusters. I would like to align them and calculate a % identity. I tried MUMmer, using the same multi-FASTA file as both query and reference. At first I thought it worked really well, but then I noticed something odd.

There are two clusters in the multi-FASTA file that I know are quite similar. If I take them individually (cluster1 and cluster2), MUMmer can align them. However, when I extract the information from the .delta file, I get 3 alignments covering different regions of the sequences. This was the first odd thing: why don't I get one overall alignment rather than 3 split ones?

EDIT

Just to clarify this first point:

[S1]   [E1]   [S2]   [E2]   [LEN 1]  [LEN 2]  [% IDY]  [LEN R]  [LEN Q]  [COV R]  [COV Q]  [FRM]  [TAGS]
2207   15379  1      13169  13173    13169    94.28    63473    63289    20.75    20.81    1  1   cluster1  cluster2
15379  45622  15378  45438  30244    30061    93.97    63473    63289    47.65    47.50    1  1   cluster1  cluster2
49149  63473  47597  61917  14325    14321    93.71    63473    63289    22.57    22.63    1  1   cluster1  cluster2

As you can see from the coordinates ([S1]/[S2], [E1]/[E2]), the first and second alignments are quite close. Even if I allow a gap of over 5000 (-b 5000), the result does not change.
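To make "quite close" concrete, here is a minimal Python sketch that measures the distance between consecutive alignments on each sequence, using only the rows from the table above. The function names are hypothetical, and the sketch assumes all alignments are on the forward strand (frame 1 1, as in the table) and uses the reference-side lengths as weights for an overall % identity.

```python
# Rows copied from the show-coords output above:
# (S1, E1, S2, E2, %IDY); S1/E1 are on cluster1, S2/E2 on cluster2.
alignments = [
    (2207, 15379, 1, 13169, 94.28),
    (15379, 45622, 15378, 45438, 93.97),
    (49149, 63473, 47597, 61917, 93.71),
]


def gaps_between(rows):
    """Return (reference_gap, query_gap) for each pair of consecutive rows.

    A negative value means the two alignments overlap by that many bases.
    """
    ordered = sorted(rows)  # sort by reference start
    result = []
    for prev, nxt in zip(ordered, ordered[1:]):
        ref_gap = nxt[0] - prev[1] - 1
        qry_gap = nxt[2] - prev[3] - 1
        result.append((ref_gap, qry_gap))
    return result


def weighted_identity(rows):
    """Length-weighted mean % identity, using the reference-side lengths."""
    lengths = [e1 - s1 + 1 for s1, e1, *_ in rows]
    total = sum(lengths)
    return sum(n * row[4] for n, row in zip(lengths, rows)) / total


gaps = gaps_between(alignments)
for ref_gap, qry_gap in gaps:
    print(f"reference gap: {ref_gap} bp, query gap: {qry_gap} bp")
print(f"length-weighted identity: {weighted_identity(alignments):.2f}%")
```

By this calculation the first two alignments touch on cluster1 (they share position 15379) but are separated by roughly 2.2 kb on cluster2, so the two sequences are not equally close at that junction. That may be related to why the alignments are reported separately, though this sketch cannot confirm it.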

The second odd thing is that the above-mentioned clusters (1 and 2) are not aligned when they are in the multi-FASTA file containing hundreds of sequences. If I run MUMmer on the multi-FASTA file and then look at the statistics, the alignment between these 2 clusters is simply missing. So they can be aligned when I use each of them as a single FASTA file, but not when they are together with other sequences in the multi-FASTA file. What am I missing here?

I also posted this question on the MUMmer SourceForge help forum, but I am not sure how often it is updated. So, sorry in advance for double posting!
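One way to confirm that the pair is really absent from the multi-FASTA run is to list which sequence pairs appear in the .delta file. The sketch below is a hypothetical helper, not part of MUMmer. It assumes the usual delta layout, in which each aligned pair is introduced by a header line of the form `>refID qryID lenR lenQ`; the file name `out.delta` is a placeholder.

```python
def pairs_in_delta(lines):
    """Collect (reference_id, query_id) pairs from delta-file header lines.

    Assumes each aligned pair is introduced by a line such as
    '>cluster1 cluster2 63473 63289'.
    """
    pairs = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if len(fields) >= 2:
                pairs.add((fields[0], fields[1]))
    return pairs


def has_pair(pairs, a, b):
    """True if a and b were aligned in either reference/query orientation."""
    return (a, b) in pairs or (b, a) in pairs


# Hypothetical usage; 'out.delta' is a placeholder file name.
# with open("out.delta") as handle:
#     pairs = pairs_in_delta(handle)
# print(has_pair(pairs, "cluster1", "cluster2"))
```

Checking both orientations matters here because the same file is used as reference and query, so the pair could be recorded either way round.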

alignment sequence gene genome • 2.4k views

Wrong tool for the job. CD-HIT, for example, would suit your needs far better.

7.8 years ago by 5heikki 11k

Thanks for the answer. Why do you think so?

7.8 years ago by dago ★ 2.8k

MUMmer is designed to find maximal matches between genomic sequences. Your input data sounds like something completely different.

7.8 years ago by 5heikki 11k

Well, my input data are indeed portions of genomes, and I want to find maximal matches among them. So to me it sounds like quite a good fit.

7.8 years ago by dago ★ 2.8k

MUMmer assumes that your query and reference files each represent a single genome, and that the two genomes are similar(ish). If you read through the manual, you'll find, for example, a section titled "Use cases and walk-throughs". It's all about complete or near-complete genomic sequences, not fractions of multiple genomes. MUMmer is great for one thing: whole-genome alignments. For everything else, choose another program.

7.8 years ago by 5heikki 11k