Question

Muscle Multiple Sequence Alignment: How To Allow Alignment With A Sequence That Is Just Gaps

1

12.6 years ago

Klugman ▴ 20

I am currently using Clustal to align orthologous proteins from about 50 species, but would like to use MUSCLE instead.
Since I am examining thousands of proteins, I use the Linux binary of MUSCLE.

The Problem

MUSCLE appears not to accept "empty" input sequences. For example, Protein X is not present in, say, the bear, and this is represented as a sequence consisting only of dashes/gaps:

Input

>ProteinX_human 
ABC
>ProteinX_cat 
A-C
>ProteinX_bear
---

Output

The MUSCLE output alignment file will not include >ProteinX_bear.

Question

How do I ensure that MUSCLE outputs the alignment with >ProteinX_bear showing only dashes/gaps (-) throughout? I cannot find any information about how to achieve this in the MUSCLE manual, although I am new to bioinformatics and could simply be blind to it, so to speak. It is very important for my downstream analysis that species lacking AAs are included in the alignment output.

Thank you for your help; I hope my question is clear.

multiple sequence alignment • 6.6k views

score 2 · Answer 1 · 2012-04-17

2

12.6 years ago

Andreas ★ 2.5k

It's hard to imagine why you would need this feature. Anyway, have you tried replacing the gaps in the gap-only sequences with Xs (X = any amino acid)? That fake sequence would also carry no information, but MUSCLE will at least report the all-X sequence in its output.
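As a rough illustration of the replacement suggested here, the following Python sketch rewrites only those sequences that consist entirely of dashes and leaves every other record untouched. The function names (read_fasta, mask_gap_only, write_fasta) are made up for this example, and the input is the three-sequence example from the question; it is a minimal sketch, not a tested pipeline step.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def mask_gap_only(records, placeholder="X"):
    """Replace sequences made up solely of '-' with a run of placeholders."""
    masked = []
    for header, seq in records:
        if seq and set(seq) == {"-"}:
            seq = placeholder * len(seq)
        masked.append((header, seq))
    return masked


def write_fasta(records):
    """Serialise (header, sequence) pairs back to FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


example = """>ProteinX_human
ABC
>ProteinX_cat
A-C
>ProteinX_bear
---
"""

masked = mask_gap_only(read_fasta(example))
print(write_fasta(masked), end="")
```

The masked text can then be written to a file and passed to the aligner in place of the original input.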

Andreas

score 1 · Answer 2 · 2012-04-17

I have never seen anyone try to do this before. If Protein X is not present in bear, why would you even want to align it?

The residues in a multiple sequence alignment contribute information. No residues = no information. In other words, even if you did include a sequence containing only "-", it would not contribute anything meaningful to later analyses.

That said, I'd really like to know what happens if one edits the alignment manually (probably the only way to do it), inserts an "all gaps" sequence and then tries a subsequent analysis.
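The manual edit described here could also be scripted. The sketch below assumes the finished alignment has already been parsed into (header, sequence) pairs of equal length, and appends one all-gap row per missing sequence. The name add_gap_rows is hypothetical, and whether downstream tools accept an all-gap row is not something this sketch checks.

```python
def add_gap_rows(aligned, missing_headers, gap="-"):
    """Append one all-gap row per missing header to an existing alignment.

    aligned: list of (header, aligned_sequence) pairs, all the same length.
    """
    lengths = {len(seq) for _, seq in aligned}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    width = lengths.pop()
    return aligned + [(header, gap * width) for header in missing_headers]


# Alignment of the two species that do have Protein X.
aligned = [("ProteinX_human", "ABC"), ("ProteinX_cat", "A-C")]
padded = add_gap_rows(aligned, ["ProteinX_bear"])
for header, seq in padded:
    print(f">{header}\n{seq}")
```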

score 1 · Answer 3 · 2012-04-18

Thank you, Andreas and Neilfws. I appreciate your input and will give the dash-to-non-AA-letter replacement a go with MUSCLE.

I am (obviously ;) ) very new to bioinformatics, and wrote a downstream Perl script (clunky, and it would probably make many bioinformaticians cry, but it does its job) that requires all the species to be in the same order in each protein's alignment, hence the need to include information-less data in each alignment.
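For the fixed-order requirement, one option is to sort each alignment's rows against a master species list after alignment instead of relying on the order in which the aligner returns them. The sketch below is in Python, not Perl, and assumes headers follow the Protein_species pattern from the example in the question; species_of and reorder are hypothetical helper names.

```python
def species_of(header):
    """Assumed convention: the species is the text after the last underscore."""
    return header.rsplit("_", 1)[-1]


def reorder(aligned, species_order):
    """Return the alignment rows in the order given by species_order."""
    by_species = {species_of(header): (header, seq) for header, seq in aligned}
    missing = [sp for sp in species_order if sp not in by_species]
    if missing:
        raise KeyError(f"species missing from alignment: {missing}")
    return [by_species[sp] for sp in species_order]


# Rows in an arbitrary order, as an aligner might return them.
aligned = [
    ("ProteinX_cat", "A-C"),
    ("ProteinX_bear", "XXX"),
    ("ProteinX_human", "ABC"),
]
ordered = reorder(aligned, ["human", "cat", "bear"])
for header, seq in ordered:
    print(f">{header}\n{seq}")
```

Raising an error on a missing species makes it obvious when an alignment lacks a row, so the rows cannot silently shift out of position.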

Update: Just realised I forgot to mention that I am combining UCSC multiway CDS alignments with unpublished protein sequences from our lab. Sorry about not being clear.

I just wrote a Perl script that replaces the gaps with Xs prior to running MUSCLE. The subsequent MUSCLE alignments look good.

Thanks again.