Question

phylip format sequence name/header issue

0

2.3 years ago

hafiz.talhamalik ▴ 350

My sequences have names longer than 10 characters, but it looks like PHYLIP format only accepts names of 10 characters or fewer. How can I deal with this? I tried shortening the names, but I cannot do that for all sequences.

phylogenetics • 1.9k views

score 2 · Answer 1 · 2022-08-15

2

2.3 years ago

Michael 55k

I assume you have the sequences in a FASTA file. I recommend shortening the sequence names while they are still in FASTA format, then converting to PHYLIP format using the EMBOSS tool seqret.
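As a rough illustration of the renaming step, here is a minimal Python sketch. The function name `shorten_fasta_headers`, the `s` prefix, and the zero-padded counter scheme are my own assumptions for the example, not part of EMBOSS or any other tool. It replaces every FASTA header with a 10-character ID and returns the lookup table needed to restore the original names later.

```python
def shorten_fasta_headers(fasta_text, prefix="s", width=10):
    """Replace each FASTA header with a short ID; return (new_text, mapping).

    Hypothetical helper: the ID scheme (prefix + zero-padded counter) is an
    arbitrary choice made for this sketch.
    """
    mapping = {}  # short ID -> original header (without the leading '>')
    out_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            original = line[1:].strip()
            # Zero-padding keeps every ID exactly `width` characters long.
            short = f"{prefix}{len(mapping) + 1:0{width - len(prefix)}d}"
            mapping[short] = original
            out_lines.append(">" + short)
        else:
            # Sequence lines are copied through unchanged.
            out_lines.append(line)
    return "\n".join(out_lines) + "\n", mapping


fasta = (
    ">Homo_sapiens_mitochondrion_complete\nACGT\n"
    ">Pan_troglodytes_mitochondrion_complete\nACGA\n"
)
short_fasta, mapping = shorten_fasta_headers(fasta)
print(short_fasta)
print(mapping)
```

The shortened FASTA text can then be written to a file and converted to PHYLIP, and the mapping saved (for example as a two-column table) so the names can be put back after the analysis.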

I think there are many scripts written by people doing phylogenetics to cope with header length issues. Here is one I just found: https://github.com/nylander/translate_fasta_headers. It seems it can do what Mensur suggests and then restore the original sequence IDs in the Newick output.

0

Thanks, will look into it.

Reply • 2.3 years ago by hafiz.talhamalik ▴ 350

score 0 · Answer 2 · 2022-08-15

0

2.3 years ago

Mensur Dlakic ★ 28k

PHYLIP format can have an arbitrary number of characters in the header, but not all programs will tolerate it. MrBayes, for example, has no complaints when headers are longer.
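To make the difference concrete, here is a small Python sketch that writes an alignment in either layout. It assumes the commonly described conventions: a "strict" layout with a fixed 10-column name field, and a "relaxed" layout where the name is simply separated from the sequence by whitespace. The function `to_phylip` is hypothetical and written only for this example; check what your target program actually accepts.

```python
def to_phylip(records, strict=False):
    """Format (name, sequence) pairs as sequential PHYLIP text.

    Hypothetical helper for illustration. Assumes an alignment, i.e. all
    sequences have the same length.
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length (aligned)")
    # Header line: number of sequences, then alignment length.
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        if strict:
            if len(name) > 10:
                raise ValueError(f"name longer than 10 characters: {name}")
            # Name padded to a fixed 10-column field, sequence follows directly.
            lines.append(name.ljust(10) + seq)
        else:
            if any(ch.isspace() for ch in name):
                raise ValueError(f"relaxed names must not contain whitespace: {name}")
            # Name of any length, separated from the sequence by one space.
            lines.append(f"{name} {seq}")
    return "\n".join(lines) + "\n"


alignment = [("Homo_sapiens_mt", "ACGT"), ("Pan", "ACGA")]
print(to_phylip(alignment))           # relaxed layout works with long names
# to_phylip(alignment, strict=True)   # would raise ValueError for the long name
```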

"but cannot do that for all sequences"

Of course you can. Each name can be replaced with an arbitrary short string (say, d45e3r) while you perform the analysis, and afterwards these short strings can be replaced with your original names.
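The restoring step could look something like the following Python sketch. It assumes the result of the analysis is a Newick tree string and that you kept a dictionary from short strings to original names; `restore_names` and `quote_newick_label` are hypothetical helper names, and the placeholder IDs are assumed not to look like numbers, so they cannot be confused with branch lengths.

```python
import re


def quote_newick_label(name):
    """Single-quote a label if it contains characters special to Newick."""
    if re.search(r"[\s()\[\]':;,]", name):
        # Inside a quoted label, a single quote is written as two quotes.
        return "'" + name.replace("'", "''") + "'"
    return name


def restore_names(newick, mapping):
    """Replace short placeholder IDs in a Newick string with original names."""
    if not mapping:
        return newick
    # Longest IDs first, so an ID that is a prefix of another cannot win.
    ids = sorted(mapping, key=len, reverse=True)
    # Lookarounds ensure only whole labels match, not parts of other tokens.
    pattern = re.compile(
        r"(?<![\w.])(" + "|".join(map(re.escape, ids)) + r")(?![\w.])"
    )
    return pattern.sub(lambda m: quote_newick_label(mapping[m.group(1)]), newick)


mapping = {
    "d45e3r": "Homo_sapiens_mitochondrion",
    "k9x2pq": "Pan troglodytes",
}
tree = "(d45e3r:0.12,k9x2pq:0.10);"
print(restore_names(tree, mapping))
```

Doing all replacements in a single regex pass avoids the problem of sequential search-and-replace, where a restored name could accidentally contain another placeholder.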

0

MrBayes will complain, though, if the first 15 characters of the sequence names are not unique across taxa.
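A quick way to check for this before running anything is to group the names by their first N characters. The sketch below is a generic check, not anything taken from MrBayes itself; `find_prefix_collisions` is a hypothetical helper, and the default width of 15 simply follows the number mentioned above.

```python
from collections import defaultdict


def find_prefix_collisions(names, width=15):
    """Return {prefix: [names...]} for names sharing the same first `width` chars."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    # Keep only prefixes shared by more than one name.
    return {prefix: members for prefix, members in groups.items() if len(members) > 1}


names = ["Escherichia_coli_K12", "Escherichia_coli_O157", "Bacillus_subtilis"]
print(find_prefix_collisions(names))
```

An empty result means the names stay unique after truncation; otherwise the listed names need to be renamed or replaced with short unique IDs.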

Reply • 2.3 years ago by Michael 55k

score 0 · Answer 3 · 2022-08-15

If you're using PHYLIP programs with DNA or protein sequences, you can do a full phylogenetic workflow using the BIRCH system. BioLegato, the graphical user interface for BIRCH, automatically translates sequence names to short random names compatible with PHYLIP, and then restores the original names in the output. Name translation is done by uniqid.py. An example of PHYLIP output with long sequence names is shown below:

[image: PHYLIP output with long sequence names]

Further examples can be seen on the BioLegato tutorials page under the heading "Phylogeny".