Question

seqboot ERROR: sequences out of alignment (SOLVED)

17 months ago

mcsimenc ▴ 20

*Edit: The problem was incompatible sequence headers (see below).

I need to run a bootstrap analysis using FastTree with multiple protein sequence alignments. I've used seqboot for this before with multiple DNA sequence alignments, but it doesn't seem to be able to sample protein alignments. Does anyone know of a command-line program I can use to generate randomly resampled multiple protein alignments?

alignment sequence protein bootstrap phylogenetics • 1.8k views

ADD COMMENT • link updated 17 months ago by ATpoint 85k • written 17 months ago by mcsimenc ▴ 20

score 2 · Accepted Answer · 2023-06-17

17 months ago

Mensur Dlakic ★ 28k

In my hands seqboot works with protein sequences - see below. Is there an error message when you try to use it?

seqboot
seqboot: can't find input file "infile"
Please enter a new file name> bacteria-original.phy

Bootstrapping algorithm, version 3.697

Settings for this run:
  D      Sequence, Morph, Rest., Gene Freqs?  Molecular sequences
  J  Bootstrap, Jackknife, Permute, Rewrite?  Bootstrap
  %    Regular or altered sampling fraction?  regular
  B      Block size for block-bootstrapping?  1 (regular bootstrap)
  R                     How many replicates?  100
  W              Read weights of characters?  No
  C                Read categories of sites?  No
  S     Write out data sets or just weights?  Data sets
  I             Input sequences interleaved?  Yes
  0      Terminal type (IBM PC, ANSI, none)?  ANSI
  1       Print out the data at start of run  No
  2     Print indications of progress of run  Yes

  Y to accept these or type the letter for one to change
Y

Random number seed (must be odd)?
3267

completed replicate number   10
completed replicate number   20
completed replicate number   30
completed replicate number   40
completed replicate number   50
completed replicate number   60
completed replicate number   70
completed replicate number   80
completed replicate number   90
completed replicate number  100

Output written to file "outfile"

Done.

ADD COMMENT • link 17 months ago by Mensur Dlakic ★ 28k

Excellent, the error I'm seeing must be due to something else then:

ERROR: sequences out of alignment at site 103 of species 22

The sequences in the alignment FASTA file are all the same length, and this position is a gap (-).

I'm using relaxed PHYLIP format with long sequence headers. Maybe seqboot can't work with the relaxed format?

seqboot: can't find input file "infile"
Please enter a new file name> OG0000006.fa.phylip

    Bootstrapping algorithm, version 3.697

Settings for this run:
  D      Sequence, Morph, Rest., Gene Freqs?  Molecular sequences
  J  Bootstrap, Jackknife, Permute, Rewrite?  Bootstrap
  %    Regular or altered sampling fraction?  regular
  B      Block size for block-bootstrapping?  1 (regular bootstrap)
  R                     How many replicates?  100
  W              Read weights of characters?  No
  C                Read categories of sites?  No
  S     Write out data sets or just weights?  Data sets
  I             Input sequences interleaved?  Yes
  0      Terminal type (IBM PC, ANSI, none)?  ANSI
  1       Print out the data at start of run  No
  2     Print indications of progress of run  Yes

  Y to accept these or type the letter for one to change
Y

Random number seed (must be odd)?
1


ERROR: sequences out of alignment at site 103 of species 22

ADD REPLY • link 17 months ago by mcsimenc ▴ 20

ERROR: sequences out of alignment at site 103 of species 22

I don't know what in that error message made you think this is specific to protein sequences. It sounds to me like an error in alignment formatting. The message tells you exactly where to look: sequence 22 from the top, column 103. Something there is not what it should be, and my educated guess is that the gap character is something other than the - you say is in the alignment.
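One way to check the position named in the error is a short script. The following is only a sketch in Python: it assumes the alignment is available as FASTA text, and the `ALLOWED` alphabet is my own guess at a reasonable protein character set, not the set seqboot itself accepts. The function names are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Assumption: amino acid letters, common ambiguity codes, and gap symbols.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYBZXJUO-?*")


def char_at(records, species, site):
    """Return (name, character) for a 1-based species and site index."""
    name, seq = records[species - 1]
    return name, seq[site - 1]


def unexpected_chars(records, allowed=ALLOWED):
    """List (species, site, character) for every character outside `allowed`."""
    hits = []
    for i, (_name, seq) in enumerate(records, start=1):
        for j, ch in enumerate(seq.upper(), start=1):
            if ch not in allowed:
                hits.append((i, j, ch))
    return hits
```

Calling `char_at(records, 22, 103)` on the parsed alignment shows what actually sits at the reported position, and `unexpected_chars(records)` flags anything unusual elsewhere.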

In my hands seqboot works with phylip files and long sequence headers. For example this alignment works (trimmed both for width and length to save space):

 146 5040
Thiothrix_nivea                                  RSRVRSNTVKGDFFAVQPLTVDKTVLGIAQG
Caldatribacterium_saccharofermentans             RTKVTEKTVREEYFWVKPLLTGRRVHGIALN
Zixibacteria_RBG-1_sp000447245                   KERVSKSTVKEQYLLVQPLLTDKRALGLSMD
Hydrogenedens_terephthalicus                     RRRVKSKIVKEEYLLARALVTGMLCHGIAEN
Calescibacterium_nevadense                       ARKVWTEYVIDKYFLVRPLVTNKEVHGISKE
Moranbacterales_UBA11712_sp000995965             RSSVIKSKIRLNYFLISPFAVGDVALAISNA
Babeliaceae_UBA12395_sp000996275                 RSRVRKEVVLEKYFAVGLITIPRTVLGIAQE
WOR-3_SM23-42_sp001303785                        STKITTVKVRLQYFLVLPLTAGNKIQGIAKE
Firestonebacteria_D2-FULL-39-29_sp001778375      RTKVIDKTVREEYFHVKPLLLGKLLLGIAQD
Schekmanbacteria_2-02-FULL-38-14-A_sp001790855   RTRVREKTVKEDYFLIHPLLTNKTVHGIALN
UBA10199_2-12-FULL-40-28_sp001798135             KTRLTDKVVREAYFYVQPMVTGKTVHGLATD
Omnitrophales_2-12-FULL-44-17_sp001804285        RRRVFTDSVREDFFHIQPFATKRSVLGIAVK
Spirochaetota_GWE2-31-10_sp001829315             RTRAAKNVIKKDYFFVQPLVTNSNVHAIALK
Aminicenantia_JdFR-78_sp002010665                KTRVRKNIVRDEYFQIKPLLTGKRVLGIADE
Bipolaricaulales_UBA3571_sp002011425             RERVKERWVRAEYFVLTPFITGRRVLGIAGE
SAR324_Arctic96AD-7_sp002082305                  RDRVRKEVVREDYFLLQPLLTEKNVHGISLD
Thiomicrospirales_GCA-002282575_sp002281095      RSRVRSKVIREQHFDVQPLVTARNVHGLAAE
Hydrothermia_UBA1063_sp002316275                 SRRLKRRYILQEYFLAEPLITGSKVLALADN
Bdellovibrionota_C_UBA2361_sp002343185           RRRVKSATVKESYFLVQPLLSEKTVLGLAQE
Fimbriimonadales_UBA6659_sp002344135             RCRVEEKQIRKEYLLVSPFHLGKTALGIAEE
Bipolaricaulales_UBA3571_sp002375995             RERVRERWVREEYFILTPFITGRRVLGIAGE
Hydrothermia_UBA1063_sp002421425                 SRRLKKRYILQEYFLAEPLITGSKVLALATN
Sulfuricurvum_sp002633015                        RTKVATKTVLDAWLKVQPLLTKSTTHGISAS
Poribacteria_TMED15_sp002714785                  -------------------------------
Gemmatimonadetes_GCA-2718595_sp002718595         RTRVKKLSVKENFFYVQPLAGEQRVMGLAAE
Sabulitectum_sp002748705                         RSRCRSRVVKVEYFQAEPLTVKSNVMGIAED
Thermochlorobacter_sp002763895                   RTRAKSKLVRENYFQAKPLLTDKNVLAIAEN
Goldbacteria_PGYV01_sp002839855                  RTKVKTKTVKKEFFHLEPLAITKKTMGVAKN
Calescibacterium_sp002898315                     ARRVWTEFIIDKYFLVRPFLTQKTVHGIAKD
Armatimonadota_HRBIN17_sp002898575               RERVITQVVRLDYVLAAPLATNMTVHGIAVE
Eremiobacteria_Palsa-1478_sp003168375            RRRVTQVTIRDEYFLVQPLVTQRKCHGIAES
Rokubacteriales_20CM-2-70-11_sp003220315         RTRVRARSVREDYFLIQPLVTNQTVHGIARK
Sumerlaea_chitinivorans                          RTRVRRKEVKNEYFILQPLETHKTVHGIAEA
Abyssubacteria_SURF-17_sp003598055               RRRVVSKDVKEEYFLIQPFATQQNVHGIAAD
Aureabacteria_SURF-26_sp003599815                RTRVSTKTVRDDYFWVEPAITKKTVHGIAQN
Thiomargarita_sp003645255                        RSRVRQRVVLDNYFAVQPLVVDTSVHGLTVG
Aerophobales_AE-B3B_sp005223085                  RKKVKQNTVRENYFNVEPLKVGKHVLGLSKE
Dormibacterales_40CM-4-65-16_sp005882135         RNRVQERTVKLAYLLAAPLVTGKKVHGIARD
Thiomicrorhabdus_sp006222135                     -------------------------------
Chlorobaculum_tepidum                            RTRVSKKVVLEEYFKAKPLVAEDNVLAIAES
Prochlorococcus_marinus                          RTRVITKTIRDHYLYVAPLTLGANVQGAAED
Thermotoga_maritima                              RTRVREKKVKNDYFWAEPLVTNKRVLGIAQN
Aquifex_aeolicus                                 KERVLRTIVDREYVLIYPFVTGKTVYGVAQD
Rhizobium_leguminosarum_L                        KNRVKSKTVKAEYFLLQPIAAAQTVHGVSYG
Porphyromonas_gingivalis                         RARVSSKVIREQYFLVQPLKLDKNLLAIAKD
Hydrogenobacter_thermophilus                     RERVLTKIVKKDYVLIYPLLTGKTVYGIGKD
Deferribacter_desulfuricans                      RTRVKNNTVKDEYFLLQPLTVNKTVHGIAAD
Hydrogenovibrio_crunogenus_A                     RNRVRSKTVRTEYFDIQPLATEKTVHGISVD
Nitrosococcus_oceani                             RTRVSSNLIREEYFSVQPLLVEKFVHGIASI

Make sure that all sequences start at the same column, to the right of the names, just as shown above. Also, seqboot will trim the names down to 10 characters, which may create non-unique names for downstream applications.
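To see whether trimming to 10 characters would produce duplicate names, something like the following Python sketch can help. The function name is hypothetical and the width of 10 simply follows the trimming described above.

```python
from collections import defaultdict


def truncated_name_collisions(names, width=10):
    """Group names by their first `width` characters and return only the
    groups in which more than one full name shares the same prefix."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}
```

For the alignment above, the two `Bipolaricaulales_UBA3571_...` entries both trim to `Bipolarica`, so they would be reported as a collision.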

What I do is replace each of these names with random 10-character strings, run them through seqboot and all other programs, and once the reconstructions are done rename the trees back with original names. Something like this:

 146 5040
MOkJHaTCnp   RSRVRSNTVKGDFFAVQPLTVDKTVLGIAQGLSKSLTLHAFSILIQK
TdXao6QJyK   RTKVTEKTVREEYFWVKPLLTGRRVHGIALNLVKKLTQRRFSILILH
3P67muinBf   KERVSKSTVKEQYLLVQPLLTDKRALGLSMDWTKDLTQIRYSILISH
jBqW3K5pMP   RRRVKSKIVKEEYLLARALVTGMLCHGIAENLTRNLTKPRFSILIAH
HBXgYurMUL   ARKVWTEYVIDKYFLVRPLVTNKEVHGISKELIKPLTRRTWKIIIIK
YDpw8rAsES   RSSVIKSKIRLNYFLISPFAVGDVALAISNANTRDLTLRAFSILITH
JElcsrVOhM   RSRVRKEVVLEKYFAVGLITIPRTVLGIAQELTRSLTRKGFVILIVK
KU4kg70AoH   STKITTVKVRLQYFLVLPLTAGNKIQGIAKELTKSLKLARFSILICK
SNXP7iAICy   RTKVIDKTVREEYFHVKPLLLGKLLLGIAQDYLAQLTRARFSVLINK
gfYbF7UmvB   RTRVREKTVKEDYFLIHPLLTNKTVHGIALNLLRKLTQKRFSILVLH
Q0m6wg8BOD   KTRLTDKVVREAYFYVQPMVTGKTVHGLATDYPHRLTHRSYSILIIQ
UBG1KI5tA3   RRRVFTDSVREDFFHIQPFATKRSVLGIAVKLVRKVTKKSYTILIAK
Jmu5jLOtMY   RTRAAKNVIKKDYFFVQPLVTNSNVHAIALKNPKKLTQKRFSILIIH
A4r1oCLFT5   KTRVRKNIVRDEYFQIKPLLTGKRVLGIADELIKKLTLRAYHILIIA
m6JeQU5EIG   RERVKERWVRAEYFVLTPFITGRRVLGIAGEFRKDLTRRKYSILIIK
CNkW8AJRpg   RDRVRKEVVREDYFLLQPLLTEKNVHGISLDSVKKLTQPKFSILINK
c8gFKSwkxs   RSRVRSKVIREQHFDVQPLVTARNVHGLAAENLRALTLHRFSILIQH
2QDFt1wdW3   SRRLKRRYILQEYFLAEPLITGSKVLALADNLVKKLSRSSYSILIVH
VsjRvxJ5Xc   RRRVKSATVKESYFLVQPLLSEKTVLGLAQELSKKLTLHRFSVLLQH
8tJBi20n3a   RCRVEEKQIRKEYLLVSPFHLGKTALGIAEEFPRALTRKGFVILIAL
cEqIeP6Naj   RERVRERWVREEYFILTPFITGRRVLGIAGEFRKDLTRRKYSILIIK
EMvuaw7hTm   SRRLKKRYILQEYFLAEPLITGSKVLALATNLVKKLSKSSYSIIIIH
DbnXF7kIas   RTKVATKTVLDAWLKVQPLLTKSTTHGISASFIKPLTKRSFSILIAL
D38B6wqnip   ------------------------------------------ILISH
f2xLVyRcdw   RTRVKKLSVKENFFYVQPLAGEQRVMGLAAELVRALTRPRFSVLVGH
UekDIYX0RK   RSRCRSRVVKVEYFQAEPLTVKSNVMGIAEDFPKALRQPRFSILVSH
XDoqGWl5UI   RTRAKSKLVRENYFQAKPLLTDKNVLAIAENLIRKMTRRRFSILIAH
vI487LYMq0   RTKVKTKTVKKEFFHLEPLAITKKTMGVAKNTISKLMKKSFSILINK
Fwp5EOMlgD   ARRVWTEFIIDKYFLVRPFLTQKTVHGIAKDLIKLLTRRTWKILILK
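The renaming round trip could be sketched in Python roughly as follows. This is not the exact script used for the example above. It assumes names contain no whitespace, that the first line of the PHYLIP file starts with the number of taxa, and that the final trees are Newick strings; all function names are made up for illustration.

```python
import random
import re
import string


def make_id_map(names, length=10, seed=None):
    """Map each original name to a unique random alphanumeric ID."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    name_to_id, used = {}, set()
    for name in names:
        new_id = "".join(rng.choice(alphabet) for _ in range(length))
        while new_id in used:  # redraw on the (unlikely) duplicate
            new_id = "".join(rng.choice(alphabet) for _ in range(length))
        used.add(new_id)
        name_to_id[name] = new_id
    return name_to_id


def rename_phylip(text, name_to_id):
    """Replace the names on the first ntaxa non-empty lines after the header.
    Later lines (e.g. further interleaved blocks) are left untouched."""
    lines = text.splitlines()
    ntaxa = int(lines[0].split()[0])
    out, renamed = [lines[0]], 0
    for line in lines[1:]:
        if renamed < ntaxa and line.strip():
            name, seq = line.split(None, 1)
            out.append(f"{name_to_id[name]}   {seq.strip()}")
            renamed += 1
        else:
            out.append(line)
    return "\n".join(out) + "\n"


def restore_names(newick, name_to_id):
    """Swap the random IDs in a Newick string back to the original names."""
    id_to_name = {v: k for k, v in name_to_id.items()}
    if not id_to_name:
        return newick
    pattern = re.compile("|".join(map(re.escape, id_to_name)))
    return pattern.sub(lambda m: id_to_name[m.group(0)], newick)
```

Keep the mapping (for example, written to a small table file) until the trees have been renamed back, since the random IDs cannot be recovered otherwise.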

ADD REPLY • link 17 months ago by Mensur Dlakic ★ 28k

Excellent, thanks for taking the time to elaborate! The problem was with the sequence headers. When I changed them to unique 10-character strings, it worked. Best