Edit: The problem was incompatible sequence headers (see below).
I need to run a bootstrap analysis using FastTree on multiple protein sequence alignments. I've used seqboot for this before with multiple DNA sequence alignments, but it doesn't seem to be able to resample protein alignments. Does anyone know of a command-line program I can use to generate randomly resampled multiple protein alignments?
In my hands seqboot works with protein sequences - see below. Is there an error message when you try to use it?
seqboot
seqboot: can't find input file "infile"
Please enter a new file name> bacteria-original.phy
Bootstrapping algorithm, version 3.697
Settings for this run:
D Sequence, Morph, Rest., Gene Freqs? Molecular sequences
J Bootstrap, Jackknife, Permute, Rewrite? Bootstrap
% Regular or altered sampling fraction? regular
B Block size for block-bootstrapping? 1 (regular bootstrap)
R How many replicates? 100
W Read weights of characters? No
C Read categories of sites? No
S Write out data sets or just weights? Data sets
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, ANSI, none)? ANSI
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Y to accept these or type the letter for one to change
Y
Random number seed (must be odd)?
3267
completed replicate number 10
completed replicate number 20
completed replicate number 30
completed replicate number 40
completed replicate number 50
completed replicate number 60
completed replicate number 70
completed replicate number 80
completed replicate number 90
completed replicate number 100
Output written to file "outfile"
Done.
Excellent, the error I'm seeing must be due to something else then:
ERROR: sequences out of alignment at site 103 of species 22
The sequences in the alignment FASTA file are all the same length, and this position is a gap (-).
I'm using relaxed PHYLIP format with long sequence headers. Maybe seqboot can't work with the relaxed format?
seqboot: can't find input file "infile"
Please enter a new file name> OG0000006.fa.phylip
Bootstrapping algorithm, version 3.697
Settings for this run:
D Sequence, Morph, Rest., Gene Freqs? Molecular sequences
J Bootstrap, Jackknife, Permute, Rewrite? Bootstrap
% Regular or altered sampling fraction? regular
B Block size for block-bootstrapping? 1 (regular bootstrap)
R How many replicates? 100
W Read weights of characters? No
C Read categories of sites? No
S Write out data sets or just weights? Data sets
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, ANSI, none)? ANSI
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Y to accept these or type the letter for one to change
Y
Random number seed (must be odd)?
1
ERROR: sequences out of alignment at site 103 of species 22
I don't know what in that error message made you think this is specific to protein sequences. It sounds to me like an error in alignment formatting. The message tells you exactly where to look: sequence 22 from the top, column 103. Something there is not what it should be, and my educated guess is that the gap character at that position is something other than the - you say is in the alignment.
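One way to check that position directly is a small script run on the FASTA alignment. This is only a sketch: `parse_fasta`, `character_at`, `unexpected_characters` and the `ALLOWED` symbol set are hypothetical helpers written for illustration, and the script assumes seqboot counts species and sites from 1 in file order.

```python
# Symbols assumed to be acceptable in a protein alignment (an assumption;
# adjust this set to whatever your downstream programs accept).
ALLOWED = set("ABCDEFGHIKLMNPQRSTVWXYZ*?-")


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def character_at(records, species, site):
    """Return (header, character) for 1-based species and site numbers."""
    header, sequence = records[species - 1]
    return header, sequence[site - 1]


def unexpected_characters(records, allowed=ALLOWED):
    """List (species, site, character) for every symbol outside `allowed`."""
    return [
        (i, j, ch)
        for i, (_, seq) in enumerate(records, start=1)
        for j, ch in enumerate(seq, start=1)
        if ch.upper() not in allowed
    ]


# Example use (the file name is a placeholder):
# with open("alignment.fa") as handle:
#     records = parse_fasta(handle.read())
# print(character_at(records, 22, 103))
# print(unexpected_characters(records))
```

If `character_at` returns a plain `-` and `unexpected_characters` returns an empty list, the gap character is probably not the culprit and the file layout (names, spacing) is the next thing to look at.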
In my hands seqboot works with PHYLIP files and long sequence headers. For example, this alignment works (trimmed in both width and length to save space):
You have to make sure that all sequences start at the same column, to the right of the names, just as shown above. Also, seqboot will trim the names down to 10 characters, which may create non-unique names for downstream applications.
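To see in advance whether trimming would produce duplicate names, a quick check along these lines may help. It is a sketch only: `truncated_name_collisions` is a hypothetical helper, and the default width of 10 is taken from the trimming behaviour described above.

```python
from collections import defaultdict


def truncated_name_collisions(names, width=10):
    """Group names that become identical once cut to `width` characters.

    Returns a dict mapping each ambiguous truncated name to the list of
    original names that collapse onto it.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name[:width]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


# Example names (made up for illustration)
names = ["Escherichia_coli_K12", "Escherichia_coli_O157", "Bacillus_subtilis"]
print(truncated_name_collisions(names))
```

An empty result means the trimmed names stay unique; anything else lists the headers that would need renaming first.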
What I do is replace each of these names with a random 10-character string, run the files through seqboot and all other programs, and once the reconstructions are done, rename the tips in the trees back to the original names. Something like this:
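A minimal Python sketch of that renaming workflow (not the poster's original commands) might look like the following. The function names `make_name_map`, `rename_fasta` and `restore_names` are hypothetical, and the sketch uses sequential placeholders such as `S000000001` instead of random strings, simply because that guarantees uniqueness. It assumes each FASTA header is the full sequence name and that the original names are safe to write into a Newick tree unquoted.

```python
import re


def make_name_map(names, width=10):
    """Map each original name to a unique placeholder of `width` characters."""
    if len(set(names)) != len(names):
        raise ValueError("original names must be unique")
    # Sequential IDs (an assumption of this sketch, not random strings).
    return {
        name: "S" + str(index).zfill(width - 1)
        for index, name in enumerate(names, start=1)
    }


def rename_fasta(text, name_map):
    """Rewrite FASTA header lines, replacing each name with its placeholder."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(">" + name_map[line[1:].strip()])
        else:
            out.append(line)
    return "\n".join(out) + "\n"


def restore_names(newick, name_map):
    """Put the original names back into a Newick tree string."""
    reverse = {short: original for original, short in name_map.items()}
    if not reverse:
        return newick
    pattern = re.compile("|".join(re.escape(short) for short in reverse))
    return pattern.sub(lambda match: reverse[match.group(0)], newick)


# Example with made-up names
names = ["Escherichia_coli_K12_substr_MG1655", "Bacillus_subtilis_168"]
name_map = make_name_map(names)
fasta = ">Escherichia_coli_K12_substr_MG1655\nMK-L\n>Bacillus_subtilis_168\nMKAL\n"
print(rename_fasta(fasta, name_map))
print(restore_names("(S000000001:0.1,S000000002:0.2);", name_map))
```

Keep the mapping (for example, written out as a two-column table) next to the analysis files, since the trees cannot be relabelled without it.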
Excellent, thanks for taking the time to elaborate! The problem was with the sequence headers. Once I changed them to unique 10-character strings, it worked. Best
Please upvote and accept the answer.