Hey All,
I am trying to infer a phylogeny from a multiple sequence alignment using FastTree program, however the program is giving me an error when I run it over the multiple sequence alignment and I can not figure out what the error is saying (not really that informative).
My original alignment has a total of 3,307 species (sequences), and 95,678 columns. I was able to infer a phylogeny successfully using FastTree over this alignment in about 16 hours. However upon checking my alignment I noticed it is highly fragmented and of very poor quality because I do not do any post processing after I concatenate my individual alignments obtained from MUSCLE.
I tried to post-process the alignment file using Gblocks however I think the default parameters are too conservative and it is resulting in an empty alignment file. I made my own python script to improve the quality of this alignment, and this is where the problem starts I think:
Basically what I do in my script is I take in the MSA and check 3 things:
- I remove all the columns that have gaps in more than half of the sequences.
- I remove all the columns that dont have a consensus amino acid in at least 25% of the sequences
- finally I remove all the rows (i.e. the species) that have gaps in more than half of their total sequence length
After these filtering criteria the script outputs a new multiple sequence alignment with the suffix '_clean_columns_clean_rows.fasta'. The new multiple sequence alignment now has 3,294 rows (i.e. species) and 30,634 columns.
Now I issue a FastTree command to infer a phylogeny over this new alignment with the following paramters:
FastTree -lg -pseudo -spr 4 -mlacc 2 -slownni supermat_clean_columns_clean_rows.fasta > supermat_clean_columns_clean_rows_FastTree.tree
I use the -pseudo -spr 4 -mlacc 2 -slownni
parameters on purpose to perform a rigorous search and to get a tree as accurate as possible using the heuristic search. The program after running sometime it starts printing values for Eigen vectors such as this:
eigeninv = c( 1.0797265, -0.1849245, -2.2736955, -1.3477029, ....
Followed by this error which I cant understand what it means:
# Transform into matrices and compute un-rotated vectors for profiles A and B
codeFreq = matrix(codeFreq,nrow=20);
eigeninv = matrix(eigeninv,nrow=20);
unrotA = stat * (eigeninv %*% fA)
unrotB = stat * (eigeninv %*% fB)
# End of R block
FastTree: /usr2/people/mprice/Genomics/html/fasttree/FastTree.c:5383: PairLogLk: Assertion `lkAB > 0' failed.
Aborted
To reproduce this result I am sharing links for both the original alignment and the post-processed alignment using my python script:
original alignment file "supermat.fasta" ( ~300 MB): https://iu.box.com/s/xx0mbvgyh50e1isw7nh8lylrejg4pe3x
post-processed alignment file "supermat_clean_columns_clean_rows.fasta" (96 MB) : https://iu.box.com/s/b8i97a39bp8pnna0hhltoa0xm7xqwqz3
I am also pasting the python script that I wrote to perform these post processing here Any help is greatly appreciated (many thanks for your time!!):
from Bio import SeqIO
import numpy as np
import copy
from collections import Counter
msa_in_f = 'allBinsWithArchaea_FastTree_2/supermat.fasta'
prefix_f, suffix_f = msa_in_f.rsplit('.')[0], msa_in_f.rsplit('.')[1]
msa_seqs = list(SeqIO.parse(msa_in_f, 'fasta'))
headers = [str(item.id) for item in msa_seqs]
seqs = [str(item.seq) for item in msa_seqs]
np_headers = np.array(headers)
concat_seqs = ''.join(seqs)
matrix = np.array(list(concat_seqs), dtype=str).reshape(len(headers), len(seqs[0]))
clean_mat = copy.deepcopy(matrix)
n_rows, n_cols = matrix.shape
#remove columns having more than 50% gaps
remove_cols = list()
for i in range(n_cols):
col = str(''.join(matrix[:, i]))
if col.count('-') > round(n_rows/2):
remove_cols.append(i)
clean_mat = np.delete(matrix, remove_cols, 1)
#remove columns where the consensus amino acid is less than 25%
remove_cols = list()
for i in range(clean_mat.shape[1]):
col = str(''.join(clean_mat[:, i]))
char_counter = Counter(col)
del char_counter['-']
if max(char_counter.values())/n_rows < 0.25:
remove_cols.append(i)
clean_mat = np.delete(clean_mat, remove_cols, 1)
with open(prefix_f + '_clean_columns.'+ suffix_f, 'w') as out_f:
for i in range(clean_mat.shape[0]):
row = str(''.join(clean_mat[i, :]))
out_f.write('>'+str(headers[i])+'\n')
out_f.write(row+'\n')
#remove rows where the the alignment is more than 50% gaps
new_n_rows, new_n_cols = clean_mat.shape
remove_rows = list()
for i in range(clean_mat.shape[0]):
row = str(''.join(clean_mat[i, :]))
if row.count('-') > new_n_cols/2:
remove_rows.append(i)
clean_mat_reduced_rows = np.delete(clean_mat, remove_rows, 0)
np_headers_reduced_rows = np.delete(np_headers, remove_rows)
with open(prefix_f + '_clean_columns_clean_rows.'+ suffix_f, 'w') as out_f:
for i in range(clean_mat_reduced_rows.shape[0]):
row = str(''.join(clean_mat_reduced_rows[i, :]))
out_f.write('>'+str(np_headers_reduced_rows[i])+'\n')
out_f.write(row+'\n')
print('original input matrix shape,', matrix.shape)
print('clean columns matrix shape', clean_mat.shape)
print('clean columns, clean rows matrix shape', clean_mat_reduced_rows.shape)
with open(prefix_f + 'reduction_stats.txt', 'w') as out_f:
out_f.write('original input matrix shape,' + str( matrix.shape)+'\n')
out_f.write('clean columns matrix shape' + str(clean_mat.shape)+'\n')
out_f.write('clean columns, clean rows matrix shape' + str( clean_mat_reduced_rows.shape)+'\n')
It is very likely that the error is caused by your alignment manipulations. I suggest you try trimAl which is a tool for alignment trimming. It can do at least 2 out of 3 things you want.
thank you for your suggestion. I don't understand it however since after manipulation every row has the same number of columns and hence it is a standard MSA file, what is the actual reason for it breaking the program?
UPDATES! : I get the same error after using trimAL software to post-process the alignment
Separately from my previous answer, I don't understand the logic of doing 3000+ species unless you are trying to create some kind of authoritative resource. Removing redundant sequences at 80-90% identity will almost certainly cut it down to no more than 20% of your starting number. Does it really matter to you whether you have 20 Staphylococcus species on the same branch as opposed to 1-2 representative sequences? Not many people will look at the tree with 3000+ species without collapsing it because it is difficult to do so.
Along the same lines, it is unlikely that you will get better tree reconstruction from 90K+ poorly aligned columns than you would from 10-20 K well-aligned columns. I think you will make your life easier if you inspect individual alignments and remove poor ones before concatenation.
yes I want to keep species as much as possible I am interested in strain level differences as well that is the main reason. I am working on reducing the 90k column alignment, and using those 3 criteria it's reduced to 30k. But I will try and refine the alignment using that software you recommended.
I just doesnt really make much sense to me that FastTree was ran successfully on the original alignment of dimensions 3k x 90k and breaks on the smaller sized matrix.