Hi, I am trying to run MUSCLE on my CentOS 6.7 system (Xeon E5-2680 v3, 128 GB RAM), and I am running into a segfault with a "large" number of sequences:
./muscle -in 1_seq.fa -out 1_seq.afa -maxiters 1 -diags1 -sv
MUSCLE v3.8.31 by Robert C. Edgar
http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.
pos_ep_1_seq 276188 seqs, max length 51, avg length 51
Segmentation fault (core dumped)
Whereas for a smaller number of sequences it works fine:
18_seq 8417 seqs, max length 51, avg length 51
00:00:05 586 MB(-53%) Iter 1 3.30% K-mer dist pass 1
I also tried compiling MUSCLE 3.8 from source, but I got the same segfault with the large input. I have also set ulimit -s unlimited.
I recompiled with -g, and I got:
pos_ep_1_seq 276188 seqs, lengths min 51, max 51, avg 51
Program received signal SIGSEGV, Segmentation fault.
0x000000000040f93c in DistFunc::SetDist (this=0x7fffffffdfb0, uIndex1=11825, uIndex2=11825, dDist=0) at distfunc.cpp:54
54          m_Dists[VectorIndex(uIndex1, uIndex2)] = dDist;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64 libstdc++-4.4.7-11.el6.x86_64
(gdb) where
#0  0x000000000040f93c in DistFunc::SetDist (this=0x7fffffffdfb0, uIndex1=11825, uIndex2=11825, dDist=0) at distfunc.cpp:54
#1  0x000000000041802c in DistKmer4_6 (v=..., DF=...) at fastdistnuc.cpp:109
#2  0x00000000004156a9 in DistUnaligned (v=..., DistMethod=DISTANCE_Kmer4_6, DF=...) at fastdist.cpp:30
#3  0x0000000000415378 in TreeFromSeqVect (v=..., tree=..., Cluster=CLUSTER_UPGMB, Distance=DISTANCE_Kmer4_6, Root=ROOT_Pseudo, SaveFileName=0x0) at fastclust.cpp:69
#4  0x000000000041070b in DoMuscle () at domuscle.cpp:159
#5  0x0000000000410c30 in Run () at domuscle.cpp:281
#6  0x00000000004246a4 in main (argc=9, argv=0x7fffffffe318) at main.cpp:60
So, within fastdistnuc.cpp, the loop crashes at line 109 when uSeq1=11825:
// Initialize distance matrix to zero
for (unsigned uSeq1 = 0; uSeq1 < uSeqCount; ++uSeq1)
    {
    DF.SetDist(uSeq1, uSeq1, 0);                    // line 109
    for (unsigned uSeq2 = 0; uSeq2 < uSeq1; ++uSeq2)
        DF.SetDist(uSeq1, uSeq2, 0);
    }
In distfunc.cpp, line 54 is the statement that faults:
void DistFunc::SetDist(unsigned uIndex1, unsigned uIndex2, float dDist)
    {
    m_Dists[VectorIndex(uIndex1, uIndex2)] = dDist; // line 54
    m_Dists[VectorIndex(uIndex2, uIndex1)] = dDist;
    }
Can you hint at what might be the reason for the segfault at this line? To me it looks like an illegal memory access, and the return value of VectorIndex(uIndex1, uIndex2) seems to be the culprit (out of bounds!).
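To make that suspicion concrete: with uSeqCount = 276188, a pairwise distance matrix needs N*(N+1)/2 ≈ 3.8e10 entries (about double that if it is stored as a full square), which is far more than 2^32. So if VectorIndex and the allocation size inside DistFunc are computed in 32-bit unsigned arithmetic (I have not checked distfunc.h, so this is only my guess), those computations wrap around, the allocation and the later indices become inconsistent, and some writes would land outside m_Dists. A toy calculation, not MUSCLE code:

#include <cstdio>

int main()
    {
    const unsigned uSeqCount = 276188;

    // Entries needed for a lower-triangular distance matrix, computed
    // correctly in 64 bits vs. wrapped in 32-bit unsigned arithmetic.
    unsigned long long uNeed64 =
      (unsigned long long) uSeqCount*(uSeqCount + 1)/2;
    unsigned uNeed32 = uSeqCount*(uSeqCount + 1)/2;   // wraps modulo 2^32

    printf("entries needed (64-bit math) : %llu (~%.0f GB as floats)\n",
      uNeed64, uNeed64*4.0/(1024.0*1024.0*1024.0));
    printf("entries if computed in 32-bit: %u (~%.1f GB as floats)\n",
      uNeed32, uNeed32*4.0/(1024.0*1024.0*1024.0));
    return 0;
    }

For what it's worth, N*(N+1)/2 already exceeds 2^32 once N goes above roughly 92,000, and even without any wraparound the triangular matrix would need ~140 GB of floats, which is more than this machine's 128 GB, so that may also be part of the problem.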
Any help/hint will be very useful. Eagerly awaiting your replies.
Not answering your question but have you looked at this?
Memory when the segfault occurs: total 125, used 30, free 95 (units in GB).
Is there a reason you are trying to align what appears to be 276,188 sequences? Is it possible to reduce the size of the dataset as indicated in the link above? Even if this MSA were to work as is, it would take forever with that many sequences.
There is a reason for the 276,188 sequences (I can explain a bit later). Meanwhile, I can see that there is sufficient system memory available when the code crashes. Can you possibly guess any other reasons for the segfault here (accessing illegal memory)?
Time is not the main concern right now, as I can parallelize the loops (to reduce run time) once I get past this error!
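(To illustrate what I mean by parallelizing: each pairwise distance is independent of the others, so the outer loop of the distance stage could in principle be split across threads, e.g. with OpenMP. The sketch below is a toy stand-in and not MUSCLE's code; kmer_dist() is a hypothetical placeholder and the buffer is indexed with 64-bit size_t.)

#include <cstddef>
#include <vector>

// Hypothetical placeholder for a pairwise k-mer distance; the real
// computation lives in fastdistnuc.cpp.
static float kmer_dist(unsigned uSeq1, unsigned uSeq2)
    {
    return (uSeq1 == uSeq2) ? 0.0f : 1.0f;   // dummy value for the sketch
    }

// Fill a lower-triangular distance buffer. Every (uSeq1, uSeq2) pair
// writes a distinct cell, so the iterations do not race.
static void FillDistMatrix(std::vector<float> &Dists, unsigned uSeqCount)
    {
    #pragma omp parallel for schedule(dynamic)
    for (unsigned uSeq1 = 0; uSeq1 < uSeqCount; ++uSeq1)
        for (unsigned uSeq2 = 0; uSeq2 <= uSeq1; ++uSeq2)
            {
            // size_t is 64-bit on x86_64, so the index cannot wrap here
            size_t uIndex = (size_t) uSeq1*(uSeq1 + 1)/2 + uSeq2;
            Dists[uIndex] = kmer_dist(uSeq1, uSeq2);
            }
    }

int main()
    {
    const unsigned uSeqCount = 10000;   // small test size
    std::vector<float> Dists((size_t) uSeqCount*(uSeqCount + 1)/2);
    FillDistMatrix(Dists, uSeqCount);
    return 0;
    }

Compile with -fopenmp; without it the pragma is simply ignored and the loop runs serially.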
I assume the error is reproducible and happens at the same step (not a corrupt sequence somewhere in your file?)
As @gearoid suggested below, you may want to try a different program, since you want to keep your dataset intact for this analysis.
With regards to it taking forever, it's a lot of sequences, but they're quite short, so it's probably not going to be as bad as you'd expect. To get an idea, I just timed Clustal Omega aligning 194775 amino acid sequences of average length 40 (WD40 repeat from Pfam) and it took about 2.5 hours (real time, user was only a few minutes longer).
Obviously it's 80k more sequences, and they're 25% longer, but I'd say you're talking about something on the order of a day, not weeks or anything.
But you're right about reducing the size of the input, if that's possible. I'd guess that with ~275k sequences of length 51 you're going to have lots and lots of sequences that are 100% identical.
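If that's the case, collapsing exact duplicates before aligning (keeping a mapping so you can re-expand afterwards, which I leave out here) would shrink the input considerably. A minimal standalone sketch, nothing MUSCLE-specific, with placeholder filenames:

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

// Write the record only if its sequence has not been seen before.
static void Flush(const std::string &Header, const std::string &Seq,
                  std::unordered_set<std::string> &Seen, std::ofstream &Out)
    {
    if (Header.empty())
        return;
    if (Seen.insert(Seq).second)          // true only for the first occurrence
        Out << Header << '\n' << Seq << '\n';
    }

int main(int argc, char **argv)
    {
    if (argc != 3)
        {
        std::cerr << "usage: dedup in.fa out.fa\n";
        return 1;
        }
    std::ifstream In(argv[1]);
    std::ofstream Out(argv[2]);
    std::unordered_set<std::string> Seen;

    std::string Line, Header, Seq;
    while (std::getline(In, Line))
        {
        if (!Line.empty() && Line[0] == '>')
            {
            Flush(Header, Seq, Seen, Out);    // finish the previous record
            Header = Line;
            Seq.clear();
            }
        else
            Seq += Line;                      // join wrapped sequence lines
        }
    Flush(Header, Seq, Seen, Out);            // last record
    return 0;
    }

Something like g++ -O2 -std=c++0x dedup.cpp -o dedup, run it on the FASTA, then feed the deduplicated file to muscle.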