Hi, I am trying to run MUSCLE on my CentOS 6.7 system (Xeon E5-2680 v3, 128 GB RAM), and I am running into a segfault with a "large" number of sequences:
./muscle -in 1_seq.fa -out 1_seq.afa -maxiters 1 -diags1 -sv
MUSCLE v3.8.31 by Robert C. Edgar
http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.
pos_ep_1_seq 276188 seqs, max length 51, avg length 51
Segmentation fault (core dumped)
Whereas for a smaller number of sequences it works fine:
18_seq 8417 seqs, max length 51, avg length 51
00:00:05 586 MB(-53%) Iter 1 3.30% K-mer dist pass 1
I also tried compiling MUSCLE 3.8 from source, but I got the same segfault with the large input. I have also set ulimit -s unlimited.
I recompiled with -g, and I got:
pos_ep_1_seq 276188 seqs, lengths min 51, max 51, avg 51
Program received signal SIGSEGV, Segmentation fault.
0x000000000040f93c in DistFunc::SetDist (this=0x7fffffffdfb0, uIndex1=11825, uIndex2=11825, dDist=0) at distfunc.cpp:54
54          m_Dists[VectorIndex(uIndex1, uIndex2)] = dDist;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64 libstdc++-4.4.7-11.el6.x86_64
(gdb) where
#0  0x000000000040f93c in DistFunc::SetDist (this=0x7fffffffdfb0, uIndex1=11825, uIndex2=11825, dDist=0) at distfunc.cpp:54
#1  0x000000000041802c in DistKmer4_6 (v=..., DF=...) at fastdistnuc.cpp:109
#2  0x00000000004156a9 in DistUnaligned (v=..., DistMethod=DISTANCE_Kmer4_6, DF=...) at fastdist.cpp:30
#3  0x0000000000415378 in TreeFromSeqVect (v=..., tree=..., Cluster=CLUSTER_UPGMB, Distance=DISTANCE_Kmer4_6, Root=ROOT_Pseudo, SaveFileName=0x0) at fastclust.cpp:69
#4  0x000000000041070b in DoMuscle () at domuscle.cpp:159
#5  0x0000000000410c30 in Run () at domuscle.cpp:281
#6  0x00000000004246a4 in main (argc=9, argv=0x7fffffffe318) at main.cpp:60
So, within fastdistnuc.cpp, the loop crashes at line 109 when uSeq1=11825:
// Initialize distance matrix to zero
for (unsigned uSeq1 = 0; uSeq1 < uSeqCount; ++uSeq1)
    {
    DF.SetDist(uSeq1, uSeq1, 0);                    // line 109
    for (unsigned uSeq2 = 0; uSeq2 < uSeq1; ++uSeq2)
        DF.SetDist(uSeq1, uSeq2, 0);
    }
In distfunc.cpp, line 54 is the statement that faults:
void DistFunc::SetDist(unsigned uIndex1, unsigned uIndex2, float dDist)
    {
    m_Dists[VectorIndex(uIndex1, uIndex2)] = dDist; // line 54
    m_Dists[VectorIndex(uIndex2, uIndex1)] = dDist;
    }
Can you hint at what might be the reason for the segfault at this line? To me it looks like an illegal memory access, and the return value of VectorIndex(uIndex1, uIndex2) seems to be the culprit (out of bounds!).
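To make that suspicion concrete: with uSeqCount = 276188, a pairwise distance matrix needs N*(N+1)/2 ≈ 3.8e10 entries (about double that if it is stored as a full square), which is far more than 2^32. So if VectorIndex and the allocation size inside DistFunc are computed in 32-bit unsigned arithmetic (I have not checked distfunc.h, so this is only my guess), those computations wrap around, the allocation and the later indices become inconsistent, and some writes would land outside m_Dists. A toy calculation, not MUSCLE code:

#include <cstdio>

int main()
    {
    const unsigned uSeqCount = 276188;

    // Entries needed for a lower-triangular distance matrix, computed
    // correctly in 64 bits vs. wrapped in 32-bit unsigned arithmetic.
    unsigned long long uNeed64 =
      (unsigned long long) uSeqCount*(uSeqCount + 1)/2;
    unsigned uNeed32 = uSeqCount*(uSeqCount + 1)/2;   // wraps modulo 2^32

    printf("entries needed (64-bit math) : %llu (~%.0f GB as floats)\n",
      uNeed64, uNeed64*4.0/(1024.0*1024.0*1024.0));
    printf("entries if computed in 32-bit: %u (~%.1f GB as floats)\n",
      uNeed32, uNeed32*4.0/(1024.0*1024.0*1024.0));
    return 0;
    }

For what it's worth, N*(N+1)/2 already exceeds 2^32 once N goes above roughly 92,000, and even without any wraparound the triangular matrix would need ~140 GB of floats, which is more than this machine's 128 GB, so that may also be part of the problem.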
Any help/hint will be very useful. Eagerly awaiting your replies.
Not answering your question but have you looked at this?
Memory when the segfault occurs: total 125, used 30, free 95 (units in GB).
Is there a reason you are trying to align what appears to be 276,188 sequences? Is it possible to reduce the size of the dataset as indicated in the link above? Even if this MSA were to work as is, it would take forever with that many sequences.
There is a reason for the 276,188 sequences (I can explain a bit later). Meanwhile, I can see that there is sufficient system memory available when the code crashes. Can you possibly guess any other reasons for the segfault here (accessing illegal memory)?
Time is not the main concern right now, as I can parallelize the loops (to reduce run time) once I get past this error!
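(To illustrate what I mean by parallelizing: each pairwise distance is independent of the others, so the outer loop of the distance stage could in principle be split across threads, e.g. with OpenMP. The sketch below is a toy stand-in and not MUSCLE's code; kmer_dist() is a hypothetical placeholder and the buffer is indexed with 64-bit size_t.)

#include <cstddef>
#include <vector>

// Hypothetical placeholder for a pairwise k-mer distance; the real
// computation lives in fastdistnuc.cpp.
static float kmer_dist(unsigned uSeq1, unsigned uSeq2)
    {
    return (uSeq1 == uSeq2) ? 0.0f : 1.0f;   // dummy value for the sketch
    }

// Fill a lower-triangular distance buffer. Every (uSeq1, uSeq2) pair
// writes a distinct cell, so the iterations do not race.
static void FillDistMatrix(std::vector<float> &Dists, unsigned uSeqCount)
    {
    #pragma omp parallel for schedule(dynamic)
    for (unsigned uSeq1 = 0; uSeq1 < uSeqCount; ++uSeq1)
        for (unsigned uSeq2 = 0; uSeq2 <= uSeq1; ++uSeq2)
            {
            // size_t is 64-bit on x86_64, so the index cannot wrap here
            size_t uIndex = (size_t) uSeq1*(uSeq1 + 1)/2 + uSeq2;
            Dists[uIndex] = kmer_dist(uSeq1, uSeq2);
            }
    }

int main()
    {
    const unsigned uSeqCount = 10000;   // small test size
    std::vector<float> Dists((size_t) uSeqCount*(uSeqCount + 1)/2);
    FillDistMatrix(Dists, uSeqCount);
    return 0;
    }

Compile with -fopenmp; without it the pragma is simply ignored and the loop runs serially.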
I assume the error is reproducible and happens at the same step (not a corrupt sequence somewhere in your file?)
As @gearoid suggested below, you may want to try a different program, since you want to keep your dataset intact for this analysis.
With regards to it taking forever, it's a lot of sequences, but they're quite short, so it's probably not going to be as bad as you'd expect. To get an idea, I just timed Clustal Omega aligning 194775 amino acid sequences of average length 40 (WD40 repeat from Pfam) and it took about 2.5 hours (real time, user was only a few minutes longer).
Obviously it's 80k more sequences, and they're 25% longer, but I'd say you're talking about something on the order of a day, not weeks or anything.
But you're right about reducing the size of the input, if that's possible. I'd guess that with ~275k sequences of length 51 you're going to have lots and lots of sequences that are 100% identical.
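If that's the case, collapsing exact duplicates before aligning (keeping a mapping so you can re-expand afterwards, which I leave out here) would shrink the input considerably. A minimal standalone sketch, nothing MUSCLE-specific, with placeholder filenames:

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

// Write the record only if its sequence has not been seen before.
static void Flush(const std::string &Header, const std::string &Seq,
                  std::unordered_set<std::string> &Seen, std::ofstream &Out)
    {
    if (Header.empty())
        return;
    if (Seen.insert(Seq).second)          // true only for the first occurrence
        Out << Header << '\n' << Seq << '\n';
    }

int main(int argc, char **argv)
    {
    if (argc != 3)
        {
        std::cerr << "usage: dedup in.fa out.fa\n";
        return 1;
        }
    std::ifstream In(argv[1]);
    std::ofstream Out(argv[2]);
    std::unordered_set<std::string> Seen;

    std::string Line, Header, Seq;
    while (std::getline(In, Line))
        {
        if (!Line.empty() && Line[0] == '>')
            {
            Flush(Header, Seq, Seen, Out);    // finish the previous record
            Header = Line;
            Seq.clear();
            }
        else
            Seq += Line;                      // join wrapped sequence lines
        }
    Flush(Header, Seq, Seen, Out);            // last record
    return 0;
    }

Something like g++ -O2 -std=c++0x dedup.cpp -o dedup, run it on the FASTA, then feed the deduplicated file to muscle.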