Hello,
I've been racking my brain these past few weeks to figure out why Clustal Omega gives me a segmentation fault and a core dump when I run a small number (20-30 DNA sequences) of large plasmids (>=50,000 bp) through it. As far as I can tell, I am not exceeding Clustal's file size limit, which leads me to believe the issue comes from running very long sequences through Clustal rather than a huge number of relatively small ones. Has anyone else tried aligning very long sequences with Clustal Omega and gotten this same error? I've been giving it plenty of memory, hundreds of gigabytes' worth, so I don't believe memory is the problem. I'm just at a loss!
Indeed Clustal Omega is not designed for very long sequences.
What are you trying to do?
I'm trying to align these very, very long sequences so I can see where they align and how similar they are to one another. At least this confirms my hunch that Clustal is just not designed for very long sequences. Do you know of any alternative multiple sequence alignment programs I could use to align these very, very long sequences of mine?
As above, CLUSTAL is only really designed to align a few thousand base pairs, and even then is better suited to protein alignments. As such its use cases are more typically for a large number of single gene alignments.
Multiple sequence alignment of a large number of larger sequences remains something of an unsolved problem in bioinformatics.
You may have some luck if you read here: MSA of very long sequences?, but I'd hazard a guess and say you simply have too much data. It will take forever and will probably be a rubbish alignment anyway. You will likely need to find a new approach.
That makes sense. I'm looking for a few highly conserved genes or nucleotide regions shared among many, if not all, conjugative plasmids. My approach to finding these regions of conservation or homology has been to run an MSA and see where the sequences line up. I was hoping to end up with a series of nucleotide regions, in a gene or otherwise, that CRISPR-Cas9 could cleave. If MSA is unwieldy for sequences like these, may I ask if you have any suggestions for how to go about answering this computationally?
hey there,
A possible solution would be progressiveMauve, to look for regions conserved between your plasmids.
edit: 25-30 sequences might still be too many for Mauve. If Mauve does not work, try removing 'redundant' plasmids by calculating the Average Nucleotide Identity.
Mauve is not a bad idea, but I suspect it may still be slow (and it is not a hugely beginner-friendly command-line tool).
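To make the 'remove redundant plasmids' step concrete, here is a minimal sketch of greedy dereplication. It assumes you already have pairwise ANI values from whatever tool you prefer. The function name `dereplicate`, the 0.99 cutoff, and the plasmid names are made up for illustration, so adjust them to your data.

```python
from typing import Dict, List, Tuple


def dereplicate(names: List[str],
                identity: Dict[Tuple[str, str], float],
                threshold: float = 0.99) -> List[str]:
    """Greedily keep a plasmid only if its identity to every plasmid
    kept so far is below `threshold` (hypothetical cutoff)."""

    def lookup(a: str, b: str) -> float:
        # The table may store a pair in either order; a missing pair is
        # treated as unrelated (identity 0.0), which is an assumption.
        return identity.get((a, b), identity.get((b, a), 0.0))

    kept: List[str] = []
    for name in names:
        if all(lookup(name, k) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy ANI table with invented values: pA and pB are near-identical.
ani = {("pA", "pB"): 0.998, ("pA", "pC"): 0.85, ("pB", "pC"): 0.86}
representatives = dereplicate(["pA", "pB", "pC"], ani)
print(representatives)  # ['pA', 'pC']
```

The result depends on input order (the first member of each redundant group is kept), so you may want to sort by length or annotation quality first.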
You would probably be better served looking at something like kmer distances/sketching such as here. This 'abstracts' the problem out somewhat, but should still answer your overarching question. If you need to drill down on specific identity differences you can potentially use this to subset some of the data etc.
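As a rough illustration of the k-mer distance idea, here is a small sketch that compares plasmids by the Jaccard distance between their k-mer sets. It uses exact k-mer sets rather than the subsampled sketches that dedicated tools use, and the function names and k=21 default are my own choices, not taken from any particular program.

```python
from itertools import combinations

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k=21):
    """Set of canonical k-mers (the lexicographically smaller of a k-mer
    and its reverse complement), so strand orientation does not matter."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers with N or other codes
            rc = kmer.translate(_COMPLEMENT)[::-1]
            kmers.add(min(kmer, rc))
    return kmers


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; 0.0 means identical k-mer content."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(seqs, k=21):
    """Map each unordered pair of sequence names to its Jaccard distance."""
    sets = {name: canonical_kmers(s, k) for name, s in seqs.items()}
    return {(x, y): jaccard_distance(sets[x], sets[y])
            for x, y in combinations(sets, 2)}


# Toy sequences, far shorter than real plasmids, with a small k.
toy = {"a": "ACGTACGTAC", "b": "ACGTACGTAC", "c": "GGGGGGGGGG"}
distances = pairwise_distances(toy, k=4)
print(distances)
```

For 20-30 plasmids of ~50 kb each, exact sets like this should fit in memory comfortably. The resulting distance table can then be clustered or used to pick a subset for closer inspection.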
This thread is a decent explainer if this is new to you.
I would also just ask - are you certain you need multiple sequence alignment? If doing multiple pairwise alignment is sufficient, this gives you a lot more options. If you are trying to align many plasmids from disparate sources - which it sounds from your comment like you might be - I would doubt you'll really get a good true MSA anyway.
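If what you ultimately want is 'regions present in every plasmid', one alignment-free shortcut is to take one plasmid as a reference and report the stretches whose k-mers occur exactly in all the others, on either strand. To be clear, this is not an alignment. It only finds exact matches, and the function names and k=20 default below are assumptions for the sake of the sketch.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmers_both_strands(seq, k):
    """All k-mers of a sequence and of its reverse complement."""
    seq = seq.upper()
    rc = seq.translate(_COMPLEMENT)[::-1]
    found = set()
    for strand in (seq, rc):
        for i in range(len(strand) - k + 1):
            found.add(strand[i:i + k])
    return found


def conserved_regions(reference, others, k=20):
    """Half-open (start, end) intervals on `reference` built from k-mers
    that occur exactly in every sequence in `others`."""
    other_sets = [kmers_both_strands(s, k) for s in others]
    ref = reference.upper()
    regions = []
    last_hit = None
    for i in range(len(ref) - k + 1):
        kmer = ref[i:i + k]
        if all(kmer in s for s in other_sets):
            if last_hit is not None and i == last_hit + 1:
                # Consecutive shared k-mers are merged into one region.
                # Heuristic: repeats could make this over-optimistic.
                regions[-1] = (regions[-1][0], i + k)
            else:
                regions.append((i, i + k))
            last_hit = i
    return regions


# Toy example: GATTACA is shared, once on the reverse strand.
ref = "TTTTGATTACACCCC"
others = ["AGAGGATTACAGAGA", "GCGCTGTAATCGCGC"]
hits = conserved_regions(ref, others, k=4)
print(hits, [ref[s:e] for s, e in hits])  # [(4, 11)] ['GATTACA']
```

Anything this reports would still need checking with a proper pairwise alignment, and it will miss regions that are conserved but carry even a single mismatch within each k-mer window.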