I have these sequences:
>a
GCATCCCGATGGTCACGGTCGCCACCAGCCTCGCCAACGACGGCA
>b
GTCCGGCCATGCCCAGCAGCACGGCAAGGGTCAGGCCCACAATCGGA
>c
GTGCAAACGGATGCCACGCGCGACGCACGCTTCGCCGCGGCGTTTAGC
>d
GTAGGCCGGGCCGAAGGCGCGACGCTTGTATGCGGAGGCGAACTGCTTGCGA
>e
GCTGCTCCAGTAACCGCCCTATGGTCGGGGTCACATTCTGGCCCTGGCTCCA
I want to make a multiple alignment. That's what I get from Clustal Omega:
------GCATCCCGATGGTCACGGTCGCCACCAGCCTCGCCAACGACGGCA------------
GTCCGGCCATGCCCAGCAGCACGGCAAGGGTCAGGCCCACAATCGGA----------------
--------------GTGCAAACGGATGCCACGCGCGACGCACGCTTCGCCGCGGCGTTTAGC-
------GTAGGCCGGGCCGAA-----GGCGCGACGCTTGTATGCGGAGGCGAACTGCTTGCGA
------GCTGCTCCAGTAACCGCCCTATGGTCGGGGTCACATTCTGGCCCT----GGCTCCA-
That's what I get from T-Coffee:
--------------GC------------ATCCCGATGGTCACGGTCGCCACCAGCCTCGCCAACGACGGCA
G-TCCGG-------CCATGCCCAGCAGCACGGCAAGGGTCAGGCCCACAATCGGA----------------
GTGCAAACGGATGCCACGCGCGACGCACGCTTCGCCGCGGCGTTTAGC-----------------------
G-----TAGGCCGGGCCGAAGGCGCGACGCTTGTATGCGGAGGCGAACTGCTTGCGA--------------
--------------GCTGCTCCAGTAACCGCCCTATGGTCGGGGTCACATTCTGGCCCTGGCTCCA-----
That's what I get from an algorithm I designed and developed in one day while watching tv (also tested with longer sequences and larger datasets, which Clustal Omega and T-Coffee either refuse as too long or crash on):
G--CAT-C-CC-G--ATG-GTC--ACGGTC-GC---CAC--CA-GCCTC-G-CCAA--C-GA-C-G----GC-A
G----T-C-CG-GCCAT--G-C--CCAG-CAG----CACGGCAAGGGTCAGGCCCA--C-AATC-G----G--A
GTGCAA-A-CG-G--AT--GCC--AC-G-C-GC---GAC-GCACGCTTC-G-CC-G--C-GG-C-GTTTAGC--
G--TAG-G-CC-G----G-GCC-GAAGG-C-GC---GAC-GCTTGTAT--G-CGGAGGC-GAACTG-CTTGCGA
G--C-TGCTCCAG-TAACCGCCCTATGGTC-GGGGTCA---CA-TTCTG-G-CC----CTGG-C-T----CC-A
Is it only me who finds the last alignment the most accurate? Am I missing something?
From what I gather from some comments, it looks like the problem here is that my algorithm and the other ones are designed for different goals. Mine is (actually) conceived to align non-conserved sequences, while the other two are mostly designed to align conserved regions, in which INDELs are less frequent than SNPs. Because of those different goals, my impression of which alignment is best seems to be biased.
Likely yes, given that alignment of unconserved regions is not widely explored.
Yes. People are usually (by far) more interested in genes and proteins than in DNA in general.
Therefore, it looks like most people will be safe for now using the other software for their usual needs.
IMO the thing you're missing is that opening a new gap is a lot more expensive than expanding on an existing gap or putting up with a mismatch. Biology works by keeping what works, and what works is a functional unit that does things. Genes are functional units, and so are protein domains. Changes that affect those units are selected against in nature, so while your algorithm makes the sequences look more "aligned" in the sense of Microsoft Word's Justify setting, it is not biological alignment.
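To make the point about gap costs concrete, here is a minimal sketch of affine gap scoring applied to rows of an existing alignment. The function names (`pair_score`, `sum_of_pairs`) and the penalty values are assumptions chosen for illustration; they are not the parameters used by Clustal Omega or T-Coffee.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two aligned rows with affine gap penalties.

    The first position of a gap run costs gap_open; every further
    position in the same run costs gap_extend. All values are
    illustrative assumptions, not any tool's defaults.
    """
    score = 0
    in_gap_a = in_gap_b = False
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # gap in both rows: carries no information for this pair
        if x == "-":
            score += gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif y == "-":
            score += gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            score += match if x == y else mismatch
            in_gap_a = in_gap_b = False
    return score


def sum_of_pairs(rows, **penalties):
    """Sum pair_score over every pair of rows in a multiple alignment."""
    return sum(pair_score(a, b, **penalties) for a, b in combinations(rows, 2))
```

With these penalties, `pair_score("AC--GT", "ACTTGT")` gives -2, while `pair_score("A-C-GT", "ACTTGT")` gives -8: the same number of gap characters, but splitting them into two separate gaps (and forcing a mismatch between them) is scored much worse than one contiguous gap.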
I am not aligning genes... Biology is not only about genes. The biomedical industry is what seems to be only about genes. People here are biased, and they don't seem to take any scientific method seriously. Sorry if I generalize. People who do not agree with all the bullshit displayed here should say so.
It doesn’t matter whether you’re aligning genes or not, indels are still less frequent than polymorphisms, and your algorithm doesn’t reflect that.
Now, please stop calling everyone here biased because we haven’t agreed with you. If you want honest feedback, we have provided it - if you don’t, you’re welcome to go elsewhere.
We are all trying to keep this discussion on track and not take things personally, but incensing statements make that difficult. I agree with you that biology is not just about genes. However, we are aligning sequences here, so there needs to be a reason we align sequences as well as a metric to measure how close we are to that goal. What is your goal/reason for aligning these sequences, and what metric are you using to show that your algorithm got us closer (than the other two algorithms) to that goal?
If your algorithm is better to reach that goal, it is better - as simple as that. It does need to be a well-defined and biologically relevant goal though, or we are not solving a biological problem.
To be blunt, yours is the worst there.
As ATpoint commented, gaps and gap extensions are (and should be) much more costly than mismatches to reflect biological reality (SNPs are more common than INDELs).
An alignment as ‘gappy’ as yours would be a nightmare to do subsequent phylogenetic analysis with (where columns that are entirely or mostly gaps are often discarded since they lack evolutionary signal).
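As a rough illustration of that column-discarding step, here is a minimal sketch that drops columns whose gap fraction exceeds a threshold. The function names and the 0.5 cutoff are assumptions for the example, not the behaviour of any particular trimming tool.

```python
def gap_fraction_per_column(rows):
    """Return the fraction of '-' characters in each alignment column."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    n = len(rows)
    return [sum(c == "-" for c in column) / n for column in zip(*rows)]


def trim_gappy_columns(rows, max_gap_fraction=0.5):
    """Keep only columns whose gap fraction is <= max_gap_fraction.

    The 0.5 default is an arbitrary illustrative cutoff.
    """
    fractions = gap_fraction_per_column(rows)
    keep = [i for i, f in enumerate(fractions) if f <= max_gap_fraction]
    return ["".join(row[i] for i in keep) for row in rows]
```

The more columns an alignment loses at this step, the less signal is left for tree building, which is why a very gappy alignment is hard to use downstream.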
I can only agree with ATpoint and Joe here, I'm afraid.
Apart from the points they have already made, I want to add one more: it makes no sense to align random sequences to each other. Where did you get the sequences you are using? From a gene family? If you really want to benchmark or showcase your tool's performance, take sequences from a well-defined gene family and align those. There are many papers on this specific topic (nothing really recent, but still all valid).
These are not random sequences. These are sub-sequences of longer sequences which were found by locally aligning a SINE sequence against a genome (if I remember well). Therefore, the sequences in which they are included are likely related. I am not interested in aligning conserved regions. If they are conserved then they will obviously have more SNPs than INDELs. Not sure in other cases.
Pfff, not sure where to start, but it's apparent (and I mean this in the most helpful way) that you are lacking a serious level of basic (molecular) biology, especially to get involved in such a topic.
Sorry, but this is a totally wrong assumption. Moreover, aligning (or working with) transposon and TE-related sequences adds an extra level of difficulty.
"Funny" you mention this, as this is exactly what people do to benchmark alignment methods, since for those we at least have some good indication of what the result should look like.
So, finding two almost identical SINE sequences in two different places of a genome and assuming they are related means "Sorry, but this is a totally wrong assumption". Could you expand on that?
Yes but let me first mention that the statement you make here in your comment is not the same as the one in the original I quoted.
The SINE elements themselves might be related, but they don't necessarily have to be! SINEs, just like many other transposon/TE sequences, are part of big families, and it is still not really clear whether they all share a common origin (and hence are related). Regardless, the locations where they integrate into a genome are not related simply because they contain the same kind of SINE. Integration sites of SINEs do not imply a biological relationship.
You could in theory align the SINE sequences (there is at least some sort of relationship); however, they are prone to rapid mutation accumulation, which makes aligning them difficult.
Anyway, this is all a minor comment compared to the ones others have posted here (and which I fully support), the main one being that there is a biological reasoning behind doing sequence alignments that cannot be neglected.
Not really my field, but shouldn't an alignment algorithm try to introduce as few gaps as possible to find the best overall alignment between sequences? I think this is biologically meaningful, as in e.g. protein-coding sequences in an evolutionary context gaps would probably lead to frameshifts. While a few might make sense, your alignment is basically gapping sequences within the (let's call it here) coding or "core" sequence until they fit to each other. In contrast, the other two algorithms try to keep those "sub-sequences" with local similarity gap-free and rather introduce gaps at the ends. I think keeping things as ungapped as possible in the "core" sequence is the better a priori assumption, rather than forcing sequences to match each other over the full length.
I cannot really think of a meaningful biological process where that many gaps would indicate conservation, be it protein-coding functions, transcription factor binding etc. What makes you think that this is a good strategy? Again, not really my core field, just thinking aloud.
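One way to put numbers on the "gaps in the core versus gaps at the ends" observation is to count them separately for each row. This is only a sketch; `gap_profile` is a hypothetical helper, and the two example rows are the first row of the Clustal Omega alignment and the first row of the custom alignment as posted in the question.

```python
def gap_profile(row):
    """Split the gaps of one alignment row into terminal and internal ones.

    Terminal gaps are leading/trailing '-' characters; internal gaps sit
    between the first and last residue. A "run" is a stretch of
    consecutive '-' characters.
    """
    core = row.strip("-")
    internal_runs = 0
    previous = ""
    for ch in core:
        if ch == "-" and previous != "-":
            internal_runs += 1  # a new internal gap was opened here
        previous = ch
    return {
        "terminal_gaps": len(row) - len(core),
        "internal_gap_runs": internal_runs,
        "internal_gap_chars": core.count("-"),
    }


# Row for sequence "a" as posted in the question (Clustal Omega vs custom).
clustal_a = "------GCATCCCGATGGTCACGGTCGCCACCAGCCTCGCCAACGACGGCA------------"
custom_a = "G--CAT-C-CC-G--ATG-GTC--ACGGTC-GC---CAC--CA-GCCTC-G-CCAA--C-GA-C-G----GC-A"

print(gap_profile(clustal_a))
print(gap_profile(custom_a))
```

Comparing the two printed profiles shows where each method places its gaps for the same input sequence.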
Edit:
That's what I get from an algorithm I designed and developed in one day while watching tv
Not sure why you write something like this, but to be honest, this is exactly what this alignment of yours looks like. If you do not like people giving snarky comments, maybe better avoid this kind of sentence, to keep everything serious.
That usually helps to instantly identify those who do not have honest intentions when writing an answer. They usually lose their cool for no apparent reason. Apart from that, it's (more or less) the truth, and it's usually healthy to display the truth. Glad it "helped" you to identify "what this alignment of yours looks like".
If ATpoint was prejudiced in his answer, that's because you wrote your post in what appears to be an arrogant manner. It alienates people. You might want to consider this in your future interactions.
It doesn't change the fact that, at least on this data set, your algorithm, whatever it is, doesn't appear particularly performant.
Based on the current data, there is nothing to suggest your algorithm is good on divergent data as you suggest. All we know is it is not very good on notionally similar or related sequences (assuming that is the case for your example data).
I don't believe that DNA-based alignment will ever be particularly good at resolving poorly conserved sequences (unless you know something about the ground truth of the evolution of that set of sequences and can calibrate accordingly). For this purpose, protein alignment and particularly HMM-based alignment methods excel. I cannot see your algorithm outperforming them at the moment; certainly not until you provide some equivalent benchmark data anyway.
Some general advice: don't be so precious about your algorithm. You seem to be asking for opinions (repeatedly) but are not prepared to listen to answers or opinions which do not align with your preconceived ideas. It is possible your algorithm is no good. Be open to that possibility.
Unfortunately, that is not true either. An incensing statement runs counter to the assumption that people are here in good faith, and given that tone is not apparent online, it propagates the worst assumptions on all sides. Let's just stick to the science and leave our personal feelings out of the conversation.
So you're calling AT dishonest? Really classy.
Let's please not continue this line of conversation. We are all professionals here and nothing needs to be taken personally.
For completeness: I actually noticed this elaborate sentence of yours after I commented, which is why I used "edit", as I always do to indicate changes. Believe it or not, the section above the "edit" therefore even represents my "honest" opinion, which was intended to be helpful. Negative criticism is unpleasant but sometimes necessary. Anyway, the fact that you repeatedly, in this thread and the one you posted before, act offended after receiving negative criticism makes me pull out of this thread and any you post in the future. Good luck with your research effort.
Just some updates, given that people seem to be quite interested in this topic.
(I) I didn't read all the comments. Too much nonsense IMHO. I will try to focus only on what is interesting to me. Some people here seem to be very young and/or lacking many vital experiences. That's the best explanation I can find (not everybody, of course).
(II) I have been running more tests with my algorithm. I improved it a bit to make it discard gaps (just discarding columns with too many gaps). My conclusions for now are: 1) It performs similarly to Clustal Omega when making phylogenetic trees for conserved regions (but faster). 2) It apparently performs better with non-conserved regions. What kind of evaluation am I doing? I just make trees from the alignments and then compare the trees and the pairwise alignments in each tree. I find that the trees make about as much sense as the trees obtained with Clustal. With conserved regions, the pairwise alignments are similar between my algorithm and Clustal, but with unconserved regions the pairwise alignments are better with my algorithm. That's all. Quite subjective so far. I will continue doing some tests and will likely report back. But one important thing is that you can use it as a replacement for Clustal and there will not be much difference. I tested it, for example, with the 35 mammals alignment data (from Ensembl), and indeed my tree seems to make more sense than the Clustal one.
(III) If you remember, it's not only that I made software to replace Clustal Omega; I also made a replacement for BLAST, and I also use my own software for generating phylogenetic trees. I have spent quite a bit more time on these than on the multiple alignment software. I don't care much whether the multiple alignment is better or not. I just implemented it because the alignments I had been obtaining from Clustal and T-Coffee were not good enough in my opinion. People here are making a big thing out of something I clearly stated I did "in a day while watching tv". That's not something one says to try to give special value to one's work... I wasn't trying to impress anyone with that part of my work... People seem quite confused, BTW that's their problem...
This is a tree using Clustal:
A tree with my algorithm:
Will likely continue reporting when I have more data...
Stop it with the bad faith and ad hominem comments or you will be banned from the forum.
Can you share the actual Newick tree files for those trees?
How are you calculating branch lengths in your algorithm?
An important bioinformatics rule: trash in, trash out. If you are in that situation, there is nothing to benchmark or to show. If you want to show that your algorithm is good, then do it at least with a more real-life situation.
Another thing: look at the title of this forum... it is called Biostars and is made for bioinformatics. What you are doing now does not make any sense; if it is just an algorithm for string comparison, then Stack Overflow is maybe a better place.
At this stage it's even worse than that, I think. Given that we know nothing of what the algorithm does or what it's for (other than in theory being good for divergent sequences), "garbage in -> garbage out" could be "gold in -> garbage out" while it is black-boxed in some mystery algorithm.
Could someone explain to me why my answer (the one I consider the right one) was moved to a comment while a deeply wrong answer is kept?
Because yours isn’t an answer, it is more suitable as a comment. You came here looking for our opinions (ostensibly), so your own opinion that just ignores everything we said cannot by definition be an answer to the question.
Your content adds to the discussion but does not really answer the question as it looks like a defensive argument and not an objective statement. It would fit as an "answer" if this were a Forum discussion post. However, since this is a "Question", Mensur's post addresses the question better and is thus better suited as an answer.
Would you prefer if your content was an answer so you could accept it? As with all online communities, the community gets to decide what is most helpful but if you'd like for your post to be made an answer, we can go ahead and do that.