Alignment of a sequence to a reference database.
5.3 years ago
juanjo75es ▴ 130

Hi,

I have developed software for finding local alignments of a sequence against a larger database, like BLAST but using a different technology.

I have found that it sometimes appears to have advantages over BLAST. It also has disadvantages, given that it is still less developed and not running on the best hardware.

Some apparent advantages are:

  • It finds different results than BLAST. By combining BLAST and this tool, you can get a more complete set of results.
  • The results often get a better score than BLAST results. Sometimes BLAST finds subsequences that partially match the query sequence but align poorly over the rest of the sequence.
  • It seems to me better suited for aligning homologous or related sequences, given that it doesn't focus primarily on partial results but on the best local alignments for the full sequence.

For example, this is an alignment of a query sequence (1st), the best result found by my software (2nd), and the best BLAST result (3rd):

GGCCGGGC-G-CGGTGGCTCACGCC-TGTAATCCCAGCACTTTGGGAGGC-CGAGGC-GGGCGGA--TC-ACGAG-GTCAGGAGATCGAGACCA-TCCTGG-C-CAACACG-GTGA
G-CCGGGCGGT-GGTGGCTCACGCCTT-TAATCCCAGCACTT-GGGAGGCA-GAGGCAGG-CGGATTTCT--GAGT-TCA--AG---G---CCAG-CCTGGTCT-A-CA-GAGTGA
GGCCGGGC-G-TGGTGGCGCACGCC-TTTAATCCCAGCAC-TTGGGAGGC-AGAGGC-AGGCGGA--T---------TTCTGAGTTCGAGGCCA-GCCTGG-T-CTACAAA-GTGA

Clearly the first result (the one found with this tool) is a better match and you would miss it if only using BLAST.
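
As a rough way to check a claim like this against the rows posted above, one could count matching columns between the query row and each hit row. This is only a sketch: `column_identity` is a hypothetical helper, not part of either tool, and treating a gap opposite a residue as a plain mismatch is an assumption; a different scoring scheme could rank the hits differently.

```python
def column_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Fraction of aligned columns where both rows carry the same residue.

    Columns where both rows are gaps are ignored; a gap opposite a residue
    counts as a mismatch.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = 0
    columns = 0
    for a, b in zip(row_a, row_b):
        if a == gap and b == gap:
            continue  # this column carries no information for the pair
        columns += 1
        if a == b:
            matches += 1
    return matches / columns if columns else 0.0
```

Calling `column_identity(query_row, hit_row)` once per hit gives two numbers that can be compared directly, provided both hits come from the same alignment.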

Another example: if you search for a homolog of the human BRCA2 gene in the chimpanzee genome, BLAST returns the following (although it does not always return the same results):

Chromosome 13 Range: 13386740 to 13413300
Chromosome 9 Range: 109512074 to 109513130
Chromosome 15 Range: 29133481 to 29134527

Meanwhile, with my algorithm you will find these results in chromosome 13. The first one corresponds to the official homolog. The first result from BLAST also seems to match the right sequence, but the sequence position does not match the reference genome. The second result from my algorithm is a better alignment than the alternative ones found by BLAST. There are indeed many other better results in other chromosomes. For example here.

My questions are: Is this software worth continuing to develop? Is there a need for alternative BLAST results? Is it really better in some cases, or am I missing some BLAST parameters that would improve the results? Is there a need for a tool for finding complete local alignments (not only subsequence alignments, as BLAST does)?

Thanks.

alignment • 2.7k views
ADD COMMENT

BLAST is not a global alignment heuristic (it's in the name: Basic Local Alignment Search Tool), so if that's what you're trying to do, that's the wrong tool to use for comparison. For a fair comparison, the implementations you compare should each use the best settings for the case at hand. If you optimize the parameters for your tool but choose bad ones for the others, the comparison is worthless. You may also want to compare with tools that implement other approaches, such as exonerate, and with baseline algorithms like Needleman-Wunsch.

ADD REPLY

I'm not trying to do a global alignment. You can see that I copy-pasted a local alignment. But I am looking for local alignments of the full sequence, not partial local alignments. Unfortunately, I have not compared it against all possible BLAST parameters, nor tried all possible parameters of my algorithm. That could take me months. Nor do I know all the ways people use BLAST. Maybe it's useful for some people but not for me. Anyway, after a filtering phase the alignment is done using Needleman-Wunsch. That's not the important point.

ADD REPLY

You can see that I copy-pasted a local alignment.

No, you cannot see that. If someone really had to guess what kind of alignment it is, I think most people would say a multiple sequence alignment, because it shows more than two sequences. Maybe your tool works well, but right now you are explaining it poorly.

ADD REPLY

You correctly understood what I was saying... And a multiple alignment is still a LOCAL alignment. That cannot in any way be a global alignment to a reference genome. Come on...

ADD REPLY

"Come on..." I am not going to start a discussion, but you really need to look up the difference between local and global alignment. I think you also need to look up the difference between pairwise and multiple alignment. You are the one who posted this on this forum, giving little and odd-sounding information, and in every reply you are acting pretty rude.

EDIT:

I just saw your reply explaining local alignment to someone else and saying that BLAST is not really local. So yes, my reply stands, and you need to look some things up.

ADD REPLY

I'm not sure I understand: the first sequence was your input (query), the second is the best hit of your tool, and the third was the best hit of BLAST?

Clearly the first result (the one found with this tool) is a better match and you would miss it if only using BLAST.

So "the first result" is the second sequence. I don't see why this hit is clearly better; can you explain that?

It is not a completely fair comparison (apples and oranges), but if I do a global alignment, the BLAST hit has a higher identity.

ADD REPLY

I'm not sure I understand: the first sequence was your input (query), the second is the best hit of your tool, and the third was the best hit of BLAST?

That's right.

I say that the second sequence is a better alignment to the first one because it gets a better score with any scoring method you use. The Levenshtein distance to the query is 34 for the first match (my tool) and 40 for the second (BLAST).

Sorry, I don't get what you mean by your last sentence...

ADD REPLY

I don't get those scores. I used this website: https://planetcalc.com/1721/ and this one: http://www.unit-conversion.info/texttools/levenshtein-distance/
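
The distances quoted in this thread can also be checked locally instead of through web calculators. A minimal sketch, assuming unit costs for substitutions, insertions, and deletions, and assuming the gap characters are stripped from the posted rows before comparison (the thread does not say how the original numbers were obtained); `levenshtein` and `ungap` are illustrative names.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit cost for substitution, insertion and deletion."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete a character of s
                current[j - 1] + 1,            # insert a character of t
                previous[j - 1] + (cs != ct),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def ungap(row: str, gap: str = "-") -> str:
    """Remove gap characters so that only the raw sequence is compared."""
    return row.replace(gap, "")
```

For the rows in the question this would be `levenshtein(ungap(query_row), ungap(hit_row))`. Whether gaps are stripped first, and which substring of each hit is compared, changes the result, which may explain why different people get different numbers.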

I also don't think this is the right way to test something like this. Also, if you want to publish this, which is what it looks like, you need to give much more information. A key point is that others need to be able to reproduce it. If this is not meant for publication, maybe you should add the goal to your post. It also does not look like a question.

but the sequence position does not match the reference genome

This is also confusing; it almost sounds like you are talking about a mapping tool.

ADD REPLY

I'm with gb on this. I don't see why 'your' hit is a better one than the BLAST one.

Of course, if you use some weird/wrong/other/... scoring scheme you will see score differences. The ones BLAST uses, on the other hand, are well established and based on empirical observations.

ADD REPLY

Is there a need for alternative BLAST results?

I would say yes (though I am a big BLAST believer as well), but mainly on the speed side of things, not really on the quality of the results returned by BLAST.

I think it might also be worth pointing out that BLAST is a search tool (cf. "Google for sequences"), NOT an alignment tool! In that sense, there are others that do a much better job at creating the best (or a good) alignment, but those come at a tremendous cost: they are much slower than BLAST.

ADD REPLY
5.3 years ago
Mensur Dlakic ★ 28k

Is this software worth continuing to develop? Is there a need for alternative BLAST results?

BLAST is not meant for aligning sequences to a reference database. It is meant primarily for quickly finding matches against large databases, with an understanding that there is a small loss of sensitivity involved. Other sequence search tools also often operate with an understanding that there is a trade-off between speed and sensitivity. There are tools with greater sensitivity than BLAST (old WU-BLAST used to be one). Rather than passing judgment on your tool based on the limited information you provided, I'll just tell you that BLAST has been developed by a team of people for decades. The same is true for many other sequence search tools.
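
To illustrate where the speed/sensitivity trade-off comes from, here is a minimal sketch of exact k-mer seeding, the general idea behind seed-and-extend search. It is not BLAST's actual implementation; the function names and the choice of `k` are assumptions for illustration. A hit that shares no exact k-mer with the query is never examined, which is how sensitivity is traded for speed.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def find_seeds(query: str, index: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Return (query_pos, reference_pos) pairs that share an exact k-mer."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for rpos in index.get(query[qpos:qpos + k], []):
            seeds.append((qpos, rpos))
    return seeds
```

Only the regions around the returned seeds would then be extended into full alignments. Lowering `k` finds more seeds, at the cost of more spurious ones to extend.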

From an oddsmaker's point of view, it is unlikely that you have something that has greater general utility than BLAST, though it may serve your specific needs better. I suggest you read up on the history of these programs as that will help you understand the whole field better, and hopefully give you a clearer idea where your program fits within it.

https://en.wikipedia.org/wiki/BLAST_(biotechnology)

https://en.wikipedia.org/wiki/FASTA

https://ab.inf.uni-tuebingen.de/software/diamond/

https://genome.ucsc.edu/cgi-bin/hgBlat

https://academic.oup.com/bioinformatics/article/32/17/i680/2450775

ADD COMMENT

Of course I understand that BLAST makes a trade-off between speed and sensitivity... Same as my tool. But obviously with the goal of making a local alignment (sometimes to a reference database).

ADD REPLY

From an oddsmaker's point of view, it is unlikely that you have something that has greater general utility than BLAST, though it may serve your specific needs better.

This is the main consideration: what is the purpose of your tool and are there already tools in that space? Once this is defined, you need to benchmark adequately even if you don't intend to publish because otherwise, you're just going to fool yourself.

ADD REPLY

Please just respond to my questions if you can, and don't waste your time if you are not interested or can't.

ADD REPLY

Is this software worth continuing to develop?

You don't give enough details about what it does and haven't benchmarked it, so most likely the answer is no.

Is there a need for alternative BLAST results?

There are already alternatives. You don't say what the strengths of your approach are, so the answer is probably no.

Is it really better in some cases, or am I missing some BLAST parameters that would improve the results?

Do proper benchmarking and show the results; otherwise the question can't be answered.

Is there a need for a tool for finding complete local alignments (not only subsequence alignments, as BLAST does)?

You'll have to define what "complete local alignment" means, but my suspicion is that this can already be obtained by selecting the right parameters in existing tools, so the answer is again most likely no.

ADD REPLY

I think you don't understand many things. I'm not sure what experience you have in software development (which is not the same as applying informatics to science). Benchmarking is useful once you have a final product and a final dataset to test it on... That's quite elementary. That's not yet the case for this algorithm. And that's why I am asking whether (IN CASE it really does work better in some cases) it would be useful and worth it, and in which cases and under which circumstances it could be useful (given that it's a trade-off). Your answer to every single question is "I don't know, but no". That's quite helpful for understanding your attitude but not for resolving the doubts.

ADD REPLY

So you don't understand what a local alignment is... I don't think this is the right place for such elementary questions, but here is an answer. A global alignment to the reference genome would be an alignment in which one of the sequences is the full genome and the other one millions of gaps and a few nucleotides. A local alignment of the complete sequence is (as anyone should easily understand) just a local alignment of the sequence. That's the definition. Just look for a local alignment and you have the "complete local alignment". The problem is not in my definition but in BLAST's. BLAST does not really make local alignments. BLAST finds subsequences of the query sequence that have high-scoring local alignments, but usually it does not return a formal local alignment.

ADD REPLY

A global alignment to the reference genome would be an alignment in which one of the sequences is the full genome and the other one millions of gaps and a few nucleotides.

Where did you get the "millions of gaps" part? That's not an alignment at all. "Millions of gaps and a few nucleotides" is what we would get by throwing random nucleotides at a genome sequence. Alignment would give up after a much smaller number of gaps (and that too in a semi-global alignment). In true global alignment, you're looking to align the entire sequence A to the entire sequence B. Semi-global is where Sequence B aligns to a small portion of Sequence A.
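
The three modes being discussed differ only in which end gaps are free. A minimal score-only sketch, assuming a simple linear gap penalty and illustrative scores (match +1, mismatch -1, gap -1); `alignment_score` is a hypothetical helper, and "semiglobal" here means the query is aligned end to end while unaligned reference flanks cost nothing.

```python
def alignment_score(query: str, reference: str, mode: str = "global",
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best alignment score under global, semiglobal or local rules."""
    if mode not in {"global", "semiglobal", "local"}:
        raise ValueError(f"unknown mode: {mode}")
    m = len(reference)
    # First row: skipping a reference prefix is free unless the mode is global.
    previous = [j * gap if mode == "global" else 0 for j in range(m + 1)]
    best_local = 0
    for i in range(1, len(query) + 1):
        # First column: skipping a query prefix is free only in local mode.
        current = [0 if mode == "local" else i * gap]
        for j in range(1, m + 1):
            same = query[i - 1] == reference[j - 1]
            cell = max(
                previous[j - 1] + (match if same else mismatch),  # align two bases
                previous[j] + gap,       # query base against a gap
                current[j - 1] + gap,    # reference base against a gap
            )
            if mode == "local":
                cell = max(cell, 0)  # restart when the running score drops below zero
                best_local = max(best_local, cell)
            current.append(cell)
        previous = current
    if mode == "local":
        return best_local
    if mode == "semiglobal":
        return max(previous)  # the reference suffix after the query is free
    return previous[m]
```

With these scores, aligning `ACGT` against `TTACGTTT` gives 4 in local and semiglobal mode but 0 in global mode, because the global alignment has to pay for the four unaligned reference bases.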

ADD REPLY

And there is actually already enough benchmarking data to draw some conclusions... But I will not insist on that, given that we seem to be stuck on more basic concepts.

ADD REPLY

You would be wise to reconsider your tone if you're going to continue to post here.

You have asked for people's opinions about how they perceive a possible competitor to BLAST; you do not also get to tell them their opinions are wrong or they don't understand.

ADD REPLY
