Given two short sequences (~1000 bp) I want to find all local alignments between them
1
0
Entering edit mode
6.8 years ago

I have two sequences like those at the end. I want to find all local alignment larger than 10 and maybe shorter than some number (maybe I can filter that out later).

I was trying to do it with a smith-waterman, but I cannot find an implementation that retrieves me the from / to value to indentify where the results come from.

I want to allow mismatch and gaps, any ideas on what tool can I use?

It has to be fast since I will be splitting a full chromosome / genome and re doing this task many times. I'd also like to avoid reverse complement pairing.

>1
ATGGCTAGGAAACATAACCATCTTTGATCAACGAGCTAGTCAAGTAGAGGCATACTAGTG
ACACTCTGTTTGTCTATGTATTCACACATGTATCATGTTTCCGGTTAATACAATTCTAGC
ATGAATAATAAACATTTATCATGATATAAGGAAATAAATAATAACTTTATTGTTGCCTCT
AGGGCATATTTCCTTCAGAGTGCTAGTTGAAGGCACGTGATGTGCATCTTAGGTTAAACT
GTAAATCGAATCTTACTCTCGAGTAACTCCAAAAGCGGCCACCGAAAATCATTATTGGAT
TTGTTTTTCATTACCAGCACGTCGAGGTGAAAGATGTGCAGGTTCAGTGGGTCGACTTGC
GTTTTCACTGACATGTGGGACCGAATGTAAGCAAACAGACGATGGGCACACTCCGCGCGT
CTGCTGGCTGGCTGAGAAGAGAGAGGAGGGTTGAGCACGGACTGTAGGAAATTTCGTGAC
AATGTGAGTAGTACTAGTGGTGTACTGTACTCCTCGAGAGATAATGTGTTCGCTCTACGT
AACCAGCCCCTCGATATAGTGGGGTGGCATGGGCCATAACTAGCATTCACGTCTAACAGT
ACGGCACATTCCACCAATTTTCTTGGAATCCATGCTACAGCTATCTCTTCTCCCTCCTAT
TTTCTCTAACTCAGCCATCGCGCGGTGGGAGGGGTGGGGGTTTGCCGATGTCAGTTTGCT
CAACTTCGGTCCCACTTGTAAGTGAAAATGCAACACAGCACGCATCCCCCCTTGAGCGGC
GTGCCAGACCAACCGTGTGGAGGTGCGATGCGAAAATGGCAGGCTTTTCCTTTTGAACTT
GTAACATGTAAATTTTGGTTCTGCTCCTTGGCCCGACCGATAGATAGTAATGAGAAGTTG
ATGCCAAAAAATGAGAAGTTTGGTTCTGGCGCGGTTCATGCATGGTACGTACAAAAATTA
TGAAGTTGATGCAAAAGGCAAGATAATTACTAGTACAATATAGTCCCTCCATTCCTAAAT
ATTTGTATTTTTAGGAAAGATACATATGGATCCACCAACGACATCCTCACGCCCTAGTCC
GCCCGGTCACCAACCAACGGGTGAGGAGCAGCAGCGTTGGCGGCGAGCTCAACGTGCTTC

>1rc
CAATCGCAAGATCAGTTGACTAATGGAAGCTATACCCATGGAGGCGATGAGCGCGAGCAT
CAGCCGACTGTGCCAGATGGTCCATGACGCCGGCCTGCGGCCTGGCACCGAGGAACGTCT
CCAGGTCGTGCTTGAGGCCGCTAGGGCGAGAGGTCTCTTGGACGACAGCTTCGTCTCCTT
GTTCGACAAGGTCCTCGTTGGATTCCTCGACAAGTTCAACGTCGTGAAGAAGCTCACGGA
CGACCTTGACATACGCCTCCAGCCCACGCGCCCAGGCTCTACGATGCCCGCCACCCTCAA
CGACCTCTATGATGACAACCTCTTCGATGCACTGGTGGACCTGCGACTGCCCGTCGTCGT
GCCGGAGATTGTCCACCTCGAGGTCACGCTCGCCGCGCAGCGCCTGGCACAGTAAGACAC
CATCGACATAATCACCCACGTCTACGCGCAAATCGTCCACAAGGACTACTACATGCCAGA
GGAGGAGGATAGGACGCTGGCCTTCTTGGAACGCAGGGCAACCTTGGACGGCATTGTTTA
GAAGCACGTTGAGCTCGCCGCCAACGCTGCTGCTCCTCACCCGTTGGTTGGTGACCGGGC
GGACTAGGGCGTGAGGATGTCGTTGGTGGATCCATATGTATCTTTCCTAAAAATACAAAT
ATTTAGGAATGGAGGGACTATATTGTACTAGTAATTATCTTGCCTTTTGCATCAACTTCA
TAATTTTTGTACGTACCATGCATGAACCGCGCCAGAACCAAACTTCTCATTTTTTGGCAT
CAACTTCTCATTACTATCTATCGGTCGGGCCAAGGAGCAGAACCAAAATTTACATGTTAC
AAGTTCAAAAGGAAAAGCCTGCCATTTTCGCATCGCACCTCCACACGGTTGGTCTGGCAC
pairwise alignment local • 1.8k views
ADD COMMENT
0
Entering edit mode

can't you run BLAST with reduced word_length?

https://www.ncbi.nlm.nih.gov/books/NBK279684/

word_size blastn integer 11 Length of initial exact match.

ADD REPLY
5
Entering edit mode
6.8 years ago

BLAST will give you all possible local alignments that satisify certain alignment parameters. Therefore, make sure you:

  1. set high E-value (e.g. 10e+14)
  2. set short word size (e.g. 4)
  3. set low gap penalties
  4. disable sequence masking.

To run BLAST to align only two sequences (rather than search through database) just type:

blastn -query seq1.fa -subject seq2.fa -dust no -soft_masking false -evalue 1e+14 -word_size 4 -task blastn-short

You can add -outfmt 7 to save results in tab-separeted format.

You can also do this comparison on-line on the NCBI BLAST website by checking the option Align two or more sequences.

Sample output:

BLASTN 2.2.28+


Query= 1

Length=1140

Subject= 1rc

Length=900


 Score =  714 bits (360),  Expect = 0.0
 Identities = 360/360 (100%), Gaps = 0/360 (0%)
 Strand=Plus/Minus

Query  781   GTGCCAGACCAACCGTGTGGAGGTGCGATGCGAAAATGGCAGGCTTTTCCTTTTGAACTT  840
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  900   GTGCCAGACCAACCGTGTGGAGGTGCGATGCGAAAATGGCAGGCTTTTCCTTTTGAACTT  841

Query  841   GTAACATGTAAATTTTGGTTCTGCTCCTTGGCCCGACCGATAGATAGTAATGAGAAGTTG  900
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  840   GTAACATGTAAATTTTGGTTCTGCTCCTTGGCCCGACCGATAGATAGTAATGAGAAGTTG  781

Query  901   ATGCCAAAAAATGAGAAGTTTGGTTCTGGCGCGGTTCATGCATGGTACGTACAAAAATTA  960
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  780   ATGCCAAAAAATGAGAAGTTTGGTTCTGGCGCGGTTCATGCATGGTACGTACAAAAATTA  721

Query  961   TGAAGTTGATGCAAAAGGCAAGATAATTACTAGTACAATATAGTCCCTCCATTCCTAAAT  1020
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  720   TGAAGTTGATGCAAAAGGCAAGATAATTACTAGTACAATATAGTCCCTCCATTCCTAAAT  661

Query  1021  ATTTGTATTTTTAGGAAAGATACATATGGATCCACCAACGACATCCTCACGCCCTAGTCC  1080
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  660   ATTTGTATTTTTAGGAAAGATACATATGGATCCACCAACGACATCCTCACGCCCTAGTCC  601

Query  1081  GCCCGGTCACCAACCAACGGGTGAGGAGCAGCAGCGTTGGCGGCGAGCTCAACGTGCTTC  1140
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  600   GCCCGGTCACCAACCAACGGGTGAGGAGCAGCAGCGTTGGCGGCGAGCTCAACGTGCTTC  541


 Score = 22.3 bits (11),  Expect = 0.19
 Identities = 11/11 (100%), Gaps = 0/11 (0%)
 Strand=Plus/Minus

Query  894  GAAGTTGATGC  904
            |||||||||||
Sbjct  719  GAAGTTGATGC  709


 Score = 22.3 bits (11),  Expect = 0.19
 Identities = 11/11 (100%), Gaps = 0/11 (0%)
 Strand=Plus/Minus

Query  962  GAAGTTGATGC  972
            |||||||||||
Sbjct  787  GAAGTTGATGC  777


 Score = 20.3 bits (10),  Expect = 0.77
 Identities = 10/10 (100%), Gaps = 0/10 (0%)
 Strand=Plus/Minus

Query  833  TTGAACTTGT  842
            ||||||||||
Sbjct  219  TTGAACTTGT  210


 Score = 20.3 bits (10),  Expect = 0.77
 Identities = 10/10 (100%), Gaps = 0/10 (0%)
 Strand=Plus/Minus

Query  854  TTTGGTTCTG  863
            ||||||||||
Sbjct  762  TTTGGTTCTG  753


 Score = 20.3 bits (10),  Expect = 0.77
 Identities = 10/10 (100%), Gaps = 0/10 (0%)
 Strand=Plus/Minus

Query  919  TTTGGTTCTG  928
            ||||||||||
Sbjct  827  TTTGGTTCTG  818

and the list continues....

Sample tab-separeted output:

# BLASTN 2.2.28+
# Query: 1
# Subject: 1rc
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 5677 hits found
1   1rc 100.00  360 0   0   781 1140    900 541 0.0  714
1   1rc 100.00  11  0   0   894 904 719 709 0.19    22.3
1   1rc 100.00  11  0   0   962 972 787 777 0.19    22.3
1   1rc 100.00  10  0   0   833 842 219 210 0.77    20.3
1   1rc 100.00  10  0   0   854 863 762 753 0.77    20.3
1   1rc 100.00  10  0   0   919 928 827 818 0.77    20.3
1   1rc 92.31   13  1   0   1065    1077    153 141 3.0 18.3
1   1rc 100.00  9   0   0   1120    1128    396 388 3.0 18.3
1   1rc 100.00  9   0   0   620 628 503 511 3.0 18.3
1   1rc 100.00  8   0   0   781 788 70  77     12   16.4
1   1rc 100.00  8   0   0   625 632 206 199    12   16.4
1   1rc 100.00  8   0   0   532 539 276 283    12   16.4
1   1rc 100.00  8   0   0   322 329 381 374    12   16.4
1   1rc 91.67   12  1   0   1053    1064    418 429    12   16.4
1   1rc 100.00  8   0   0   588 595 437 444    12   16.4
1   1rc 100.00  8   0   0   442 449 478 485    12   16.4
1   1rc 100.00  8   0   0   861 868 570 577    12   16.4
1   1rc 100.00  8   0   0   610 617 630 623    12   16.4

and the list continues..

ADD COMMENT
0
Entering edit mode

I think sequence masking is disabled by default? I cannot find an option to do so

ADD REPLY
0
Entering edit mode

By default, sequence masking is enabled in BLAST. To disable it, add these two arguments: -soft_masking false -dust no.

ADD REPLY
0
Entering edit mode

Does this answer your question?

ADD REPLY

Login before adding your answer.

Traffic: 1940 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6