Parse sim4 output alignment format
0
0
Entering edit mode
5.8 years ago
rubic ▴ 270

Hi,

This is somewhat redundant with this post, which has not been fully answered.

I'm using Webb Miller's sim4 aligner, which output (specifying A=1 in the arguments) looks like this:

seq1 = /data/seq1.fa, 1095 bp 
seq2 = /data/seq2.fa (chrA:23544043-25452600), 1908558 bp


      0     .    :    .    :    .    :    .    :    .    :
     96 CTTC ACATC CG TGTCCCGGCCCGGCCGCGGGGAGCCCCGCTTCATCG
        ||||-|||||-||-|||||||||||||||  |||||||||||||||||
1136214 CTTCGACATCGCGGTGTCCCGGCCCGGCCAAGGGGAGCCCCGCTTCATGT

     50     .    :    .    :    .    :    .    :    .    :
    143 CCGTGGGCTACGTGGACGACACGCAGTTCGTGCGGTTCGACAGCGACGCC
        |||||||||||||||||||||||||||||||||| |||||||||||||||
1136264 CCGTGGGCTACGTGGACGACACGCAGTTCGTGCGCTTCGACAGCGACGCC

    100     .    :    .    :    .    :    .    :    .    :
    193 GCGAGCCAGAGGATGGAGCCGCGGGCGCCGTGGATAGAGCAGGAGGGGCC
        ||||||| |||| |||||||||||||| ||||||| |||| || ||  |
1136314 GCGAGCCCGAGGGTGGAGCCGCGGGCGGCGTGGATGGAGCGGGTGGACCA

    150     .    :    .    :    .
    243 GGAGTATTGGGACCAGGAGACACGGA
        || ||||||||||| |||||||||||
1136364 GGCGTATTGGGACCGGGAGACACGGA

      0     .    :    .    :    .    :    .    :    .    :
    320 GCTACTACAACCAGAGCGAGGCCG         GTTCTCACACCATCCAG
        ||||||||||||||| ||||||||>>>...>>>| ||||||||| |||||
1136441 GCTACTACAACCAGACCGAGGCCGGTG...CAGGCTCTCACACCTTCCAG

     50     .    :    .    :    .    :    .    :    .    :
    361 ATAATGTATGGCTGCGACGTGGGGTCGGACGGGCGCTTCCTCCGCGGGTA
        |  ||||| |||||||| |||||||||||||||||| |||||||||||||
1136677 AGGATGTACGGCTGCGAAGTGGGGTCGGACGGGCGCCTCCTCCGCGGGTA

    100     .    :    .    :    .    :    .    :    .    :
    411 C CGGCAG GACGCCTACGACGGCAAGGATTACATCGCCCT GAACGAGG
        |-||-|||-| -||||||||||||  |||||||||||||||-|  -||||
1136727 CACG CAGTGG GCCTACGACGGCGCGGATTACATCGCCCTCGCT GAGG

    150     .    :    .    :    .    :    .    :    .    :
    458 ACCTGCGCTCTTGGACCGCGGCGGACATGGCGGCTCAGATCAC CAAGCG
        ||||| |||| |||   ||||||||||  ||||||| ||||||-|-||||
1136774 ACCTGAGCTCCTGGGTGGCGGCGGACACCGCGGCTCTGATCACAC AGCG

    200     .    :    .    :    .    :    .    :    .    :
    507 CAAGTGGGAGGCGGCCCATGAGGCGGAGCAGTTGAGAGCCTACCTGGATG
        |||||||| || ||||   | ||| ||||     || ||||||||||| |
1136823 CAAGTGGGTGGAGGCCGGCGTGGCAGAGCGCCACAGGGCCTACCTGGAGG

    250     .    :    .    :    .    :    .    :    .    :
    557 GCACGTGCGTGGAGTGGCTCCGCAGATACCTGGAGAACGGGAAGGAG AC
        |||| || |||||||||||||||||| ||||||||||||||||||||-|-
1136873 GCACCTGTGTGGAGTGGCTCCGCAGACACCTGGAGAACGGGAAGGAGCA

    300     .    :
    606 GCTGCAGCGCAC
        ||||||||||||
1136922 GCTGCAGCGCAC

What I want to get out of this is what are the indel positions in the alignment. In this example there are two aligned blocks (they start with the 0 indicator in the alignment consensus part on top of the aligned sequences). The results are (0-based):

for the first block: 4, 10, 13

for the second block: 101, 104, 108, 111, 141, 145, 193, 195, 297, 299

Naturally it would be nice to use some existing tool (preferably an R package) to read in this alignment and to extract it directly from the read-in alignment object but I can't even figure out what format this output is in.

Any help would be highly appreciated.

sim4 pairwise alignment parse • 1.1k views
ADD COMMENT
0
Entering edit mode

I was going to suggest using BioPerl sim4 SearchIO parser but this is not appear to be working for me...the parsed resuts in something like this appear wrong to me.... $searchio = Bio::SearchIO->new(-file => 'test.blastx', -format => 'sim4');

ADD REPLY

Login before adding your answer.

Traffic: 1916 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6