Question

How To Shift Alignment Of Short Sequences

1

Entering edit mode

11.7 years ago

random26078 ▴ 20

I have a file with multiple sequences like this:

ATTCCTACTGGCCCATAATC
TCCTACTGGCCCATAATCCA
ATTCCCACTGGCCCATAATC
GGCCCATAATCCAGCTGTGG
AGTCCTACTGGCCCATAATC
CTGGCCCATAATCCAGCTGT
ACTCCTACTGGCCCATAATC

I need to line up sequences 2-N in relation to sequence 1 so that my output file looks like this:

ATTCCTACTGGCCCATAATC
  TCCTACTGGCCCATAATCCA
ATTCCCACTGGCCCATAATC
         GGCCCATAATCCAGCTGTGG
AGTCCTACTGGCCCATAATC
       CTGGCCCATAATCCAGCTGT
ACTCCTACTGGCCCATAATC

I have tired to use BioStrings but get this:

library(Biostrings)
 s1 = DNAString("ATTCCTACTGGCCCATAATC")
 s2 = DNAString("TCCTACTGGCCCATAATCCA")
 pairwiseAlignment(s1, s2)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [3] TCCTACTGGCCCATAATC 
subject: [1] TCCTACTGGCCCATAATC 
score: -0.3283901

Which is not what I want. I tested all of the alignment methods in Biostrings and they all give the same answer. I have also looked at BioPerl options but none seem to work the way I need.

What I want to do is to introduce spaces into the beginning of the sequence to make the sequences line up when printed one on top of the other. I do not want to introduce gaps and I do not want to lose any of the bases in any sequence. I just want them shifted.

Does anyone know a way to do this?

perl bioperl r bioconductor • 2.7k views

ADD COMMENT • link updated 11.6 years ago by Kenosis ★ 1.3k • written 11.7 years ago by random26078 ▴ 20

score 3 · Answer 1 · 2013-04-05

Perhaps the following will be helpful:

use strict;
use warnings;

while ( my $seq1 = <> ) {
    chomp $seq1;
    print "$seq1\n";
    my $i = length $seq1;
    my $j = 0;

    defined( my $seq2 = <> ) or exit;
    chomp $seq2;

    $i-- and $j++ while substr( $seq1, $j ) ne substr( $seq2, 0, $i );
    print ' ' x $j, "$seq2\n";
}

Usage: perl script.pl InFile [>outFile]

The last, optional parameter will direct the output to a file:

Output on your dataset:

ATTCCTACTGGCCCATAATC
  TCCTACTGGCCCATAATCCA
ATTCCCACTGGCCCATAATC
         GGCCCATAATCCAGCTGTGG
AGTCCTACTGGCCCATAATC
       CTGGCCCATAATCCAGCTGT
ACTCCTACTGGCCCATAATC

The script works with lines in pairs. The while loop evaluates whether substrings of the two lines are equivalent. If not, each substring is shortened: the first one by one character at the beginning, and the second by one character at the end. If the substrings match, the match position (in $j) is used as a multiplier for leading spaces in front of the complete sequence that's printed, so alignment can be shown.

score 0 · Answer 2 · 2013-04-05

0

Entering edit mode

11.7 years ago

swbarnes2 14k

That output you quoted above is pretty close to what you want. Since the pattern number is [3], and the subject number is [1], it means you push s2 up by 2 spaces.

ADD COMMENT • link 11.7 years ago by swbarnes2 14k

0

Entering edit mode

Thanks swbarnes2: It is close but I have lost the first two bases form string 1. I need to retain all of the bases. I think my formatting was off in my initial post I changed the format so that what I want looks better.

ADD REPLY • link 11.7 years ago by random26078 ▴ 20