How To Shift Alignment Of Short Sequences
2
1
Entering edit mode
11.7 years ago
random26078 ▴ 20

I have a file with multiple sequences like this:

ATTCCTACTGGCCCATAATC
TCCTACTGGCCCATAATCCA
ATTCCCACTGGCCCATAATC
GGCCCATAATCCAGCTGTGG
AGTCCTACTGGCCCATAATC
CTGGCCCATAATCCAGCTGT
ACTCCTACTGGCCCATAATC

I need to line up sequences 2-N in relation to sequence 1 so that my output file looks like this:

ATTCCTACTGGCCCATAATC
  TCCTACTGGCCCATAATCCA
ATTCCCACTGGCCCATAATC
         GGCCCATAATCCAGCTGTGG
AGTCCTACTGGCCCATAATC
       CTGGCCCATAATCCAGCTGT
ACTCCTACTGGCCCATAATC

I have tired to use BioStrings but get this:

library(Biostrings)
 s1 = DNAString("ATTCCTACTGGCCCATAATC")
 s2 = DNAString("TCCTACTGGCCCATAATCCA")
 pairwiseAlignment(s1, s2)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [3] TCCTACTGGCCCATAATC 
subject: [1] TCCTACTGGCCCATAATC 
score: -0.3283901

Which is not what I want. I tested all of the alignment methods in Biostrings and they all give the same answer. I have also looked at BioPerl options but none seem to work the way I need.

What I want to do is to introduce spaces into the beginning of the sequence to make the sequences line up when printed one on top of the other. I do not want to introduce gaps and I do not want to lose any of the bases in any sequence. I just want them shifted.

Does anyone know a way to do this?

perl bioperl r bioconductor • 2.7k views
ADD COMMENT
3
Entering edit mode
11.7 years ago
Kenosis ★ 1.3k

Perhaps the following will be helpful:

use strict;
use warnings;

while ( my $seq1 = <> ) {
    chomp $seq1;
    print "$seq1\n";
    my $i = length $seq1;
    my $j = 0;

    defined( my $seq2 = <> ) or exit;
    chomp $seq2;

    $i-- and $j++ while substr( $seq1, $j ) ne substr( $seq2, 0, $i );
    print ' ' x $j, "$seq2\n";
}

Usage: perl script.pl InFile [>outFile]

The last, optional parameter will direct the output to a file:

Output on your dataset:

ATTCCTACTGGCCCATAATC
  TCCTACTGGCCCATAATCCA
ATTCCCACTGGCCCATAATC
         GGCCCATAATCCAGCTGTGG
AGTCCTACTGGCCCATAATC
       CTGGCCCATAATCCAGCTGT
ACTCCTACTGGCCCATAATC

The script works with lines in pairs. The while loop evaluates whether substrings of the two lines are equivalent. If not, each substring is shortened: the first one by one character at the beginning, and the second by one character at the end. If the substrings match, the match position (in $j) is used as a multiplier for leading spaces in front of the complete sequence that's printed, so alignment can be shown.

ADD COMMENT
0
Entering edit mode

This is the output I get:

ATTCCTACTGGCCCATAATC
                      TCCTACTGGCCCATAATCCA
ATTCCCACTGGCCCATAATC
                             GGCCCATAATCCAGCTGTGG
AGTCCTACTGGCCCATAATC
                           CTGGCCCATAATCCAGCTGT
ACTCCTACTGGCCCATAATC
ADD REPLY
0
Entering edit mode

I am getting the same result as Hernan is showing. How is it that your output lines up correctly?

ADD REPLY
0
Entering edit mode
11.7 years ago

That output you quoted above is pretty close to what you want. Since the pattern number is [3], and the subject number is [1], it means you push s2 up by 2 spaces.

ADD COMMENT
0
Entering edit mode

Thanks swbarnes2: It is close but I have lost the first two bases form string 1. I need to retain all of the bases. I think my formatting was off in my initial post I changed the format so that what I want looks better.

ADD REPLY

Login before adding your answer.

Traffic: 1861 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6