Question

Extracting and joining exons from multiple sequence alignments using PERL

8.6 years ago

Eskimo ▴ 120

Hi all,

I would really appreciate any help with the following coding problem in Perl. My input is a multiple sequence alignment (FASTA format), along with a file containing the co-ordinates of all open reading frames (ORFs) and the exons that make them up. I want to slice the ORFs out of the alignment according to those co-ordinates. So far I have code that works well for single-exon ORFs (and I have no problem with reverse complementing etc.), but the problem lies in extracting and concatenating multi-exon ORFs.

Thus far my code reads in the multiple sequence alignment as follows

use Bio::SimpleAlign;
use Bio::AlignIO;
$str = Bio::AlignIO->new(-file => $inputfilename, -format => 'fasta');
$aln = $str->next_aln();

and deals with the splicing as follows

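# slice() takes 1-based, inclusive alignment columns
$mini = $aln->slice($array[0], $array[1]);
# the leading ">" opens the file for writing; without it AlignIO opens it for reading
$out = Bio::AlignIO->new(-file => ">$array[3]",
                         -format => 'fasta');
$out->write_aln($mini);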

An example of the input file containing the co-ordinates looks like this:

Start Stop Strand Name Note
24 89 + ORF1 exon1
165 560 - ORF2 exon1
680 1004 + ORF3 exon1
1240 1760 + ORF3 exon2
1790 2360 + ORF3 exon3
2600 2900 - ORF4 exon1
2850 3100 + ORF5 exon1

Would anyone know of a clever way to extract the individual exons for ORF3 and then concatenate the resulting files side by side (so that the multiple sequence alignment is not compromised)? My initial thought was to convert the co-ordinate file to a GFF-type file and use Bio::Tools::GFF, but I don't think this is compatible with a multiple sequence alignment as the input.
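One possible first step, sketched here in core Perl only, is to group the exon co-ordinates by ORF name before touching the alignment, so that all exons of ORF3 are available together. The helper name `group_exons` and the hash layout are illustrative assumptions, not part of BioPerl, and the sketch assumes whitespace-separated columns with no spaces inside the ORF name.

```perl
use strict;
use warnings;

# Hypothetical helper: group exon co-ordinates by ORF name, keeping file order.
# Assumes columns: Start Stop Strand Name Note, separated by whitespace.
sub group_exons {
    my @lines = @_;
    my (%exons, @order);
    for my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*$/ or $line =~ /^Start\b/;    # skip blanks and header
        my ($start, $stop, $strand, $name, $note) = split ' ', $line;
        push @order, $name unless exists $exons{$name};     # remember first appearance
        push @{ $exons{$name} }, { start => $start, stop => $stop, strand => $strand };
    }
    return (\%exons, \@order);
}

# A few lines taken from the example co-ordinate file above
my @coords = (
    "Start Stop Strand Name Note",
    "24 89 + ORF1 exon1",
    "680 1004 + ORF3 exon1",
    "1240 1760 + ORF3 exon2",
    "1790 2360 + ORF3 exon3",
);
my ($exons, $order) = group_exons(@coords);
for my $name (@$order) {
    my @spans = map { "$_->{start}-$_->{stop}" } @{ $exons->{$name} };
    print "${name}: @spans\n";
}
```

With the exons grouped like this, each ORF can be processed as one unit: slice every exon, join the pieces, write one file, then move on to the next ORF.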

Any help would be hugely appreciated!

multiple-sequence-alignment PERL exons fasta


score 0 · Answer 1 · 2016-04-30

I'm not sure what you mean by "side by side" concatenation, but here's an idea to include exons from the same ORF in one file:

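my (%seen, $out);    # declare once, before the loop over the co-ordinate lines

# inside the loop: open a new output file only the first time an ORF name is seen
unless ($seen{$array[3]}++) {
    # the leading ">" opens the file for writing
    $out = Bio::AlignIO->new(-file => ">$array[3]", -format => 'fasta');
}

$mini = $aln->slice($array[0], $array[1]);
$out->write_aln($mini);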

score 0 · Answer 2 · 2016-04-30

8.6 years ago

anp375 ▴ 190

Can you make a 2D array with blank spaces?

score 0 · Answer 3 · 2016-05-01

Hi folks, thanks for the answers so far. I just want to clarify my question a little further. Assume my multiple sequence alignment looks like this

genomeA

ACTCGAGCTATCGATCGATCATGCGAGCGCTACTAAATTTCATCGAGCGTATTCTATCTGAGCTAGCATCTTCA

genomeB

ACTCGAGCTATCGATCGATCATGCG---GCTACTATCTCATTCATCGAGCGTATTCGTGCTGAGCTAGCATCTTCA

genomeC

ACTCGAGCTATCTGCGATCATGCGAGCGCTACTATCTCATTCATCGAGCGTAATGTATCTGCTCTAGCATCTTCA

If I have a single exon between aligned positions 5 and 25, I can splice out the MSA in that region with

$mini = $aln->slice(5, 25);
$out = Bio::AlignIO->new(-file => ">$array[3]", -format => 'fasta');
$out->write_aln($mini);

which would give me

genomeA

GAGCTATCGATCGATCATGC

genomeB

GAGCTATCGATCGATCATGC

genomeC

GAGCTATCTGCGATCATGCG

If I then have a second exon between positions 30 and 35, I would splice that out with

$mini = $aln->slice(30, 35);
$out = Bio::AlignIO->new(-file => ">$array[3]", -format => 'fasta');
$out->write_aln($mini);

which would give me

genomeA

GCTAC

genomeB

GCTAC

genomeC

GCTAC

What I then want to do is join the two MSA outputs into one, so I get

genomeA

GAGCTATCGATCGATCATGC*GCTAC

genomeB

GAGCTATCGATCGATCATGC*GCTAC

genomeC

GAGCTATCTGCGATCATGCG*GCTAC

(Note that I don't want the asterisks, I am just indicating where the sequences are being joined)

What I need is a piece of code that will concatenate the spliced regions from the same ORF (i.e. ORF3 exon1,2&3) but will then write this to the outfile and start again when it reaches a new ORF (i.e. ORF4).

Hopefully that is a little clearer.

Any further comments would be most appreciated.
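The joining step described above could be sketched as follows, again in core Perl so it runs without BioPerl. It cuts the same columns out of every aligned sequence and joins the pieces per sequence, which keeps the rows in register. The helper `join_exons` and the toy alignment are made up for illustration; co-ordinates are treated as 1-based and inclusive, as `slice()` does. With BioPerl alignment objects, the `cat` function in Bio::Align::Utilities is worth checking as an alternative, since it is documented as concatenating alignment objects by sequence id.

```perl
use strict;
use warnings;

# Hypothetical helper: cut each exon's columns (1-based, inclusive) out of
# every aligned sequence and join the pieces in the order the exons are given.
sub join_exons {
    my ($aln, $exons) = @_;    # $aln: { id => aligned string }, $exons: [ [start, stop], ... ]
    my %joined;
    for my $id (keys %$aln) {
        $joined{$id} = join '',
            map { substr($aln->{$id}, $_->[0] - 1, $_->[1] - $_->[0] + 1) } @$exons;
    }
    return \%joined;
}

# Toy alignment (made-up data; all rows have the same length)
my %aln = (
    genomeA => 'ACTCGAGCTATCGATC',
    genomeB => 'ACTCG---TATCGTTC',
    genomeC => 'ACTGGAGCTA-CGATC',
);

# Two exons of one hypothetical ORF: columns 3-6 and 10-13
my $joined = join_exons(\%aln, [ [3, 6], [10, 13] ]);

# Print one FASTA record per genome for this ORF
for my $id (sort keys %$joined) {
    print ">$id\n$joined->{$id}\n";
}
```

To start again at each new ORF, the same call can be repeated once per ORF name, passing only that ORF's exons and writing to a file named after it. Minus-strand ORFs would still need reverse complementing after the join, which this sketch leaves out.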