Question

Parsing ACE File To FASTA Formatted Alignment

1

15.1 years ago

Eric Normandeau 11k

Hi,

Following alignment of 454 data, I want to convert an ACE file containing contig data to a FASTA file containing the 'aligned' consensus sequences. By aligned, I mean that each sequence within a contig has the same length. Gap characters (-) are added as needed on each side of a sequence to make it the same length as the consensus sequence of its contig.

What method would you suggest?

Thanks!

fasta • 8.9k views

ADD COMMENT • link updated 18 months ago by Ram 45k • written 15.1 years ago by Eric Normandeau 11k

0

Can you rephrase your question? This does not really make sense to me.

ADD REPLY • link 15.1 years ago by Michael 55k

0

Ah now I get it: you want to convert the output of a sequence assembly (ACE assembly file format) into a FASTA file containing the consensus sequences, right?

ADD REPLY • link 15.1 years ago by Michael 55k

0

Right :) I rephrased the question to make it less of a mouthful!

ADD REPLY • link 15.1 years ago by Eric Normandeau 11k

Ram · Answer 1 · 2010-04-09

Another quick and dirty bioperl solution. Well, not so quick. This script is quite sequential.

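use strict;
use warnings;
use Bio::Assembly::IO;
use Bio::SeqIO;
use Bio::Seq;

my $infile  = shift or die "Usage: $0 <in.ace> <out.fasta>\n";
my $outfile = shift or die "Usage: $0 <in.ace> <out.fasta>\n";

# FASTA writer for the consensus sequences
my $seqio_obj = Bio::SeqIO->new(-file => ">$outfile", -format => 'fasta');

# ACE reader; next_assembly returns the assembly with all its contigs
my $assembly_obj = Bio::Assembly::IO->new(-file => "<$infile", -format => 'ace');
my $assembly = $assembly_obj->next_assembly;

foreach my $contig ($assembly->all_contigs) {

    my $seq = $contig->get_consensus_sequence();

    # The ID and description below are placeholders; $contig->id would
    # give each record a distinct name.
    my $seq_obj = Bio::Seq->new(-seq => $seq->seq(),
                                -display_id => "ID_is_always_good",
                                -desc => "Say_smth_about_it",
                                -alphabet => "dna");

    $seqio_obj->write_seq($seq_obj);

}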

I'm not sure it works, because I don't have any ACE files at hand. Of course, you can write your own parser, as ACE files are very simple in structure, as one can see here. Check out the Bio::Assembly methods. There are a lot of ready-to-use utilities for size, quality, features, etc. I'll check for a Biopython solution.
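To illustrate the "write your own parser" route, here is a minimal Python sketch that pulls only the consensus sequences out of an ACE file. The function names iter_consensus and write_fasta are made up for this example. It assumes that each contig starts with a CO line whose second field is the contig name, that the consensus follows immediately and ends at the first blank line, and that pads are written as '*'. Tag blocks and other unusual content are not handled.

```python
import io


def iter_consensus(handle):
    """Yield (contig_name, consensus) pairs from an open ACE file.

    Assumption: the consensus starts right after the CO line and
    ends at the first blank line.
    """
    name, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if line.startswith("CO "):
            # CO <name> <bases> <reads> <segments> <U|C>
            name, chunks = line.split()[1], []
        elif name is not None:
            if line:
                chunks.append(line)
            else:
                # Blank line: consensus is complete; ACE pads become gaps.
                yield name, "".join(chunks).replace("*", "-")
                name = None
    if name is not None:
        # File ended without a blank line after the last consensus.
        yield name, "".join(chunks).replace("*", "-")


def write_fasta(records, out, width=60):
    """Write (name, sequence) pairs as FASTA, wrapping lines at `width`."""
    for name, seq in records:
        out.write(">%s\n" % name)
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")
```

Typical use would be something like write_fasta(iter_consensus(open("in.ace")), open("out.fasta", "w")), where both file names are placeholders. Because it is a generator, only one consensus is held in memory at a time.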

Ram · Answer 2 · 2010-04-10

@Jarretinha,

Thanks for your BioPerl solution! It gave me the urge to look at what Biopython could do for me, since I only speak snake...

Here is what I found:

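import re

from Bio.Sequencing import Ace
# Note: Bio.Align.Generic and Bio.Alphabet exist only in older Biopython
# releases; both modules were removed from later versions.
from Bio.Align.Generic import Alignment
from Bio.Alphabet import IUPAC, Gapped

# input_file and output_file are placeholder names for the ACE input and
# the output path. cut_ends(), pad_read() and read_fasta() are helper
# functions that are not shown here and must be defined elsewhere.
ace_gen = Ace.parse(open(input_file))

with open(output_file, "w") as out:
    for contig in ace_gen:
        # Start each alignment with the contig's consensus sequence.
        align = Alignment(Gapped(IUPAC.ambiguous_dna, "-"))
        align.add_sequence(contig.name, contig.sequence)
        for readn in range(len(contig.reads)):
            clipst = contig.reads[readn].qa.qual_clipping_start
            clipe = contig.reads[readn].qa.qual_clipping_end
            start = contig.af[readn].padded_start
            # Mask the low-quality ends, then pad to the consensus length.
            seq = cut_ends(contig.reads[readn].rd.sequence, clipst, clipe)
            seq = pad_read(seq, start, len(contig.sequence))
            align.add_sequence("read%i" % (readn + 1), seq)
        sequences = read_fasta(align.format("fasta"))
        contig_name = re.findall("(Contig_[0-9]+)", sequences[0][0])[0]
        # Put your code here to work with the contig's sequences
    print("***All contigs treated***")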

I removed a lot of comments from the code and added a few features. The original example can be found HERE among the Biopython pages.
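The snippet calls cut_ends and pad_read without showing them. Here is a minimal sketch of what such helpers might look like, not necessarily identical to the ones in the Biopython example. It assumes 1-based, inclusive clipping coordinates and a 1-based padded start that may be zero or negative when a read begins before the consensus. Two choices here are my own assumptions: a read whose clipping window is invalid (for example -1 -1) is fully masked, and reads overhanging the right end of the consensus are truncated so every row has the same length.

```python
def cut_ends(read, start, end):
    """Replace bases outside the 1-based window [start, end] with gaps."""
    if start < 1 or end < start:
        # Assumption: an invalid window (e.g. -1 -1) means no usable bases.
        return "-" * len(read)
    return "-" * (start - 1) + read[start - 1:end] + "-" * (len(read) - end)


def pad_read(read, start, contig_length):
    """Pad (or trim) a read so that it spans exactly contig_length columns."""
    if start < 1:
        # The read begins before the consensus: drop the overhanging bases.
        seq = read[1 - start:]
    else:
        seq = "-" * (start - 1) + read
    # Fill the right side with gaps and cut anything past the consensus end.
    return seq.ljust(contig_length, "-")[:contig_length]
```

For example, a read of 8 bases with a quality window of 3 to 6 placed at padded position 3 in a 12-column contig becomes "----GTAC----" when the read is "ACGTACGT".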

Thanks again!

Ram · Answer 3 · 2011-03-31

2

14.1 years ago

Ketil 4.2k

Some time ago I wrote a few tools to extract stuff from ACE files, including the contigs + quality, the assembly as Fasta with '-' for gaps (what you're asking for), and also the clusters (as list of input sequences, a la TGICL output).

Drop me a mail if you're interested.

ADD COMMENT • link updated 18 months ago by Ram 45k • written 14.1 years ago by Ketil 4.2k

Ram · Answer 4 · 2011-10-21

Although elegant, the Bioperl/Biopython solutions are slow and tend to keep too many contig objects in memory. A simple ACE to Fasta perl extractor (assuming you want the contig sequences) would be this:

#!/usr/bin/perl
use strict;
use warnings;


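# Example of a contig header line in an ACE file:
# CO contig00001 67140 1618 1666 U

my ($infile, $outfile) = @ARGV;
die "Usage: $0 <in.ace> <out.fasta>\n" unless defined $outfile;

open my $in,  '<', $infile  or die "Cannot read $infile: $!";
open my $out, '>', $outfile or die "Cannot write $outfile: $!";

my $waitForHeader = 1;    # true until the next CO line is seen

while (my $line = <$in>) {
    if ($waitForHeader) {
        # A CO line starts a contig; its second field is the contig name.
        if ($line =~ /^CO\s/) {
            my @splitter = split ' ', $line;
            print $out ">$splitter[1]\n";
            $waitForHeader = 0;
        }
    }
    elsif ($line =~ /^BQ/) {
        # BQ (base qualities) marks the end of the consensus sequence.
        $waitForHeader = 1;
    }
    elsif ($line =~ /\S/) {
        # Consensus line: ACE pads (*) become FASTA gaps (-).
        $line =~ s/\*/-/g;
        print $out $line;
    }
}

close $in;
close $out;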
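The script above writes only the consensus sequences. The original question also asked for the reads padded to the consensus length, and the same streaming idea can be extended to do that. The following Python sketch is untested on real data and rests on several assumptions: AF lines (read name, orientation, padded start) appear before the RD blocks of the same contig, each sequence block ends at a blank line, and no tag block contains lines starting with CO, AF or RD. The names read_block, pad_to_contig and iter_aligned_contigs are invented for this example. Quality clipping (QA lines) is ignored.

```python
import io


def read_block(lines):
    """Join lines up to the next blank line, turning ACE pads into gaps."""
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            break
        chunks.append(line)
    return "".join(chunks).replace("*", "-")


def pad_to_contig(read, start, length):
    """Place a read at its 1-based padded start and fit it to `length`."""
    seq = read[1 - start:] if start < 1 else "-" * (start - 1) + read
    return seq.ljust(length, "-")[:length]


def iter_aligned_contigs(handle):
    """Yield (contig_name, rows) with rows = [(name, padded_seq), ...].

    The consensus is always the first row. Only one contig is held
    in memory at a time.
    """
    lines = iter(handle)
    name, rows, starts = None, [], {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CO":
            if name is not None:
                yield name, rows
            name, starts = fields[1], {}
            rows = [(name, read_block(lines))]
        elif fields[0] == "AF" and name is not None:
            # AF <read name> <C|U> <padded start>
            starts[fields[1]] = int(fields[3])
        elif fields[0] == "RD" and name is not None:
            read = read_block(lines)
            width = len(rows[0][1])
            rows.append((fields[1], pad_to_contig(read, starts[fields[1]], width)))
    if name is not None:
        yield name, rows
```

Each yielded contig can then be written out as one FASTA record per row, which gives one equal-length alignment per contig without loading the whole assembly.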