Parsing Ace File To Fasta Formated Alignement
4
1
Entering edit mode
14.7 years ago

Hi,

Following alignment of 454 data, I want to convert an ACE file containing contig data to a FASTA file containing the 'aligned' consensus sequences. By aligned, I mean that each sequence within a contig has the same length. Characters (-) are added as needed on each side of a sequence in order to make it the same length as the consensus sequence from it's contig.

What method would you suggest?

Thanks!

fasta • 8.5k views
ADD COMMENT
0
Entering edit mode

Can you rephrase your question? This does not really make sense to me.

ADD REPLY
0
Entering edit mode

Ah now I get it: you want to convert the output of a sequence assembly (ACE assembly file format) into a FASTA file containing the consensus sequences, right?

ADD REPLY
0
Entering edit mode

Right :) I rephrase to make a lesser mouthful!

ADD REPLY
3
Entering edit mode
14.7 years ago

Another quick and dirty bioperl solution. Well, not so quick. This script is quite sequential.

use Bio::Assembly;
use Bio::SeqIO;
use Bio::Seq;

$infile = shift or die;
$outfile = shift or die;

$seqio_obj = Bio::SeqIO->new(-file => ">$outfile", -format => 'fasta' );

$assembly_obj = Bio::Assembly::IO->new(-file=>"<$infile", -format=>'ace');

$assembly = $assembly_obj->next_assembly;

foreach $contig ($assembly->all_contigs) {

    $seq = $contig->get_consensus_sequence()

    $seq_obj = Bio::Seq->new(-seq => $seq->seq(),                        
                          -display_id => "ID_is_always_good",                        
                          -desc => "Say_smth_about_it",                        
                          -alphabet => "dna" );

    $seqio_obj->write_seq($seq_obj);

}

I'm not sure if it works cause I'm without any ACE files. Of course, you can write your own parser as ACE files are very simple in structure, as one can see here. Check out Bio::Assembly methods. There a lot of ready-to-use utilities for size, quality, features, etc. I'll check for a biopython solution.

ADD COMMENT
2
Entering edit mode
14.7 years ago

@Jarretinha,

Thanks for your BioPerl solution! It gave me the urge to look at Biopython could do for me, since I only speak snake...

Here is what I found:

from Bio.Sequencing import Ace
from Bio.Align.Generic import Alignment
from Bio.Alphabet import IUPAC, Gapped

with open(output_file, "w") as output_file:
    while 1:
        try:
            contig = ace_gen.next()
        except:
            print "***All contigs treated***"
            break
        align = Alignment(Gapped(IUPAC.ambiguous_dna, "-"))
        align.add_sequence(contig.name, contig.sequence)
        for readn in xrange(len(contig.reads)):
            clipst = contig.reads[readn].qa.qual_clipping_start
            clipe = contig.reads[readn].qa.qual_clipping_end
            start = contig.af[readn].padded_start
            seq = cut_ends(contig.reads[readn].rd.sequence, clipst, clipe)
            seq = pad_read(seq, start, len(contig.sequence))
        sequences = read_fasta(align.format("fasta"))
        contig_name = re.findall("(Contig_[0-9]+)", sequences[0][0])[0]
        # Put your code here to work with the contig's sequences

I removed a lot of comments from the code and added a few features. The original example can be found HERE among the Biopython pages.

Thanks again!

ADD COMMENT
0
Entering edit mode

It seems that Bio.Align.Generic is deprecated, anyone knows how to do it with the new version?

ADD REPLY
2
Entering edit mode
13.8 years ago
Ketil 4.1k

Some time ago I wrote a few tools to extract stuff from ACE files, including the contigs + quality, the assembly as Fasta with '-' for gaps (what you're asking for), and also the clusters (as list of input sequences, a la TGICL output).

Drop me a mail if you're interested.

ADD COMMENT
1
Entering edit mode
13.2 years ago

Although elegant, the Bioperl/Biopython solutions are slow and tend to keep too many contig objects in memory. A simple ACE to Fasta perl extractor (assuming you want the contig sequences) would be this:

#!/usr/bin/perl
use strict;
use warnings;


# CO contig00001 67140 1618 1666 U

my $infile = $ARGV[0];
my $outfile = $ARGV[1];

open INPUT, $infile or die $!;
open OUTPUT, ">$outfile" or die $!;

my $waitForHeader = 1;

while (my $line = <INPUT>) {
    if ($waitForHeader) {
        if ($line=~"^CO") {
            my @splitter = split (" ",$line);
            print OUTPUT ">"."$splitter[1]\n";
            $waitForHeader = 0;
        }
        else {
            next;
        }
    }
    else {
        if ($line=~"^BQ") {
            $waitForHeader=1;
        }
        else {
            unless ($line eq "\n") {
                $line=~s/\*/-/g;
                print OUTPUT $line;
            }    
        }
    }
}

close INPUT;
close OUTPUT;
ADD COMMENT

Login before adding your answer.

Traffic: 2156 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6