Question

How To Convert 454 Data To SAM Format?

14.1 years ago

Litali ▴ 50

Many viewers are built to read SAM format. How can I convert 454 output to this format? Thank you!

EDIT: The OP specifies that: 'It would be ok to have either the ACE file or the sff file in SAM format'.

viewer sam alignment • 9.3k views

Updated 11.2 years ago by Biostar 20

Hi Litali, it is rather unclear what you mean by '454 output', mostly since you want to put it in an alignment format. Are you referring to the .sff file that comes out of the Roche sequencer? Or maybe to the sequences once they are assembled, possibly in .ace format? Knowing this would help us help you. Cheers.

Eric Normandeau 11k • 14.1 years ago

added the script

Michael 55k • 14.1 years ago

Ram · Answer 1 · 2010-10-05

It is a totally justified question, though what is required is an alignment, not just a format conversion, and there are several possible pipelines. Knowing what data you have would also help a lot.
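To make the distinction concrete: a SAM record carries a reference name, a position and a CIGAR string, and only an aligner can fill those in. The Python sketch below is only an illustration (the function name and the sample read are made up). It writes a read as an unmapped SAM record, using flag 4 and the placeholder values the SAM specification defines for missing fields. The result is valid SAM, but a viewer has nothing to place on the reference.

```python
def unmapped_sam_line(name, seq, qual="*"):
    """Format one read as an unmapped SAM record (the 11 mandatory fields)."""
    fields = [
        name,  # QNAME: read name
        "4",   # FLAG: 0x4 = segment unmapped
        "*",   # RNAME: no reference sequence
        "0",   # POS: no position
        "0",   # MAPQ
        "*",   # CIGAR: not available without an alignment
        "*",   # RNEXT
        "0",   # PNEXT
        "0",   # TLEN
        seq,   # SEQ: read bases
        qual,  # QUAL: Phred+33 string, or "*" if qualities were discarded
    ]
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical read, for illustration only.
    print(unmapped_sam_line("read_1", "ACGT", "IIII"))
```

Everything a viewer needs to draw the read (RNAME, POS, CIGAR) is a placeholder here, which is why the steps that follow go through an aligner.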

You need data in fasta or fastq format and your reference genome in fasta format.

If your data is in .sff (Standard Flowgram Format), you have to convert it to fasta format using the sffinfo program that comes with the 454 software.

I have a rather old version of the GS FLX manual, and according to it sffinfo does not write a fastq file, only a fasta file and a separate quality file. Another option is sff_extract, but that doesn't give fastq either.

The two files can be combined into a fastq file using a simple perl script (I can post one if required), or you can discard the qualities and align the fasta file only.

Then align your 454 reads against the reference sequence/genome using alignment software that can output SAM format and works with "medium length" reads. One tool that directly aligns fasta and gives SAM is lastz, though you have to play with the switches.

BWA is another option but requires fastq; depending on read length, use the BWA-SW algorithm.
SSAHA2 was mentioned before.
SHRiMP supports both fastq and fasta and should also support longer reads.
There are many more tools here; your mileage may vary.
Keep in mind the read lengths of the 454 reads.
As read lengths vary with 454, I prefer a percent-wise identity cutoff over an absolute number of mismatches.
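A percent-identity filter on the resulting SAM records could look roughly like the Python sketch below. It assumes the aligner writes the optional NM tag (edit distance to the reference), and it uses one of several possible definitions of identity: alignment columns minus edits, divided by alignment columns. The function names and the 95% threshold are illustrative choices, not part of any tool.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_columns(cigar):
    """Count aligned columns: M, = and X, plus inserted (I) and deleted (D) bases."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MID=X")


def percent_identity(cigar, nm):
    """Identity in percent, given a CIGAR string and the NM edit distance."""
    cols = alignment_columns(cigar)
    if cols == 0:
        raise ValueError("CIGAR string has no aligned columns")
    return 100.0 * (cols - nm) / cols


def passes_cutoff(cigar, nm, min_identity=95.0):
    """True if the alignment reaches the (arbitrary) identity threshold."""
    return percent_identity(cigar, nm) >= min_identity


if __name__ == "__main__":
    # The same four edits weigh differently on a short and a long read.
    print(percent_identity("100M", 4))  # 96.0
    print(percent_identity("400M", 4))  # 99.0
```

With a fixed limit of, say, four mismatches, both reads above would be treated the same, although the 100-base read is noticeably less similar to the reference than the 400-base read.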

Simple as that ;)

Edit: here is a simple perl script that makes a fastq file out of a fasta file and a quality file. It is not well tested, and it fails miserably if the headers and data in the fasta and qual files do not match exactly.

#!/usr/bin/env perl

use strict;
use warnings;

die "Usage: fasta2fastq <fasta.file> <qual.file>\n" unless @ARGV == 2;

open my $fasta, '<', $ARGV[0] or die "cannot open fasta: $!\n";
open my $qual,  '<', $ARGV[1] or die "cannot open qual: $!\n";

my $offset = 33; # Sanger FASTQ stores each score as chr(score + 33); change this if required!
my $count = 0;

local $/ = "\n>"; # read both files one FASTA-style record at a time
while (my $fastarec = <$fasta>) {
  chomp $fastarec;
  # the first line of a record is the header, the rest is sequence
  my ($fid, @seq) = split "\n", $fastarec;
  my $seq = join "", @seq; $seq =~ s/\s//g;
  my $qualrec = <$qual>;
  die "qual file has fewer records than fasta file\n" unless defined $qualrec;
  chomp $qualrec;
  my ($qid, @qual) = split "\n", $qualrec;
  # split ' ' skips leading whitespace, so no empty score is produced
  @qual = split ' ', join(" ", @qual);
  # only the first record in each file still carries the leading '>'
  s/^>// for $fid, $qid;
  die "mismatch of fasta and qual: '$fid' ne '$qid'\n" if $fid ne $qid;
  # convert each score to its character code
  my $quals = join "", map { chr($_ + $offset) } @qual;
  print join("\n", '@' . $fid, $seq, "+$fid", $quals), "\n";
  $count++;
}
close $fasta;
close $qual;
print STDERR "wrote $count entries\n";
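As a small worked example of the score-to-character step, here is the same encoding as a Python sketch, assuming Sanger-style Phred+33 qualities. The function names and the sample read are made up for illustration. The '+' line is left bare here; repeating the read name after it, as the script does, is optional in FASTQ.

```python
def encode_phred(scores, offset=33):
    """Turn a list of integer Phred scores into a FASTQ quality string."""
    return "".join(chr(score + offset) for score in scores)


def fastq_record(name, seq, scores, offset=33):
    """Build one four-line FASTQ record from a sequence and its scores."""
    quals = encode_phred(scores, offset)
    if len(quals) != len(seq):
        raise ValueError("sequence and quality lengths differ")
    return f"@{name}\n{seq}\n+\n{quals}\n"


if __name__ == "__main__":
    # Hypothetical read: scores 40, 30, 20, 10 become 'I', '?', '5', '+'.
    print(fastq_record("read_1", "ACGT", [40, 30, 20, 10]), end="")
```

The length check catches the case where a fasta record and its quality record do not describe the same read, which is the situation the script warns about.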

Ram · Answer 2 · 2010-10-05

14.1 years ago

Ian 6.1k

This may only partially help, but SSAHA2 reportedly outputs SAM format.

A similar question has also been previously posted on BioStar.

Updated 5.8 years ago by Ram 44k

score 1 · Answer 3 · 2010-10-11

14.1 years ago

Casbon ★ 3.3k

I had success with glu genetics, but you might need to fight the installer, as noted in the question I asked and answered.

score 0 · Answer 4 · 2010-11-06

14.1 years ago

Lhl ▴ 760

Try the Mosaik aligner. It is well designed for working with 454 data and supports SAM format (although you need to use mosaiktext to convert mosaikalign.dat to SAM format).
