Question

Please Suggest Good Read To Understand Sff To Fasta Conversion


14.2 years ago

Sashi Kiran Challa ▴ 310

Hello all,

I have previously worked with EST sequences and with tools for assembly, gene finding, etc. I am now interested in filling the gaps in my knowledge of sequencing (especially Roche 454). I only know that once sequencing is done an SFF file is produced, and that it is converted to FASTA for further analysis such as assembly. There are tools that do this conversion, and many of them have options like library name, insert size, and insert size standard deviation. What are these options, and how are their values decided? Is there a good paper or documentation for understanding this (SFF to FASTA) conversion?

Also, is there any good read for understanding linked pairs, paired-end reads, and the splitting of linked pairs, and why it is done?

Thanks for your help,

Regards Sashi

fasta paired • 3.0k views


score 5 · Answer 1 · 2010-09-11

The best place to gain an understanding of clone libraries (and therefore terms like library, library insert size, paired-end read and so on) is an introductory molecular biology textbook.

The library name is an identifier string for the clone library that was sequenced. Insert size is the length in bases of a cloned insert in the library. What this means in practice depends on the context: it could be the precise size of a single insert, or it could be the target insert size aimed for during library preparation in the laboratory. In reality, the library contains a range of insert sizes. That is where insert size standard deviation comes in: the insert sizes are assumed to follow a normal (bell curve) distribution, and this value is the standard deviation of that distribution.
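To make the two numbers concrete, here is a minimal sketch of how an insert size and its standard deviation could be summarised from a set of measured inserts. The sizes are made up for illustration, and the function name `summarise_insert_sizes` is hypothetical; in practice you would normally take these values from the library preparation notes rather than compute them yourself.

```python
import statistics

def summarise_insert_sizes(sizes):
    """Return (mean, standard deviation) of a list of insert sizes in bases."""
    mean_size = statistics.mean(sizes)
    # Sample standard deviation: the spread of insert sizes around the mean.
    sd_size = statistics.stdev(sizes)
    return mean_size, sd_size

# Hypothetical insert sizes (in bases) for a nominal 3 kb library.
insert_sizes = [2800, 3100, 2950, 3200, 3000, 2950]

mean_size, sd_size = summarise_insert_sizes(insert_sizes)
print(f"insert size: {mean_size} bases, standard deviation: {sd_size:.1f} bases")
```

The mean is what a tool would usually expect as "insert size", and the standard deviation tells it how far the distance between two paired reads may plausibly deviate from that.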

You don't say what software you are using to convert SFF to Fasta, so I can't say how these values relate to your conversion task.

As to your specific question about the method of SFF to Fasta conversion: the SFF specification describes the structure of the binary file. For each read, SFF data contains a sequence of flow values that estimate homopolymer run lengths, plus the basecalled nucleotide sequence and base quality values. Converting to Fasta means discarding everything except the read name and the basecalled nucleotide sequence.
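As a rough sketch of that last step, assume the binary file has already been parsed into simple records. The `SffRead` type and the `to_fasta` function below are hypothetical names for illustration, not part of any SFF tool, and the example read is invented. The conversion then keeps only two fields. (For real files, a library such as Biopython can read the binary format; its `Bio.SeqIO` module lists `sff` among its supported formats.)

```python
from typing import Iterable, List, NamedTuple

class SffRead(NamedTuple):
    """Hypothetical in-memory view of one parsed SFF read (not a parser)."""
    name: str
    bases: str              # basecalled nucleotide sequence
    qualities: List[int]    # one quality value per base
    flowgram: List[float]   # flow values (homopolymer run length estimates)

def to_fasta(reads: Iterable[SffRead], width: int = 60) -> str:
    """Format reads as Fasta text, keeping only the name and the bases."""
    lines = []
    for read in reads:
        lines.append(">" + read.name)
        # Wrap the sequence at a fixed line width.
        for start in range(0, len(read.bases), width):
            lines.append(read.bases[start:start + width])
    # Qualities and flow values are simply never written out.
    return "".join(line + "\n" for line in lines)

# Made-up example read.
example = SffRead(
    name="read1",
    bases="ACGTACGT",
    qualities=[30] * 8,
    flowgram=[1.02, 0.05, 0.98, 0.03],
)
print(to_fasta([example]), end="")
```

Because the quality and flow values are lost, the conversion is one-way: the original SFF data cannot be rebuilt from the Fasta output.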

Here is a post from the Biohaskell blog illustrating this nicely.