Please Suggest Good Read To Understand Sff To Fasta Conversion
14.2 years ago

Hello all,

I have previously worked with EST sequences and with tools for assembly, gene finding, etc. I am now interested in filling the gaps in my knowledge of sequencing (especially Roche 454). All I know is that once sequencing is done, an SFF file is obtained, and that it is converted to fasta for further analysis such as assembly. There are tools that do this conversion, and many of them have options like library name, insert size, and insert size standard deviation. May I know what these options are and how they are decided? Is there any good paper or documentation for understanding this (sff to fasta) conversion?

Also, is there any good read for understanding linked pairs, paired-end reads, and the splitting of linked pairs (and why it is done)?

Thanks for your help,

Regards, Sashi

fasta paired • 3.0k views
14.2 years ago

The best place to gain an understanding of clone libraries (and therefore terms like library, library insert size, paired-end read and so on) is an introductory molecular biology textbook.

The library name is an identifier string for the clone library that was sequenced. Insert size is the length in bases of a cloned insert in the library. What this means depends on the context: it could be the precise size of a single insert, or it could be the target insert size aimed for during library preparation in the laboratory. In reality, the library contains a range of insert sizes. Hence the insert size standard deviation: the insert sizes are assumed to follow a normal (bell curve) distribution, and this value is the standard deviation of that distribution.
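To make the mean/standard deviation idea concrete, here is a minimal Python sketch that draws insert sizes from a normal distribution and then recovers the two parameters from the sample. The function name `simulate_insert_sizes` and the figures (a 3000-base target with a 300-base standard deviation) are made up for illustration; they are not values from any particular library or tool.

```python
import random
import statistics


def simulate_insert_sizes(mean, stdev, n, seed=0):
    """Draw n hypothetical insert sizes (in bases) from a normal distribution."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Round to whole bases and clamp at 1 so no insert has a non-positive length.
    return [max(1, round(rng.gauss(mean, stdev))) for _ in range(n)]


# Assumed, illustrative library parameters (not from the original post).
TARGET_INSERT_SIZE = 3000
INSERT_SIZE_STDEV = 300

sizes = simulate_insert_sizes(TARGET_INSERT_SIZE, INSERT_SIZE_STDEV, 10000)

# These two summary numbers are what "insert size" and
# "insert size standard deviation" options typically describe.
observed_mean = statistics.mean(sizes)
observed_stdev = statistics.stdev(sizes)

print(f"mean insert size: {observed_mean:.0f} bases")
print(f"standard deviation: {observed_stdev:.0f} bases")
```

With a real library you would not simulate anything: the two values usually come from the library preparation protocol, or are estimated from the data.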

You don't say what software you are using to convert SFF to Fasta, so I can't say how these values relate to your conversion task.

As to your specific question about the method of SFF to Fasta conversion: the SFF specification describes the structure of the binary file. For each read, SFF data contains a sequence of estimates of homopolymer run lengths (the flowgram), plus the basecalled nucleotide sequence and base quality values. Converting to Fasta means discarding all data except the read name and the basecalled nucleotide sequence.
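As a rough illustration of that "keep the name and bases, drop the rest" step, here is a small Python sketch. It does not parse the binary SFF format; the `SffRead` class, its field names and the example values are hypothetical stand-ins for one already-decoded read.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SffRead:
    """Hypothetical in-memory view of one decoded SFF read."""
    name: str
    bases: str             # basecalled nucleotide sequence
    qualities: List[int]   # one quality value per base
    flowgram: List[float]  # homopolymer run-length estimates, one per flow


def to_fasta(read, width=60):
    """Format a read as a Fasta record, keeping only its name and bases."""
    lines = [">" + read.name]
    # Wrap the sequence at a fixed width, as is conventional for Fasta.
    for start in range(0, len(read.bases), width):
        lines.append(read.bases[start:start + width])
    return "\n".join(lines) + "\n"


# Made-up example read; qualities and flowgram are simply not written out.
example = SffRead(
    name="read_0001",
    bases="TCAGACGT",
    qualities=[30, 30, 30, 30, 28, 27, 25, 22],
    flowgram=[1.02, 0.98, 1.01, 0.97, 1.03, 0.05, 0.99, 1.00],
)

print(to_fasta(example), end="")
```

In practice you would let an existing library do the decoding; Biopython's `Bio.SeqIO`, for example, reads the `"sff"` format. The sketch only shows what information survives the conversion.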

Here is a post from the Biohaskell blog illustrating this nicely.


Thanks so much, Keith.
