Question

How To Convert Fastq/Sff To Fasta With Quality Cut Off

3

13.6 years ago

toshnam ▴ 650

Hi all~

I want to extract sequences from FASTQ/SFF files with a quality cutoff, to avoid poor-quality reads.
I checked, and sffinfo doesn't seem to have such an option.
Thanks in advance.

fasta fastq • 7.3k views

updated 5.9 years ago by msimmer92 ▴ 310 • written 13.6 years ago by toshnam ▴ 650

score 6 · Answer 1 · 2011-04-08

13.6 years ago

Michael 55k

The FASTX toolkit can do this. First filter the FASTQ file for quality using fastq_quality_filter, then run fastq_to_fasta on the filtered output.
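If you'd rather see the two steps spelled out, here is a minimal Python sketch of the same idea: keep a read only if a minimum percentage of its bases reach a minimum quality, then write the survivors as FASTA. This is not the FASTX code. The function names, the default thresholds (Q20, 80%) and the Phred+33 offset are assumptions for illustration, and the parser assumes plain 4-line FASTQ records.

```python
import io
import sys


def parse_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def passes_filter(qual, min_q=20, min_percent=80, offset=33):
    """True if at least min_percent of the bases have Phred quality >= min_q.

    offset=33 assumes Sanger / Illumina 1.8+ encoding (an assumption).
    """
    if not qual:
        return False
    good = sum(1 for ch in qual if ord(ch) - offset >= min_q)
    return 100.0 * good / len(qual) >= min_percent


def fastq_to_filtered_fasta(fastq_handle, fasta_handle, **filter_args):
    """Write reads that pass the filter as FASTA; return how many were kept."""
    kept = 0
    for name, seq, qual in parse_fastq(fastq_handle):
        if passes_filter(qual, **filter_args):
            fasta_handle.write(f">{name}\n{seq}\n")
            kept += 1
    return kept


if __name__ == "__main__":
    # Tiny in-memory demo: r1 is all Q40 ('I'), r2 is all Q0 ('!').
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
    fastq_to_filtered_fasta(demo, sys.stdout)  # only r1 is written
```

For real files you would pass open file handles instead of the in-memory demo, and check which quality encoding your data uses before trusting the offset.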


score 2 · Answer 2 · 2011-04-08

There are many tools that will examine sequence (phred) quality, and trim accordingly. Fastx suggested by Michael is one option, but you could also look into traditional tools like SeqClean etc. One danger of only looking at phred qualities is that you introduce a bias, since for 454, long homopolymers are likely to have lower quality than short ones.

454 SFF files already carry quality trim points, so extracting only the region inside those trim points will rid you of a lot of the low-quality sequence already. If you are willing to sacrifice a lot of sequence, you can also trim to a fixed number of flow cycles. Roche's sfffile tool will let you do this, I think, or it's easy to write your own. (454 reads have very good quality at the start, but it degrades towards the end.)
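To make "extracting the trimmed part" concrete, here is a small sketch of applying a pair of trim points to a read. It assumes the usual SFF convention that clip positions are 1-based and inclusive, with 0 meaning "not set"; the function name is made up for illustration, and adapter clip points are ignored.

```python
def apply_trim(seq, clip_left, clip_right):
    """Return the part of seq between 1-based, inclusive trim points.

    Assumption: 0 means the trim point is not set, so the read
    start (left) or read end (right) is used instead.
    """
    start = max(1, clip_left)
    end = clip_right if clip_right > 0 else len(seq)
    end = min(end, len(seq))
    return seq[start - 1:end]
```

For example, `apply_trim("ACGTACGT", 3, 6)` keeps bases 3 through 6 of the read.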

score 1 · Answer 3 · 2011-04-08

There are a couple of ways to remove low-quality base calls. One is to filter out entire reads that fail some quality criterion. The fastq_quality_filter from the FastX Toolkit implements that approach. Since most reads will exhibit quality degradation toward the end, another approach is to trim off the ends of reads, either to a fixed length (fastx_trimmer in the FastX Toolkit), or by using some sort of quality threshold. I'm partial to using a quality threshold since it improves data quality while maximizing the amount of data retained. The Galaxy analysis system has a tool that lets you do this.
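As a rough illustration of threshold-based end trimming, the sketch below strips bases from the 3' end until it reaches one at or above the cutoff. It is only one simple variant, not necessarily what the Galaxy tool does; the function name, the Q20 default and the Phred+33 offset are assumptions.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred quality is below min_q.

    Stops at the first base (scanning from the 3' end) that meets
    the threshold; low-quality bases inside the read are left alone.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]
```

Unlike whole-read filtering, this keeps the good 5' portion of a read whose tail has degraded. In practice you would usually also discard reads that end up shorter than some minimum length.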

Ketil is correct in that you can introduce a bias, so you should consider where your data came from (what platform, genome, library prep, etc.) and what you want to do with it. For example, in a ChIP-Seq or other "counting" experiment, a small number of errors may not make much difference. However, with variant analysis, the errors could start to look like low-frequency SNPs. Even worse, with assembly, the errors are likely to cause mis-assembly or, at best, slow things down a lot.

score 1 · Answer 4 · 2011-04-09

I haven't seen such an option in the sffinfo tool either, but as ketil mentioned, you can use the sfffile tool for a quite similar purpose.

With the -t option you "will merge the given trimpoints with any existing trim points for the input read, writing the largest starting trimpoint and smallest ending trimpoint into the output SFF file" and with the -tr option you will "reset the trimpoints, using only the trimpoint information occurring in this file"
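The merge rule quoted for -t (largest starting trim point, smallest ending trim point) amounts to intersecting the two kept regions. A tiny sketch, with a hypothetical function name and assuming both pairs of trim points are actually set:

```python
def merge_trimpoints(existing, new):
    """Combine two (start, end) trim point pairs as described for -t:
    keep the largest start and the smallest end.

    Assumption: both pairs are set (no 0 / "unset" values).
    """
    return max(existing[0], new[0]), min(existing[1], new[1])
```

So merging never lengthens a read; it can only keep it the same or make it shorter.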

With sffinfo you can then extract the FASTA file from the new ("trimmed") SFF file.

Another option (I don't know whether it would really be useful) is to convert the SFF file to an SCF file with the sff2scf command and use the SCF file in consed (phred).

score 0 · Answer 5 · 2019-01-13

5.9 years ago

msimmer92 ▴ 310

Question: Isn't there a simple way of doing this in Samtools, nowadays? Let's say, if I have a fastq file and I wanted to filter out the reads with q<20. Thanks! (P.S. I'm working on ChIPseq, not RNA-seq)
