Question

Fastq Quality Check

5

13.7 years ago

toshnam ▴ 650

Hi all,

I'm trying to check the sequencing quality of a FASTQ file from a HiSeq 2000. I used the fastx_quality_stats script from FASTX-Toolkit (version 0.0.13), but I got the following error:

$ fastx_quality_stats -i 6_1.fastq -o 6_1.stats
fastx_quality_stats: Invalid quality score value (char '#' ord 35 quality value -29) on line 4

The FASTQ file really does contain the "#" character:

@HWI-ST621:210:C03D4ACXX:4:1101:1475:1957 1:N:0:ATCACG
NACTACAATTTACAGATAACTTTAAATTAAATTTTGGAATCAAATATAAAGATTGAAAATGAATTTTGAATATATGAAAATCCATTTAAAGAGTTTGGTAC
+
#1=DDDFFHHDHHIIIJJEHIJJJJJIIIJFIGGJJJFICGIGGGIIJIEIIIIJIJIIIIHIIIJIGGIJIIIJGHIEHJJJHHHHHHHFFF;B@CA;;@

Is the "#" character an invalid quality score value? I heard that this FASTQ file was checked with the quality trim program of the NGS Cell package from CLC bio, and that the sequencing quality was good. So is the "#" character invalid only for FASTX-Toolkit?

I also used the PoPoolation toolbox (version 1.2.2) to quality-trim the FASTQ files, and got the following results:

$ trim-fastq.pl --input1 6_1.fastq --input2 6_2.fastq --output trimmed

......................................................

FINISHED: end statistics
Read-pairs processed: 53675033
Read-pairs trimmed in pairs: 0
Read-pairs trimmed as singles: 0


FIRST READ STATISTICS
First reads passing: 0
5p poly-N sequences trimmed: 632578
3p poly-N sequences trimmed: 0
Reads discarded during 'remaining N filtering': 0
Reads discarded during length filtering: 53675033
Count sequences trimed during quality filtering: 53675033

Read length distribution first read
length  count


SECOND READ STATISTICS
Second reads passing: 0
5p poly-N sequences trimmed: 628623
3p poly-N sequences trimmed: 801
Reads discarded during 'remaining N filtering': 0
Reads discarded during length filtering: 53675033
Count sequences trimed during quality filtering: 53675033

Read length distribution second read
length  count

As you can see, all of the reads were discarded during quality trimming.
I've worked with GAII and HiSeq 2000 sequence data before, but this is the first time I've seen this. I wonder whether the problem is caused by bad sequencing quality or by a mistake on my part.

I appreciate any help.
Thanks.

fastq fastx • 22k views

ADD COMMENT • link updated 13.7 years ago by Rm 8.3k • written 13.7 years ago by toshnam ▴ 650

2

Solution 1: use an alternative program such as FastQC. Solution 2: use the -Q33 option with FASTX-Toolkit. Thanks, guys :-)

ADD REPLY • link 13.7 years ago by toshnam ▴ 650

score 7 · Answer 1 · 2011-10-28

7

13.7 years ago

Rm 8.3k

Try adding the -Q33 option to the fastx command and run it again:

fastx_quality_stats -Q33 -i 6_1.fastq -o 6_1.stats

ADD COMMENT • link 13.7 years ago by Rm 8.3k

score 4 · Answer 2 · 2011-10-28

4

13.7 years ago

toni ★ 2.2k

This looks like a quality-encoding problem in your file.

Judging by the error (35 - 64 = -29), FASTX-Toolkit assumes that your file uses the Illumina 1.3+ encoding, whereas your file seems to use the Sanger encoding, which has an offset of 33 instead of 64.

Read this for further information on quality score encodings:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/

There may be options in fastx-toolkit to handle this.
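To make the offset arithmetic concrete, here is a minimal Python sketch that decodes the quality string from the question with both offsets. The helper name `decode_quality` is made up for illustration and is not part of any toolkit.

```python
# Quality string of the first read shown in the question.
QUAL = "#1=DDDFFHHDHHIIIJJEHIJJJJJIIIJFIGGJJJFICGIGGGIIJIEIIIIJIJIIIIHIIIJIGGIJIIIJGHIEHJJJHHHHHHHFFF;B@CA;;@"


def decode_quality(qual, offset):
    """Convert FASTQ quality characters to integer scores for a given ASCII offset."""
    return [ord(ch) - offset for ch in qual]


sanger = decode_quality(QUAL, 33)    # Sanger encoding (offset 33)
illumina = decode_quality(QUAL, 64)  # Illumina 1.3+ encoding (offset 64)

print(sanger[0], min(sanger), max(sanger))        # 2 2 41
print(illumina[0], min(illumina), max(illumina))  # -29 -29 10
```

With offset 33 every score falls in a plausible 2 to 41 range, while offset 64 turns the leading "#" into the -29 reported in the error message.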

ADD COMMENT • link 13.7 years ago by toni ★ 2.2k

0

Thank you for your comment. As far as I know, the latest fastx-toolkit can basically read both FASTQ types, Sanger and Solexa (please refer to the update news on the fastx-toolkit homepage). I also checked the manual for fastx_quality_stats and couldn't find any option for this problem.

ADD REPLY • link 13.7 years ago by toshnam ▴ 650

score 2 · Answer 3 · 2011-10-28

2

13.7 years ago

pmenzel ▴ 310

Yes, fastx toolkit doesn't work with the quality scores of some versions of the Illumina software.

ADD COMMENT • link 13.7 years ago by pmenzel ▴ 310

4

fastx toolkit can use other quality score encodings. It isn't documented, but with e.g. -Q33 one can use Sanger-encoded data.

ADD REPLY • link 13.7 years ago by Jan Van Haarst ▴ 300

2

Check out FastQC, which is good and guesses the encoding internally: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
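As a rough illustration of how an encoding can be guessed from the data alone, here is a minimal Python sketch. It is a simple heuristic written for this thread, not FastQC's actual implementation, and `guess_offset` is a hypothetical helper name. It relies on the fact that offset-64 files cannot contain characters below ";" (ASCII 59).

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the lowest quality character seen.

    Heuristic only: a character below ';' (ASCII 59) cannot occur in an
    offset-64 file, so it implies offset 33. Anything else is reported as 64,
    which would misclassify an offset-33 file made up solely of high scores.
    """
    lowest = min((ord(ch) for q in quality_strings for ch in q), default=None)
    if lowest is None:
        raise ValueError("no quality characters supplied")
    return 33 if lowest < 59 else 64


# The read from the question starts with '#' (ASCII 35), so offset 33 is inferred.
print(guess_offset(["#1=DDDFFHHDHHIIIJJ"]))  # 33
```

The more quality lines are scanned, the more reliable the guess becomes, since a single high-quality read may not contain any low characters.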

ADD REPLY • link 13.7 years ago by toni ★ 2.2k

2

FastQC is very popular: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

ADD REPLY • link 13.7 years ago by User 59 13k

0

Really? Can you recommend any alternative free program to check sequencing quality of my fastq?

ADD REPLY • link 13.7 years ago by toshnam ▴ 650

0

Thanks, tony and daniel. FastQC is working well with my FASTQ file.

ADD REPLY • link 13.7 years ago by toshnam ▴ 650

0

Thanks, Jan. I confirmed that the "-Q33" option works well with my FASTQ file.

ADD REPLY • link 13.7 years ago by toshnam ▴ 650

0

+1 for fastqc, love it.

ADD REPLY • link 13.7 years ago by pmenzel ▴ 310

0

Thanks Jan, I didn't know that either.

ADD REPLY • link 13.7 years ago by pmenzel ▴ 310

0

I also like SolexaQA a lot: http://solexaqa.sourceforge.net/

ADD REPLY • link 13.7 years ago by Marina Manrique ★ 1.3k

0

Note to commenters: Try to avoid using the comments as a place to answer the question. In this case the answer is what Jan Van Haarst mentions: one needs to pass the option -Q33 to the tool. Comments are for asking for clarification.

ADD REPLY • link 13.7 years ago by Istvan Albert 102k