How to find Q20, Q30, Q40, ... of a FASTQ file?
3
2
8.5 years ago
ahmad.iut ▴ 90

Hello there, I want to determine the percentage of data above Q20, Q30, and Q40 (%>Q20, etc.) for some FASTQ files produced by an Illumina HiSeq 2000. Do you have any ideas?

Thanks

sequencing RNA-Seq • 13k views
ADD COMMENT
2

Try @Matt Shirley's fastqp tool (https://github.com/mdshw5/fastqp).

ADD REPLY
1

+1 for fastqp! :) This tool looks awesome!

ADD REPLY
1

I wish @Matt would convert the output (or at least make an option available) to create a standalone HTML file like FastQC does.

ADD REPLY
0

Haha, you just described the reason I made my QC tool SeQC :P (which is for BAM files, so not really useful here). FastQC frustrates me because getting the data out means parsing some HTML. I personally prefer tools like Matt's because the plotting is left up to me. However, some basic plotting functionality should be built in to speed things up. Once I finish my PhD, I'll finish up SeQC and hopefully import Matt's stats as modules so we get the best of both worlds. It's taken me so long to write SeQC that the name has now been taken by 2 or 3 other projects, so perhaps I'll have to rename it to SlowQC or something.

ADD REPLY
1

Hate to hijack this thread, but since we are on the topic: in a test I ran, @Matt's program just produced a bunch of PNG files, so that is the default output. I have not gone back to explore other options yet. It also does not seem to pick up sample names from the FASTQ files (as FastQC does) to label the output zip folder automatically.

ADD REPLY
0

FastQC provides the best of both worlds, which would be nice to get from fastqp.
For those who want the portability, there is the single-file HTML report. For users like you, there is a zip archive with all the underlying data (as text) for re-plotting or whatever else.

ADD REPLY
0

Oh awesome! :) I knew there was a zip file, but that used to extract to a directory for the browser to navigate. I pleaded with (I think) Simon to make the HTML a single file so it would work better with a web-based logging program I was working on (which renames all files to their MD5 sum and doesn't support directories). After that change took place, I just assumed the zip file contained only the new single HTML file (but compressed). I never looked, hehe - time to check it out! Thanks :)

ADD REPLY
0

Tagging @Matt so he sees this thread: Matt Shirley

ADD REPLY
3
8.5 years ago

Use FastQC. It is a very popular program, easily found with a single search. It will give you the mean, median, and quartiles of the quality scores at every position.

ADD COMMENT
1

Thanks Antonio, but I want the %>Q30 across all reads and positions. The sequencing company sent me a report that says Q20 = 98%, and I want to recalculate it myself, and also calculate Q30 and Q40. By the way, I want to bring the Q20 of my data close to 100%. Any ideas?

ADD REPLY
2

Sum the data from FastQC.

ADD REPLY
1

I wonder why you want to make all your data Q20. A quality higher than Q20 is even better.

Did you sequence using Ion Torrent?

ADD REPLY
1

I just want to evaluate all reads with one score. I sequenced on an Illumina HiSeq 2000. I am interested in knowing what percentage of my data scores above Q20, what percentage scores above Q30, and so on. I have run FastQC, but I am curious to find the Q20, ... of my data. Is there any tool or script?

ADD REPLY
0

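Open the zip file you obtained when you ran FastQC; you should find all the raw data for the plots you saw in the HTML. Open fastqc_data.txt and locate the data for the plot "Per sequence quality scores". Sum the count values from the rows based on your needs. For example, the counts in rows with quality 30 or higher, divided by the total count, give the fraction of reads whose mean quality is at least Q30.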
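
The summing step could be sketched roughly as below. This is only an illustration, assuming the usual fastqc_data.txt layout (a `>>Per sequence quality scores` header, tab-separated `#Quality` / `Count` rows, then `>>END_MODULE`); the function names and the `SAMPLE` text are made up for the example. Note that this module reports the mean quality per read, so the result is the share of reads, not the share of bases.

```python
def per_sequence_quality_counts(fastqc_text):
    """Return {mean_quality: read_count} from the
    'Per sequence quality scores' module of fastqc_data.txt."""
    counts = {}
    in_module = False
    for line in fastqc_text.splitlines():
        if line.startswith(">>Per sequence quality scores"):
            in_module = True
        elif line.startswith(">>END_MODULE"):
            in_module = False
        elif in_module and line.strip() and not line.startswith("#"):
            quality, count = line.split("\t")[:2]
            counts[int(float(quality))] = float(count)
    return counts


def percent_reads_at_least(counts, threshold):
    """Percentage of reads whose mean quality is >= threshold."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    passing = sum(n for q, n in counts.items() if q >= threshold)
    return 100.0 * passing / total


# Made-up excerpt in the assumed layout, for illustration only.
SAMPLE = (
    ">>Per sequence quality scores\tpass\n"
    "#Quality\tCount\n"
    "15\t10.0\n"
    "25\t30.0\n"
    "35\t60.0\n"
    ">>END_MODULE\n"
)

counts = per_sequence_quality_counts(SAMPLE)
for q in (20, 30, 40):
    print(f"mean quality >= Q{q}: {percent_reads_at_least(counts, q):.1f}% of reads")
```

For a real report, read the text with `open("fastqc_data.txt").read()` after unzipping and pass it to `per_sequence_quality_counts`.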

ADD REPLY
1
8.5 years ago
Picasa ▴ 650

Did you try Biopython?

ADD COMMENT
1

Hi Picasa, no, I haven't tried it. Is there a script for that?

ADD REPLY
0

You have to write your own script, but it's not difficult, since this package makes it easy to manipulate FASTQ files.

ADD REPLY
0

If it's really not difficult to write this script, then you should post it.

If you can't post it, then it can't be so easy.

ADD REPLY
0

Will you pay me for doing your job? If yes, I'll post it ...

ADD REPLY
2

Neither of you is forced or paid to help other people on the internet. You are both doing it out of the kindness of your hearts. Please don't forget that many here think you're both awesome, whether you have the time to help right now or not, and the real enemy is the data.
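
For anyone landing here later, a per-base version of the calculation might look something like the sketch below. It is not the script discussed above: it uses only the standard library rather than Biopython, and it assumes four-line FASTQ records and Phred+33 quality encoding (older Illumina pipelines used Phred+64, so check your files and adjust `offset`). The function names and the toy records are hypothetical. It counts bases with quality greater than or equal to the threshold, which is how Q20/Q30 percentages are usually reported.

```python
import gzip
from collections import Counter


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def quality_histogram(lines, offset=33):
    """Count bases per Phred score, reading every 4th line as qualities.

    Assumes unwrapped four-line records and the given ASCII offset.
    """
    hist = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 3:  # quality line of each record
            hist.update(ord(ch) - offset for ch in line.rstrip("\r\n"))
    return hist


def percent_bases_at_least(hist, threshold):
    """Percentage of bases with Phred quality >= threshold."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    return 100.0 * sum(n for q, n in hist.items() if q >= threshold) / total


# Toy record: '#', '5', '?', 'I' decode to Q2, Q20, Q30, Q40 under Phred+33.
toy = ["@read1\n", "ACGT\n", "+\n", "#5?I\n"]
hist = quality_histogram(toy)
for q in (20, 30, 40):
    print(f">=Q{q}: {percent_bases_at_least(hist, q):.1f}% of bases")
```

On a real file, something like `with open_fastq("reads.fastq.gz") as fh: hist = quality_histogram(fh)` should work. With Biopython, the parsing part could instead use `SeqIO.parse(handle, "fastq")` and `record.letter_annotations["phred_quality"]`.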

ADD REPLY
1
8.5 years ago

One should be careful about an "average quality score" for an entire file, since a subset of "bad" sequences can be hidden as outliers in an otherwise "good" file. Take a look at this WEB PAGE to understand.

If you are still interested, the "Compute quality statistics" function in Galaxy (e.g. at https://main.g2.bx.psu.edu) will do it, and maybe one of the utilities provided by the fastx-toolkit will too.

ADD COMMENT
1

Thanks again, Antonio. You are right, and I agree that an average quality score is not good enough. I have run FastQC on my data and have the quality for each position. Everything is OK with that. I wanted an overall evaluation of my data out of curiosity. Finally, I found some tools for this purpose. "FaQCs" is one of them, and it works fine.

ADD REPLY
