7 months ago
Aki
I ran fastp on published FASTQ files of single-end RNA-seq data and got 99.9999% Q20 bases and 99.9999% Q30 bases. I have never seen scores this high. I am a beginner in bioinformatics, so I don't know whether this is normal. Could you give me any suggestions?
Detecting adapter sequence for read1...
No adapter detected for read1
Read1 before filtering:
total reads: 47471798
total bases: 4747179800
Q20 bases: 4747174600(99.9999%)
Q30 bases: 4747174600(99.9999%)
Read1 after filtering:
total reads: 47471746
total bases: 4557287616
Q20 bases: 4557287616(100%)
Q30 bases: 4557287616(100%)
Filtering result:
reads passed filter: 47471746
reads failed due to low quality: 52
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 0
bases trimmed due to adapters: 0
Duplication rate (may be overestimated since this is SE data): 60.5205%
JSON report: ./report/SRR23031659_fastp.json
HTML report: ./report/SRR23031659_fastp.html
fastp -i ./SRR23031659.fastq.gz -3 -o out_SRR23031659.fq.gz --html ./report/SRR23031659_fastp.html -j ./report/SRR23031659_fastp.json -q 15 -n 10 -t 1 -T 1 -l 20
fastp v0.23.4, time used: 96 seconds
Thanks in advance.
Q20 and Q30 are Phred quality thresholds. A base with a Phred score of 20 or higher (Q20) has at most a 1 in 100 chance of being called incorrectly, i.e. at least 99% accuracy. A score of 30 or higher (Q30) means at most a 1 in 1,000 chance of error, i.e. at least 99.9% accuracy. Please check out this blog.
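To make the Q20/Q30 relationship concrete, here is a minimal sketch of the standard Phred conversion, P = 10^(-Q/10). The function names are made up for illustration, and the ASCII offset of 33 is an assumption (Phred+33 encoding, which is what most current FASTQ files use).

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def ascii_to_phred(ch, offset=33):
    """Decode one FASTQ quality character to its Phred score.

    offset=33 assumes Phred+33 encoding; this is an assumption, not
    something stated in the post.
    """
    return ord(ch) - offset


if __name__ == "__main__":
    for q in (20, 30):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
    print("'I' decodes to Q", ascii_to_phred("I"))
```

So "99.9999% Q30 bases" means that almost every base in the file carries a reported error probability of 0.1% or lower.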
Is this from an AVITI sequencer? They do have quite high quality scores.
Thanks jkim. They seem to use MGISEQ-2000RS (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6925047). Do you have any information on this model?
I have no idea. Good luck!
Thank you!
Some companies may replace the Q values with a few fixed values to save storage. Do you know whether they did something like that here? This is just my guess.
For sure Illumina does this. They just upped it from 22 to 25 for certain calls, etc. They base it on aggregated data and then update the priors.
The bottom line is that if compression is a concern, they will lump together things in the 20s as 22 or 25 or whatever the closest fit is, that kind of thing.
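One way to check whether the quality values in a file have been fixed or binned is to tally the distinct Phred scores that actually appear. Below is a rough sketch, assuming Phred+33 encoding and plain 4-line FASTQ records (no wrapped sequence lines); `quality_histogram` and `quality_histogram_from_file` are hypothetical helper names, not part of fastp. The log in the question already hints at this: the Q20 and Q30 base counts are identical, and the 5,200 bases below Q20 match exactly 52 failed reads of 100 bases each.

```python
import gzip
from collections import Counter


def quality_histogram(lines, offset=33, max_reads=100000):
    """Count Phred scores over the quality lines of 4-line FASTQ records.

    Assumes Phred+33 (offset=33) and unwrapped records; both are assumptions.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i // 4 >= max_reads:
            break
        if i % 4 == 3:  # every 4th line of a record holds the quality string
            counts.update(ord(c) - offset for c in line.rstrip("\n"))
    return counts


def quality_histogram_from_file(path, **kwargs):
    """Open a plain or gzipped FASTQ file and tally its quality scores."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return quality_histogram(handle, **kwargs)


if __name__ == "__main__":
    # Tiny in-memory example instead of a real file.
    demo = ["@r1\n", "ACGT\n", "+\n", "IIII\n",
            "@r2\n", "ACGT\n", "+\n", "II#I\n"]
    for score, n in sorted(quality_histogram(demo).items()):
        print(f"Q{score}: {n} bases")
```

On the real data this could be called as `quality_histogram_from_file("./SRR23031659.fastq.gz")`. If only one or two distinct scores come back, the qualities were most likely fixed or binned upstream, and the near-100% Q30 figure says little about the true per-base accuracy.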
Regarding 3rd gen, Nanopore too reports estimated quality scores in place of empirical ones in certain cases (though recent comparisons have justified the estimates), which implies similar practices, though I can't comment specifically on the most recent practices (they are changing fast). I don't know enough about PacBio to say.
It means you are good at bioinformatics. Keep doing things.