Question

FASTQ files with very high per-base sequencing quality scores

0

8.4 years ago

Ivan S • 0

Hello,

I am currently working with FASTQ files from exome sequencing at 150x coverage. After running FastQC on these files, I see quite high quality scores (~35 on average) with a very narrow distribution across all positions. This seems a little suspicious to me. Since I have very little experience with this type of data, I'd like to ask: is it normal to observe such high quality scores?

Thank you for your help

sequencing Fastq Quality scores • 3.5k views

written 8.4 years ago by Ivan S • 0 • updated 5.0 years ago by Biostar 20

1

That is normal. You can even see that in the example FastQC report: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

written 8.4 years ago by igor • 13k

0

Thanks a lot. I hadn't noticed the same tendency in the example report.

written 8.4 years ago by Ivan S • 0

1

You can analyze the quality scores empirically if you want, via mapping; BBMap has several options for that:

bbmap.sh ref=hg19.fa in=reads.fq.gz mhist=mhist.txt qahist=qahist.txt qhist=qhist.txt

mhist generates a histogram of matches and mismatches by base position; qhist gives claimed and measured quality per position; and qahist gives the quality-score accuracy (claimed versus observed). Sometimes the quality scores are quite accurate, sometimes not; it depends on a lot of factors including luck. But if you suspect they are wrong, it's nice to validate that.

Note that humans, being diploid with a roughly 1/1000 SNP rate, have a noise floor of around 30dB for these testing methods - they work better on haploids. But they will still be fairly accurate up to Q30.
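To make the "30 dB" figure concrete, here is a minimal sketch of the standard Phred relationship, Q = -10 * log10(p). The function names are only illustrative and are not part of BBMap or FastQC. Under that relationship, a 1/1000 SNP rate looks like a Q30 error rate to a mapping-based check, which is why measured quality cannot be trusted much above Q30 on a diploid human sample.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Per-base error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score implied by a per-base error probability."""
    return -10 * math.log10(p)


# Assumed figure from the post above: roughly 1 SNP per 1000 bases.
snp_rate = 1 / 1000

# True variants look like sequencing errors to a mapping-based check,
# so they cap the measurable quality at about this value.
noise_floor_q = error_prob_to_phred(snp_rate)

# The average quality reported in the question (~Q35) implies an
# error rate of roughly 3 in 10,000 bases.
claimed_error = phred_to_error_prob(35)

if __name__ == "__main__":
    print(f"noise floor: about Q{noise_floor_q:.0f}")
    print(f"Q35 error probability: {claimed_error:.2e}")
```

So a claimed Q35 is below the rate at which real variants appear as mismatches, and the empirical histograms can only confirm it up to about Q30.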

written 8.4 years ago by Brian Bushnell • 20k

0

That is not surprising, if the libraries are of good quality (and the read length is not > 150).

written 8.4 years ago by GenoMax • 147k

0

Suspicious data you say? ಠ_ರೃ

Can we see the clues too?

written 8.4 years ago by John • 13k

0

It depends on which sequencing technology you used. If your data is from an Illumina HiSeq, I would say the quality is as expected. But if your data is from Nanopore, I would also find it suspicious.

written 8.4 years ago by piet ★ 1.9k

score 1 · Accepted Answer · 2016-07-12

1

8.4 years ago

Brice Sarver ★ 3.8k

Data I've analyzed from current sequencing platforms usually have excellent per-base quality scores. Though not always the case, I see larger 'dips' in quality scores at the beginning and end positions much less frequently than back in the earlier Illumina/454 days. You probably just have good data!
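If you want to look for those dips yourself rather than rely on the FastQC plot, the per-position mean can be sketched in a few lines. This is a rough illustration, not how FastQC computes its plot: it assumes Phred+33 encoding and unwrapped four-line FASTQ records, and the function names and the file name in the usage note are hypothetical.

```python
import gzip
from typing import Iterable, Iterator, List


def per_position_mean_quality(
    quality_strings: Iterable[str], offset: int = 33
) -> List[float]:
    """Mean Phred score at each read position (assumes Phred+33 by default)."""
    totals: List[int] = []
    counts: List[int] = []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                # First read that reaches this position.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def fastq_quality_lines(path: str) -> Iterator[str]:
    """Yield the quality line of each record (assumes 4 lines per record)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 3:
                yield line.rstrip("\r\n")


if __name__ == "__main__":
    # Toy input: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
    toy_means = per_position_mean_quality(["IIII", "II5"])
    for position, mean_q in enumerate(toy_means, start=1):
        print(position, mean_q)
```

For a real file you would pass the generator in, for example `per_position_mean_quality(fastq_quality_lines("reads.fq.gz"))`, and check whether the first and last positions sit noticeably below the rest.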
