I've stumbled upon several shotgun studies where the quality score of every base is ? (so 63 in phred+33). The first time, I thought the files might have been tampered with, but I have now downloaded samples from 7 studies and 5 of them look like this, which makes me wonder.
I thought it might be a certain type of quality binning, but I can't seem to find any binning that goes that high. Here are some example studies where this happened (I haven't checked all sequences, but every sequence I've checked in those studies is all ???).
SRP403740
SRP423365
SRP424298
SRP425931
SRP434700
SRP410115
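To check every read rather than a sample, one option is to tally the quality characters across a whole FASTQ file. This is only a minimal sketch: it assumes a plain four-line-per-record FASTQ, the function name `tally_quality_chars` is made up for illustration, and the records are toy data rather than reads from the studies above.

```python
from collections import Counter
import io

def tally_quality_chars(handle):
    """Count every quality character in a four-line-per-record FASTQ stream."""
    counts = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 3:  # the 4th line of each record holds the quality string
            counts.update(line.rstrip("\n"))
    return counts

# Toy records standing in for a real file (made-up data, not from SRA).
toy = io.StringIO("@r1\nACGT\n+\n????\n@r2\nGGCA\n+\n????\n")
counts = tally_quality_chars(toy)
print(dict(counts))
```

A single key in the result means every base in the file carries the same quality character.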
I see no apparent pattern: different sequencer models (HiSeq and NovaSeq), and all these studies seem independent.
Does anyone have an explanation for this?
Maybe they applied a Q30 filter before uploading the data?
Even so, is it even possible for the sequencer to have an error rate of less than 1 in a million for literally 100% of the bases?
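For reference, the error rate implied by a Phred score follows from the standard definition Q = -10 * log10(p). A minimal sketch of that conversion (the function name `phred_to_error_prob` is just illustrative):

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the implied per-base error probability."""
    return 10 ** (-q / 10)

# Compare the two scores being discussed in this thread.
for q in (30, 63):
    print(f"Q{q}: p = {phred_to_error_prob(q):.2e}")
```

Under this definition Q30 corresponds to 1 error in 1,000 bases, while Q63 would be roughly 1 error in 2 million.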
I mean, maybe they dropped the reads with quality < Q30.
In this case that would have been a <Q63 filter; that doesn't really make sense, no? Is it really possible to get 30M reads with quality >63 out of a MiSeq?
And again, it's 100% of bases with the exact same quality score, and this has happened over several independent studies.
I mean, the FASTQ format adds 33 to the quality score to encode it as a character, and filter tools will also add 33 to the cutoff before filtering.
Here is an example:
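A minimal sketch of that offset arithmetic, assuming standard Phred+33 encoding; the helper names `decode_quality` and `encode_threshold` are illustrative and not taken from any particular filtering tool:

```python
PHRED_OFFSET = 33  # Sanger / Illumina 1.8+ encoding

def decode_quality(qual_string, offset=PHRED_OFFSET):
    """Turn a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in qual_string]

def encode_threshold(q, offset=PHRED_OFFSET):
    """Return the FASTQ character that represents a Phred score cutoff."""
    return chr(q + offset)

print(decode_quality("???"))   # [30, 30, 30]
print(encode_threshold(30))    # ?
```

So the character ? has ASCII code 63, which is Q30 once the offset of 33 is subtracted.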
I looked at the data from the example you posted and I see normal scores:
How did you retrieve the data?