QUESTION When generating per-position quality score graphs, is it possible to suppress sub-sampling and report on the entire FASTQ input file?
BACKGROUND INFO I ask because, from what I have observed, both FastQC and FASTX-Toolkit appear to sub-sample the input file, presumably to keep run time down.
My conclusion is based on combining these 3 observations:
Observation 1. I ran EA-Utils' fastq-stats on the input, which reports a minimum Q score of 5, as shown in its STDOUT below:
fastq-stats SRR_BBsplit_Sm1021_Scrubbed.fq.gz
reads 11269161
len 100
len mean 94.7008
len stdev 10.8264
len min 50
phred 33
window-size 2000000
cycle-max 35
dups 1999313
%dup 17.7415
unique-dup seq 205300
min dup count 2
dup seq 1 2660 CTTTTTTGCACACTGAGATCATTAAAGGACCTCAT
dup seq 2 1897 CTTAAATTAGGTGTTATAAATTTGAAGTTAAGGTG
dup seq 3 1049 CACAAGTCTACATACTTAAATTAGGTGTTATAAAT
dup seq 4 1007 CTTGGTTCTCCTCCACAACAACAGCCTTGTTGGGT
dup seq 5 835 CTACAAGTCACCTCCTCCTCCAACACCAGTTTACA
dup seq 6 756 CTTGTATACAGGTGATGGTGGAGGAGGTGACTTGT
dup seq 7 750 CTACAATTCACCACCTCCTCCAACACCAGTTTACA
dup seq 8 723 CTCATCTCAATGAACATAACATAACATAACAAAGA
dup seq 9 717 CTTGTACACGTAAGTTGGTGATGGTGGAGGTGGTG
dup seq 10 691 CTCTGCTTCAAGAGGCATATGATGCACTTCATTTG
dup mean 10.7385
dup stddev 18.9257
qual min 5
qual max 41
qual mean 37.5495
qual stdev 3.7967
%A 29.1245
%C 22.2476
%G 19.4687
%T 29.1593
%N 0.0000
total bases 1067198823
Observation 2. I ran FASTX-Toolkit on the input using instructions at http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
zcat SRR.fastq.gz | fastx_quality_stats -o SRR.fastx_stats
fastq_quality_boxplot_graph.sh -i SRR.fastx_stats -o SRR.fastx_stats.png -t "Test"
FASTX-Toolkit results image shown below
Observation 3. Finally I ran FastQC on the input (pretty standard)
fastqc SRR_BBsplit_Sm1021_Scrubbed.fq.gz
FASTQC results image shown below
Thank you! Stay sane! Stay safe!
PS. I do understand that sampling a large subset of reads (a few million) from a much larger file (tens or hundreds of millions of reads) is statistically very acceptable. I am not arguing against this established and valid convention. I'd simply like to know how my per-position quality score graph would look if I did not sub-sample at all and looked at the entire file instead. Is this currently possible with any off-the-shelf bioinformatics tool?
You used three different methods above, so they likely sampled your data in different ways. The results look about the same. So what do you think will happen if no sampling occurs, assuming the original file has not been sorted, deduplicated, trimmed, or otherwise changed?
Genomax, thanks for replying. Yes, I suspect each tool might perform its sampling differently, and even random sub-sampling by the same tool might return slightly different results depending on the "seed" used for randomness, etc.
In any case, the text data from EA-Utils fastq-stats gives only aggregated min, max, stdev and mean - across all positions, not per position.
Furthermore, it does not return the interquartile range, quantiles, or quartiles needed for the sort of graph FastQC or FASTX-Toolkit produces. So it's hard to predict exactly how the plot will change on a per-position basis.
In the box-and-whisker plot, if all data are included, I suppose the whiskers will extend out farther. The question is how much farther for EACH position! I am assuming it will NOT be Q=5 at every position... maybe just at the 3' end?! Rather than assume, I want to see.
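One way to get those missing per-position numbers for the whole file is to tally every quality character at every read position and read the quartiles off the tallies. Below is a minimal sketch in Python, assuming a plain four-lines-per-record FASTQ and Phred+33 encoding (which is what fastq-stats reports above). The function names are made up for illustration, and this is not a description of what FastQC or FASTX-Toolkit do internally.

```python
import gzip
from collections import Counter


def per_position_counts(handle, offset=33):
    """Tally Q scores per read position over every record in a FASTQ stream."""
    counts = []  # counts[i] maps Q score -> number of reads with that score at position i
    for line_no, line in enumerate(handle):
        if line_no % 4 != 3:  # assumption: the quality string is the 4th line of each record
            continue
        qual = line.rstrip("\n")
        while len(counts) < len(qual):
            counts.append(Counter())
        for i, ch in enumerate(qual):
            counts[i][ord(ch) - offset] += 1
    return counts


def quantile(counter, frac):
    """Smallest Q score whose cumulative count reaches frac of the reads."""
    target = frac * sum(counter.values())
    running = 0
    for q in sorted(counter):
        running += counter[q]
        if running >= target:
            return q
    return max(counter)


def summarize(counts):
    """Return (position, min, Q1, median, Q3, max) per position, 1-based."""
    return [
        (i + 1, min(c), quantile(c, 0.25), quantile(c, 0.5), quantile(c, 0.75), max(c))
        for i, c in enumerate(counts)
    ]


def summarize_file(path):
    """Summarize a whole FASTQ file, gzip-compressed or not, with no sub-sampling."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return summarize(per_position_counts(fh))
```

Calling `summarize_file("SRR_BBsplit_Sm1021_Scrubbed.fq.gz")` would return one row per position. Memory use stays small because only the tallies are kept, although pure Python will be slow on 11 million reads.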
I predict:
So it'd be nice to visualize a density plot of Q scores for each position. Is this doable by parsing the output of some pre-existing tool?
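If no existing tool exposes this, the density can be computed directly from the file. The sketch below is a rough illustration, again assuming four-line FASTQ records and Phred+33 encoding. For each position it returns the fraction of reads carrying each Q score, which is the table a heat map or density plot would be drawn from. The helper names `quality_density` and `fastq_quality_lines` are hypothetical.

```python
from collections import defaultdict


def fastq_quality_lines(handle):
    """Yield the 4th line of each record from a 4-line-per-record FASTQ stream."""
    for line_no, line in enumerate(handle):
        if line_no % 4 == 3:
            yield line.rstrip("\n")


def quality_density(quality_strings, offset=33):
    """Return {position: {Q score: fraction of reads}} using every string given."""
    counts = defaultdict(lambda: defaultdict(int))
    for qual in quality_strings:
        for pos, ch in enumerate(qual, start=1):
            counts[pos][ord(ch) - offset] += 1
    density = {}
    for pos, by_q in counts.items():
        depth = sum(by_q.values())  # reads long enough to cover this position
        density[pos] = {q: n / depth for q, n in sorted(by_q.items())}
    return density
```

With the file opened in text mode (for example via `gzip.open(path, "rt")`), `quality_density(fastq_quality_lines(fh))` covers every read. Each position is normalized by its own depth, so shorter reads (length minimum 50 here) do not distort the later cycles.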