I've been working on quality control of NGS data (Whole Genome Bisulfite Sequencing data of IMR90 stem cells, to be precise) using the FASTX Toolkit. I am struggling to find the optimal parameters for the fastq_quality_filter command, which is part of my QC pipeline and is documented as follows:
$ fastq_quality_filter -h
usage: fastq_quality_filter [-h] [-v] [-q N] [-p N] [-z] [-i INFILE] [-o OUTFILE]
version 0.0.6
[-h] = This helpful help screen.
[-q N] = Minimum quality score to keep.
[-p N] = Minimum percent of bases that must have [-q] quality.
[-z] = Compress output with GZIP.
[-i INFILE] = FASTA/Q input file. default is STDIN.
[-o OUTFILE] = FASTA/Q output file. default is STDOUT.
[-v] = Verbose - report number of sequences.
If [-o] is specified, report will be printed to STDOUT.
If [-o] is not specified (and output goes to STDOUT),
report will be printed to STDERR.
Any clues on how to determine, given quality statistics, the optimal values for -p and -q?
Thank you!
Can you comment on what the gold standard is? For example, should all reads be required to have a Phred score above 32?
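To make the interplay of -q and -p concrete, here is a minimal Python sketch based only on my reading of the help text above: a read is kept when at least -p percent of its bases have quality at least -q. It is not the toolkit's actual code. The function names `passes_filter` and `retention_table` are hypothetical, the Phred+33 offset is an assumption (older Illumina data uses +64), and the real tool may round the percentage differently.

```python
from itertools import product


def passes_filter(quality_string, min_quality, min_percent, offset=33):
    """Return True if at least min_percent of bases have quality >= min_quality.

    offset=33 assumes Phred+33 (Sanger) encoding; this is an assumption.
    """
    if not quality_string:
        return False  # nothing to evaluate on an empty read
    scores = [ord(ch) - offset for ch in quality_string]
    good = sum(1 for s in scores if s >= min_quality)
    return 100.0 * good / len(scores) >= min_percent


def retention_table(quality_strings, q_values, p_values, offset=33):
    """Map each (q, p) pair to the fraction of reads that would be kept."""
    total = len(quality_strings)
    table = {}
    for q, p in product(q_values, p_values):
        kept = sum(passes_filter(qs, q, p, offset) for qs in quality_strings)
        table[(q, p)] = kept / total if total else 0.0
    return table


if __name__ == "__main__":
    # Toy quality strings (hypothetical data): 'I' is Q40 and '#' is Q2 under Phred+33.
    reads = ["IIIII", "IIII#", "III##", "#####"]
    table = retention_table(reads, [20, 30], [50, 80, 100])
    for (q, p), frac in sorted(table.items()):
        print(f"-q {q} -p {p}: keeps {frac:.0%} of reads")
```

Running something like this over the quality lines of a sample of your reads shows how many reads each (-q, -p) combination would discard, which is one way to pick a setting from the data rather than from a fixed rule.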