I'm writing a pipeline and I want to implement a check on the quality of the reads.
Using FastQC, I got the basic statistics (mean, quartiles, etc.) from the "Per base sequence quality" module, and now I want to check whether the quality of this sample's reads is sufficient to continue with the analysis.
Since the pipeline is going to analyze batches of samples at a time, I thought about building a model using every sample except the one I want to test. Is this a reasonable way to assess data quality, or am I completely wrong? How could I build such a model?
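To make the idea a bit more concrete, something like the minimal sketch below is what I had in mind. The per-sample score (here just the mean of the per-base mean qualities) and the median/MAD cutoff are only placeholders:

```python
import numpy as np

def is_low_quality_outlier(sample_score, other_scores, cutoff=3.5):
    """Leave-one-out check: compare one sample's quality score against a
    robust model (median / MAD) built from all the other samples in the batch."""
    other = np.asarray(other_scores, dtype=float)
    med = np.median(other)
    mad = np.median(np.abs(other - med))
    if mad == 0:
        # all other samples are identical; fall back to a simple distance check
        return abs(sample_score - med) > 1.0
    robust_z = 0.6745 * (sample_score - med) / mad  # approximate z-score from the MAD
    return robust_z < -cutoff  # only flag samples whose quality is much LOWER than the rest

# Made-up mean per-base qualities for a batch of samples
scores = {"S1": 35.2, "S2": 34.8, "S3": 35.0, "S4": 22.1, "S5": 34.6}
for name, score in scores.items():
    others = [v for k, v in scores.items() if k != name]
    print(name, "outlier" if is_low_quality_outlier(score, others) else "ok")
```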
Just my two cents, but wouldn't it be more practical to simply aggregate these reports with MultiQC and then look at them manually? Automation is of course handy, but in my experience a sanity check with the human eye on the summary statistics often helps much more than a machine-produced verdict when judging the quality of a sample. A solid model would need careful validation to be sure it catches odd samples and outliers, which is fairly trivial when checking by eye.
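For reference, the aggregation step itself is quick; here is a minimal sketch (the directory names are assumptions, and the exact columns of the general-stats table depend on which tools MultiQC finds):

```python
import subprocess
import pandas as pd

# Run MultiQC over a directory containing the FastQC outputs (paths are assumptions)
subprocess.run(["multiqc", "fastqc_results/", "-o", "multiqc_out"], check=True)

# MultiQC writes a parseable copy of its general-statistics table next to the HTML report
stats = pd.read_csv("multiqc_out/multiqc_data/multiqc_general_stats.txt", sep="\t")

# Quick per-sample overview to eyeball by hand, sorted by the first numeric column
numeric_cols = stats.select_dtypes("number").columns
print(stats.sort_values(numeric_cols[0]).to_string(index=False))
```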
Also, the quality of a sample is more complex than what the FastQC report captures. It should also involve the mapping percentage and more assay-specific characteristics such as the signal-to-noise ratio. I do not think your approach will be enough to judge quality on its own. In fact, most of the time I do not even look at FastQC if the mapping percentage is good. Instead, e.g. for ChIP-seq, I check the samples visually in a genome browser, calculate FRiP scores (fraction of reads in peaks; see the sketch below) and look for signs of batch effects using PCA. Automation is all nice, but I have hit a wall with it a couple of times in the past, judging samples as good based on automated quality metrics and in the end rejecting them after checking everything more thoroughly, including by eye in a genome browser.
Still, I do not work in a clinical or core-facility setting where the sheer number of samples might make this manual checking impossible.
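For the FRiP score mentioned above, a rough back-of-the-envelope calculation can look like this (a minimal sketch assuming an indexed BAM and a 3-column BED file of peaks; reads spanning two peaks get counted twice, which is usually negligible):

```python
import pysam

def frip(bam_path, peaks_bed):
    """Rough FRiP: reads overlapping any peak / total mapped reads."""
    bam = pysam.AlignmentFile(bam_path, "rb")
    in_peaks = 0
    with open(peaks_bed) as bed:
        for line in bed:
            if not line.strip() or line.startswith(("#", "track")):
                continue
            chrom, start, end = line.split("\t")[:3]
            in_peaks += bam.count(chrom, int(start), int(end))
    total_mapped = bam.mapped  # taken from the BAM index
    bam.close()
    return in_peaks / total_mapped if total_mapped else 0.0

# File names are placeholders
print(f"FRiP: {frip('sample1.bam', 'sample1_peaks.bed'):.3f}")
```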
I work in a clinical environment, as you guessed, so manual checking is not really an option. The aim of this check is to find really bad samples and avoid wasting time analyzing them. Can I use a simple statistical test for that?
Maybe a minimum cutoff on the read count makes sense, and perhaps a cutoff on duplicated reads as well, since excessive duplication can indicate poor library complexity. Beyond that, we would need more details on what type of data you have.
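As a starting point, those two numbers can be pulled straight out of the fastqc_data.txt inside each FastQC zip; a minimal sketch (the thresholds are made up and the field names should be double-checked against your FastQC version):

```python
import zipfile

MIN_READS = 1_000_000   # made-up threshold, tune for your assay
MIN_DEDUP_PCT = 30.0    # flag if fewer than 30% of reads would survive deduplication

def check_fastqc_zip(zip_path):
    """Extract total read count and deduplication percentage from a FastQC zip."""
    total_reads, dedup_pct = None, None
    with zipfile.ZipFile(zip_path) as zf:
        data_file = next(n for n in zf.namelist() if n.endswith("fastqc_data.txt"))
        for line in zf.read(data_file).decode().splitlines():
            if line.startswith("Total Sequences"):
                total_reads = int(float(line.split("\t")[1]))
            elif line.startswith("#Total Deduplicated Percentage"):
                dedup_pct = float(line.split("\t")[1])
    ok = (total_reads is not None and total_reads >= MIN_READS
          and dedup_pct is not None and dedup_pct >= MIN_DEDUP_PCT)
    return ok, total_reads, dedup_pct

print(check_fastqc_zip("sample1_fastqc.zip"))  # path is a placeholder
```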
At this step, I only have the FASTQ files from the sequencing run. I was hoping to use the FastQC reports to get an idea of the read quality.
If that is the case, is it wise to depend on a program/automated test to make the decision? I assume you just want the relevant samples flagged so you can manually examine the results and decide whether there was any other issue with the analysis.
Yes, just the "obviously" bad samples need to be stopped from continuing processing while maybe adding a flag for ambiguous samples would be best allowing for human supervision.