This is more of a proposal to collaborate on a new QC tool or, if you will, a description of a hypothetical tool that may not exist yet.
Background: FastQC is used by many as a QC tool for sequencing data, and it has its merits. However, its evaluation is based on theoretical assumptions about how sequencing data should behave, not on how they behave in reality. Also, since FastQC was first implemented, a lot has been learned about real sequencing data, and the technology has advanced massively. As an example, many questions on BioStars concern FastQC's analysis of base composition, which regularly indicates a QC failure when applied to RNA-seq data from random priming. The traffic light system used to summarize quality is suggestive, simplistic and misleading all at once, because it does not take into account how other data sets look.
I would like to propose a different approach instead, based on empirical data, similar to the quality rating of protein 3D structures in the PDB. Each data set would be analyzed and compared with data from comparable sequencing experiments, and each statistic would then be reported as a quantile relative to those other data sets. There is ample (possibly too much) data in SRA that could be used.
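To make the quantile idea concrete, here is a minimal sketch of ranking one statistic of a new sample against a reference distribution. The function name `percentile_rank`, the choice of mean Phred quality as the statistic, and the reference values are all assumptions made up for illustration; they are not real SRA data.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(reference, value):
    """Return the percentage of reference values below `value`.

    Ties are counted as half, so a value equal to the median of the
    reference lands at 50.
    """
    ref = sorted(reference)
    if not ref:
        raise ValueError("reference distribution is empty")
    below = bisect_left(ref, value)            # values strictly smaller
    ties = bisect_right(ref, value) - below    # values exactly equal
    return 100.0 * (below + 0.5 * ties) / len(ref)


# Made-up mean Phred scores from hypothetical "comparable" runs.
reference_mean_quality = [28.1, 30.4, 31.0, 32.5, 33.2,
                          34.0, 35.1, 35.8, 36.2, 36.9]

# Where does a new sample with mean quality 33.5 fall?
print(percentile_rank(reference_mean_quality, 33.5))  # 50.0
```

A report built this way would say "better than half of comparable runs" instead of showing a red or green light, which is the point of the proposal.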
Let me know your critique.
Brainstorming
- This needs to be a collaborative project, because of the compute requirements
- The project needs to include a "survey" phase, in which data is analyzed and summary stats are pre-computed.
- Need to agree on a set of summary stats to define quality
- Define an exchange format (e.g. in SQLite) for QC so that the data can be computed on different nodes and exchanged between them.
- Implement a distributed application that users can easily install (like SETI@home), which scans a certain range of the SRA address space and delivers the output back.
- Possibly a lot of stats are already available in SRA
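As a starting point for the exchange format, a rough sketch of what a SQLite-based store could look like, using Python's standard `sqlite3` module. The table name `qc_stat`, its columns, the helper functions and the placeholder run identifiers are hypothetical; the actual set of stats and the schema would still need to be agreed on.

```python
import sqlite3

# Hypothetical schema: one row per (run, statistic) pair.
SCHEMA = """
CREATE TABLE IF NOT EXISTS qc_stat (
    run_accession TEXT NOT NULL,
    stat_name     TEXT NOT NULL,
    stat_value    REAL NOT NULL,
    PRIMARY KEY (run_accession, stat_name)
);
"""


def open_store(path=":memory:"):
    """Open (or create) a QC store; the default is an in-memory database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_stats(conn, run_accession, stats):
    """Insert or overwrite the statistics computed for one run."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO qc_stat VALUES (?, ?, ?)",
            [(run_accession, name, float(value)) for name, value in stats.items()],
        )


def merge(target, source):
    """Copy all rows from one node's store into another; return the row count."""
    rows = source.execute(
        "SELECT run_accession, stat_name, stat_value FROM qc_stat"
    ).fetchall()
    with target:
        target.executemany("INSERT OR REPLACE INTO qc_stat VALUES (?, ?, ?)", rows)
    return len(rows)


# Two nodes compute stats independently, then one store is merged into the other.
node_a = open_store()
node_b = open_store()
add_stats(node_a, "RUN_A", {"mean_quality": 34.2, "gc_fraction": 0.48})
add_stats(node_b, "RUN_B", {"mean_quality": 29.7, "gc_fraction": 0.51})
merged_rows = merge(node_a, node_b)
total_rows = node_a.execute("SELECT COUNT(*) FROM qc_stat").fetchone()[0]
print(merged_rows, total_rows)  # 2 4
```

Keeping the table in a long (run, stat, value) layout means new statistics can be added later without a schema change, at the cost of storing everything as a single numeric value.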
Great idea! I've started hiding the FastQC results in a nicer MultiQC output so end users stop seeing the traffic light and don't then worry about it. This would be a nice next step.
Good idea. While it awaits implementation, it would be useful to modify FastQC so that an appropriate "limits" file can be selected depending on the type of sequencing at hand, so that the display of "warnings/failures" is adjusted accordingly. That would minimize accidents due to the traffic lights, which appear to trip up many new users.
Great. Possibly we can also use modified FastQC code for the final summary stats, but it needs to be "real fast" to summarize thousands of files per organism, and it should work on SRA files directly, ideally streaming them.
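To illustrate what "streaming" could mean here, a minimal single-pass sketch that reads FASTQ records from any text handle and keeps only running totals in memory. It is not FastQC code. It assumes four-line FASTQ records with Phred+33 qualities, and it assumes the SRA data has already been converted to a FASTQ text stream by some external tool; the function name and the returned keys are made up.

```python
import io
from itertools import islice


def stream_fastq_stats(handle, phred_offset=33):
    """Compute read count, mean base quality and GC fraction in one pass."""
    n_reads = n_bases = gc = qual_sum = 0
    while True:
        record = list(islice(handle, 4))  # header, sequence, '+', qualities
        if not record:
            break
        if len(record) < 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record")
        seq = record[1].strip().upper()
        qual = record[3].strip()
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        n_reads += 1
        n_bases += len(seq)
        gc += seq.count("G") + seq.count("C")
        qual_sum += sum(ord(c) - phred_offset for c in qual)
    return {
        "n_reads": n_reads,
        "mean_quality": qual_sum / n_bases if n_bases else None,
        "gc_fraction": gc / n_bases if n_bases else None,
    }


# Tiny in-memory example; in practice the handle could be sys.stdin.
example = io.StringIO("@r1\nGGCC\n+\nIIII\n@r2\nAATT\n+\n5555\n")
stats = stream_fastq_stats(example)
print(stats)  # {'n_reads': 2, 'mean_quality': 30.0, 'gc_fraction': 0.5}
```

Pure Python like this will not be "real fast" at SRA scale, so treat it as a description of the stats to compute rather than a candidate implementation.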
We should select data of known provenance from SRA. Biostar users could nominate their own datasets since they presumably know them well and are confident about their quality/utility. Others can cross-check and approve.
An important aspect of such a study would be to generate representative samples, not only the ones that are believed to be good. Samples also need to cover a wide range of technologies and species.
+1 from me. One thing I like about FastQC is that it works with minimal input: just give it a fastq file and that's it. This comes at the cost of simplistic output, of course. Still, I think it's important for a QC tool to require minimal configuration, or at least to have default settings where all you need is the input data you want to QC. This way the tool is easy and useful even when you just want a quick and dirty assessment of your data.