I am compiling a wish list of analyses that should be run for every data set coming out of a sequencing facility, regardless of whether the sequencing is for RNA-Seq, SNP calling, ChIP-Seq, or possibly de novo sequencing. The goal is to scan for potential red flags that might indicate something has gone awry, either in the lab or downstream. I want a list of "sanity checks" that covers both sequence quality analysis and what can be gleaned from alignments.
For example,
Sequence QA - basecalling bias, read quality, yield, throughput, GC bias, 5'/3' motifs?, restriction enzyme bias
Barcode distribution (if barcoded)
Alignment QA
- chromosome bias
- annotational biases (whether experimentally induced or not): genes, repeats, CpG islands, epigenetic markers, expression
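As a rough illustration of the "chromosome bias" item, here is a minimal Python sketch that compares the fraction of mapped reads on each chromosome with the fraction expected from chromosome length alone. The function name `chromosome_bias` and the `chrom_lengths` dictionary are hypothetical, and the uniform-coverage expectation is only an assumption: it will not hold for RNA-Seq or ChIP-Seq, where the ratio is better compared across samples than against 1.0.

```python
from collections import Counter

def chromosome_bias(sam_lines, chrom_lengths):
    """Return {chrom: observed_fraction / expected_fraction} for primary mapped reads.

    sam_lines: iterable of SAM text lines (header lines are skipped).
    chrom_lengths: hypothetical dict of chromosome name -> length in bases.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a valid alignment record
        flag, rname = int(fields[1]), fields[2]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records
        if flag & 0x904 or rname == "*":
            continue
        if rname in chrom_lengths:
            counts[rname] += 1

    total_reads = sum(counts.values())
    total_len = sum(chrom_lengths.values())
    if total_reads == 0:
        return {}
    # Ratio > 1 means more reads than length alone would predict
    return {
        chrom: (counts[chrom] / total_reads) / (length / total_len)
        for chrom, length in chrom_lengths.items()
    }
```

Ratios far from those seen in comparable samples (for example, an unexpected excess on chrM or a missing sex chromosome) would be the kind of red flag worth a closer look.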
I am sure this has already been implemented at a lot of the bigger sequencing cores - I just need a definitive list. Of course, many of these sanity checks will be triggered by the experiments themselves - the point is to develop a comprehensive checklist of analyses that will encompass both what we expect to see as well as what we don't.
Any final words on the final definitive list? I have been working on a way to show "positional diversity" in FastQ reads: http://code.google.com/p/seqdiverse/wiki/SeqDiverse Basically an analysis of the diversity of k-mers.
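To make the idea of "positional diversity" concrete, here is a small Python sketch. It is not how SeqDiverse itself works (I have not looked at its internals); it simply assumes Shannon entropy of the k-mers starting at each read position as the diversity measure, and assumes plain four-line FASTQ records. The function names and the default `k=3` are my own choices for illustration.

```python
import math
from collections import Counter, defaultdict

def read_fastq_sequences(handle):
    """Yield sequences from an iterable of FASTQ lines (assumes 4 lines per record)."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip().upper()

def positional_kmer_entropy(sequences, k=3):
    """Return a list where entry i is the Shannon entropy (bits) of k-mers starting at position i."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for pos in range(len(seq) - k + 1):
            counts[pos][seq[pos:pos + k]] += 1

    entropy = []
    for pos in sorted(counts):
        total = sum(counts[pos].values())
        # H = -sum(p * log2(p)) over the k-mers observed at this position
        h = -sum((n / total) * math.log2(n / total) for n in counts[pos].values())
        entropy.append(h)
    return entropy
```

With enough reads, a position whose entropy drops well below its neighbours (the maximum is 2k bits for k-mers over A/C/G/T) would suggest an adapter, barcode, or priming artifact at that offset.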
great program, although I think raw tabular output would be welcome to developers
great question! also interested to see strand bias relative to annotations for RNA-Seq.
@Bio_X2Y, that hexamer bias is visible in the FastQC output.
I suppose a k-mer analysis for every dataset would not be unreasonable.
I'm not sure if there's value in checking for this, but different platforms can introduce different biases. E.g. Illumina's random priming isn't really random: http://www.ncbi.nlm.nih.gov/pubmed/20395217
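One simple way to look for this kind of start-of-read bias is to tabulate base frequencies at each of the first few read positions. The sketch below is only that, a sketch: the function name `base_composition` and the default of 15 positions are arbitrary choices of mine, not something taken from the paper or from any particular tool.

```python
from collections import Counter

def base_composition(sequences, n_positions=15):
    """Return a list of {base: frequency} dicts for the first n_positions of the reads.

    Bases other than A/C/G/T (e.g. N) count towards the total but are not reported.
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in sequences:
        for pos, base in enumerate(seq[:n_positions].upper()):
            counts[pos][base] += 1

    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return freqs
```

If the frequencies in the first positions differ markedly from those further into the read, that is consistent with a priming or ligation bias, although it does not by itself say which.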
@brentp - thanks! I was aiming to highlight that platform-specific biases exist; I just used this as an example because I don't know of any others :)