Question

What Are Some Sanity Checks That Should Be Performed On NGS Data?

17

14.4 years ago

Jeremy Leipzig 23k

I am compiling a wish list of analyses that should be run on every data set coming out of a sequencing facility, regardless of whether the sequence is for RNA-Seq, SNP calling, ChIP-Seq, or possibly de novo sequencing. The goal is to scan for red flags that might indicate something has gone awry, either in the lab or downstream. I want a list of "sanity checks" that covers both sequence quality analysis and what can be gleaned from alignments.

For example,

Sequence QA - basecalling bias, read quality, yield, throughput, GC bias, 5'/3' motifs?, restriction enzyme bias
Barcode distribution (if barcoded)
Alignment QA
- chromosome bias,
- annotational biases (whether experimentally induced or not)
  - genes, repeats, CpG islands, epigenetic markers, expression

I am sure this has already been implemented at a lot of the bigger sequencing cores; I just need a definitive list. Of course, many of these sanity checks will be triggered by the experiments themselves. The point is to develop a comprehensive checklist of analyses that covers both what we expect to see and what we don't.

next-gen sequencing pipeline quality • 6.6k views

ADD COMMENT • link updated 6.2 years ago by Biostar 20 • written 14.4 years ago by Jeremy Leipzig 23k

3

Any final words on the final definitive list? I have been working on a way to show "positional diversity" in FastQ reads: http://code.google.com/p/seqdiverse/wiki/SeqDiverse Basically an analysis of the diversity of k-mers.
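To illustrate the general idea of positional k-mer diversity (this is not the SeqDiverse code, just a minimal sketch), one option is to compute the Shannon entropy of the k-mers observed at each start position across all reads. The function name `positional_kmer_entropy`, the choice of entropy as the diversity measure, and the default `k=3` are all assumptions made for the example.

```python
import math
from collections import Counter


def positional_kmer_entropy(reads, k=3):
    """Return Shannon entropy (bits) of k-mers at each start position.

    `reads` is any iterable of sequence strings. Reads shorter than k
    contribute nothing. Entropy as the diversity measure is an assumption
    of this sketch, not necessarily what any particular tool reports.
    """
    counts_by_pos = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            counts_by_pos.setdefault(i, Counter())[read[i:i + k]] += 1

    entropies = []
    for pos in sorted(counts_by_pos):
        counts = counts_by_pos[pos]
        total = sum(counts.values())
        # Shannon entropy: -sum(p * log2(p)) over observed k-mers
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies


if __name__ == "__main__":
    toy_reads = ["ACGT", "ACGA", "ACGG", "ACGC"]
    for pos, h in enumerate(positional_kmer_entropy(toy_reads, k=2)):
        print(pos, round(h, 3))
```

Positions where the entropy drops sharply relative to the rest of the read (for example at the 5' end) would be candidates for priming bias or adapter contamination.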

ADD REPLY • link 13.8 years ago by Justin Brown ▴ 40

0

great program, although I think raw tabular output would be welcome to developers

ADD REPLY • link 13.1 years ago by Jeremy Leipzig 23k

1

great question! also interested to see strand bias relative to annotations for RNA-Seq.

ADD REPLY • link 14.4 years ago by brentp 24k

1

@Bio_X2Y, that hexamer bias is visible in the FastQC output.

ADD REPLY • link 14.4 years ago by brentp 24k

0

I suppose a kmer analysis for every dataset would not be unreasonable

ADD REPLY • link 14.4 years ago by Jeremy Leipzig 23k

0

I'm not sure if there's value in checking for this, but different platforms can introduce different biases. E.g. Illumina's random priming isn't really random: http://www.ncbi.nlm.nih.gov/pubmed/20395217

ADD REPLY • link 14.4 years ago by Bio_X2Y ★ 4.4k

0

@brentp - thanks! I was aiming to highlight that platform-specific biases exist, I just used this as an example because I don't know of any others :)

ADD REPLY • link 14.4 years ago by Bio_X2Y ★ 4.4k

score 15 · Answer 1 · 2010-12-14

15

14.4 years ago

Bio_X2Y ★ 4.4k

We use FastQC to perform a barrage of quality checks - you might get some useful ideas there.

We also quantify the fraction of rRNA reads in our Illumina GA datasets - we hope to see around 4-6%.

ADD COMMENT • link 14.4 years ago by Bio_X2Y ★ 4.4k

3

+1 for FASTQC, it's the starting point for all our analyses.

ADD REPLY • link 14.4 years ago by brentp 24k

1

People landing here, check out MultiQC - it works with FastQC to make a nice combined report for all reads in a directory.

ADD REPLY • link 6.3 years ago by chris86 ▴ 400

0

Yes, I am seeing a few checks I hadn't listed, as well as some great ideas for visualization:
- sequence length distribution
- sequence duplication levels
- overrepresented sequences

ADD REPLY • link 14.4 years ago by Jeremy Leipzig 23k

score 7 · Answer 2 · 2010-12-15

In addition to running FastQC on every lane of sequencing, in my mapping pipeline I record the number of:

Raw (purity filtered) reads
Unmappable reads
Multimapping reads
Uniquely mapping reads
Final number of reads after removing duplicates with Picard

These metrics tell us a few things. For example, certain types of experiments (such as DNA methylation pull-downs) are expected to have more multimapping reads, and the percentage of reads removed as duplicates really goes up when we're scraping the bottom of the tube in terms of how much template we manage to get into library prep. Of course, interpretation of these numbers really depends on the biological experiment going on.
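As a rough sketch of how counts like these could be turned into comparable per-lane fractions, something along the following lines might work. The function name `summarize_mapping`, the dictionary keys, and the assumption that duplicate removal is applied to the uniquely mapping reads are all illustrative choices, not a description of the actual pipeline.

```python
def summarize_mapping(raw, unmapped, multimapped, unique, deduplicated):
    """Convert per-lane read counts into fractions for quick comparison.

    Assumes (hypothetically) that `deduplicated` is the number of uniquely
    mapping reads left after duplicate removal.
    """
    if raw <= 0:
        raise ValueError("raw read count must be positive")

    return {
        "unmapped_fraction": unmapped / raw,
        "multimapped_fraction": multimapped / raw,
        "unique_fraction": unique / raw,
        # Share of uniquely mapping reads that were removed as duplicates
        "duplicate_fraction": (unique - deduplicated) / unique if unique else 0.0,
        # Basic bookkeeping check: the three categories should add up to raw
        "counts_consistent": unmapped + multimapped + unique == raw,
    }


if __name__ == "__main__":
    # Made-up numbers, purely for illustration
    summary = summarize_mapping(
        raw=1000, unmapped=100, multimapped=200, unique=700, deduplicated=630
    )
    for name, value in summary.items():
        print(name, value)
```

Tracking the fractions rather than raw counts makes lanes with different yields easier to compare, though what counts as "too high" still depends on the experiment.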

score 2 · Answer 3 · 2011-06-16

2

13.9 years ago

Paige ▴ 40

I definitely agree with the above post. Another very useful metric is library complexity, which can be generated by running Picard's MarkDuplicates tool.

ADD COMMENT • link 13.9 years ago by Paige ▴ 40

1

I agree that library complexity is an important metric. However, simply counting duplicates is perhaps an overly simplistic assessment of library complexity. At the very least the number of duplicates should be relative to the total number of reads in the library...
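A minimal sketch of both ideas, assuming you already have the total number of reads examined and the number of distinct (non-duplicate) reads: the first function normalizes duplicates by the total, and the second goes one step further by solving a Lander-Waterman-style relationship, C/X = 1 - exp(-N/X), for the library size X by bisection. The function names are hypothetical, and the model assumes uniform random sampling from the library, which real libraries only approximate.

```python
import math


def duplicate_fraction(total_reads, unique_reads):
    """Duplicates expressed relative to the total number of reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return (total_reads - unique_reads) / total_reads


def estimate_library_size(total_reads, unique_reads, iterations=100):
    """Estimate the number of distinct molecules in the library.

    Solves C/X = 1 - exp(-N/X) for X, where N is total reads and C is
    unique reads. Returns None when no duplicates were observed, since
    the estimate is then unbounded.
    """
    if not 0 < unique_reads <= total_reads:
        raise ValueError("need 0 < unique_reads <= total_reads")
    if unique_reads == total_reads:
        return None

    def f(x):
        return unique_reads / x - 1.0 + math.exp(-total_reads / x)

    lo = float(unique_reads)  # f(lo) > 0 at X = C
    hi = lo
    while f(hi) > 0:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(iterations):  # plain bisection
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # Made-up counts, purely for illustration
    print(duplicate_fraction(1_000_000, 800_000))
    print(estimate_library_size(1_000_000, 800_000))
```

Two libraries with the same duplicate fraction at different sequencing depths can have very different estimated sizes, which is one reason a raw duplicate count on its own can mislead.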

ADD REPLY • link 13.5 years ago by Malachi Griffith 20k