What Are Some Sanity Checks That Should Be Performed On NGS Data?
14.0 years ago

I am compiling a wish list of analyses that should be run on every data set coming out of a sequencing facility, regardless of whether the sequencing is for RNA-Seq, SNP calling, ChIP-Seq, or possibly de novo sequencing. The goal is to scan for red flags that would indicate something has gone awry, either in the lab or downstream. I want a list of "sanity checks" that encompasses both sequence quality analysis and what can be gleaned from alignments.

For example,

  • Sequence QA - basecalling bias, read quality, yield, throughput, GC bias, 5'/3' motifs?, restriction enzyme bias

  • Barcode distribution (if barcoded)

  • Alignment QA

    • chromosome bias
    • annotation biases (whether experimentally induced or not)
      • genes, repeats, CpG islands, epigenetic markers, expression
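
As a rough illustration of two of the sequence-level checks listed above (GC bias and basecalling bias), here is a minimal Python sketch. It assumes the reads have already been parsed out of the FASTQ file into plain strings; the function names `gc_fraction` and `per_position_composition` are made up for this example and do not come from any particular tool.

```python
from collections import Counter


def gc_fraction(seq):
    """Fraction of G/C among the unambiguous (A/C/G/T) bases of one read."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def per_position_composition(reads):
    """Return one Counter of base calls per read position (0-based)."""
    columns = []
    for read in reads:
        for i, base in enumerate(read.upper()):
            if i == len(columns):
                columns.append(Counter())
            columns[i][base] += 1
    return columns


# Toy reads, purely for illustration.
reads = ["ACGTAC", "GGGTAC", "ACGTTT"]
gc_values = [gc_fraction(r) for r in reads]
composition = per_position_composition(reads)
```

Plotting a histogram of `gc_values` against the GC content expected for the organism, and the per-position base frequencies along the read, would cover roughly the same ground as the corresponding FastQC plots.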

I am sure this has already been implemented at a lot of the bigger sequencing cores - I just need a definitive list. Of course, many of these sanity checks will be triggered by the experiments themselves - the point is to develop a comprehensive checklist of analyses that will encompass both what we expect to see as well as what we don't.

next-gen sequencing pipeline quality • 6.0k views

Any final words on the final definitive list? I have been working on a way to show "positional diversity" in FastQ reads (http://code.google.com/p/seqdiverse/wiki/SeqDiverse); basically, an analysis of the diversity of k-mers.


great program, although I think raw tabular output would be welcome to developers


great question! also interested to see strand bias relative to annotations for RNA-Seq.


@Bio_X2Y, that hexamer bias is visible in the FastQC output.


I suppose a k-mer analysis for every dataset would not be unreasonable.
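
For anyone wondering what a per-position k-mer analysis might look like, here is a minimal Python sketch. It is not the SeqDiverse algorithm (I have not checked how that tool computes its scores); it simply assumes that "diversity" means the number of distinct k-mers starting at a position divided by the number of k-mers observed there. The function name and the default `k=3` are arbitrary choices for the example.

```python
from collections import defaultdict


def positional_kmer_diversity(reads, k=3):
    """For each start position, return distinct k-mers / k-mers observed."""
    seen = defaultdict(set)    # position -> set of distinct k-mers
    totals = defaultdict(int)  # position -> number of k-mers observed
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            seen[i].add(read[i:i + k])
            totals[i] += 1
    return [len(seen[i]) / totals[i] for i in sorted(totals)]


# Toy reads: identical at the 5' end, diverging towards the 3' end.
diversity = positional_kmer_diversity(["ACGTA", "ACGTT", "ACGAA"], k=3)
```

Note that this ratio depends on sequencing depth, so it is only comparable between datasets that are subsampled to the same number of reads. A sharp dip at particular positions (for example at the 5' end) would be the kind of red flag worth a closer look.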


I'm not sure if there's value in checking for this, but different platforms can introduce different biases. E.g. Illumina's random priming isn't really random: http://www.ncbi.nlm.nih.gov/pubmed/20395217


@brentp - thanks! I was aiming to highlight that platform-specific biases exist; I just used this as an example because I don't know of any others :)

14.0 years ago
Bio_X2Y ★ 4.4k

We use FastQC to perform a barrage of quality checks - you might get some useful ideas there.

We also quantify the amount of rRNA reads in our Illumina GA datasets - we hope to see around 4-6%.


+1 for FastQC, it's the starting point for all our analyses.


People landing here, check out MultiQC - it works with FastQC to make a nice combined report for all reads in a directory.


Yes, I am seeing a few checks I hadn't listed, as well as some great ideas for visualization:

  • sequence length distribution
  • sequence duplication levels
  • overrepresented sequences
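
For a feel of what those three checks compute, here is a minimal Python sketch on reads held in memory as strings. It is only an approximation of the idea and not FastQC's actual implementation (which, among other things, works on a subset of the reads); the function names and the `min_fraction` threshold are assumptions made for the example.

```python
from collections import Counter


def length_distribution(reads):
    """Map read length -> number of reads with that length."""
    return Counter(len(r) for r in reads)


def duplicated_fraction(reads):
    """Fraction of reads that are exact copies of an earlier read."""
    return 1 - len(set(reads)) / len(reads) if reads else 0.0


def overrepresented(reads, min_fraction=0.001):
    """Sequences making up at least min_fraction of all reads, most common first."""
    total = len(reads)
    if total == 0:
        return []
    counts = Counter(reads)
    return [(seq, n / total) for seq, n in counts.most_common()
            if n / total >= min_fraction]


# Toy reads, purely for illustration.
reads = ["AAAA", "AAAA", "ACGT", "AC"]
lengths = length_distribution(reads)
dup = duplicated_fraction(reads)
top = overrepresented(reads, min_fraction=0.5)
```

Anything that shows up in `top` is worth comparing against known adapter and primer sequences.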

14.0 years ago
Aaron Statham ★ 1.1k

In addition to running FastQC on every lane of sequencing, in my mapping pipeline I record the number of

  • Raw (purity filtered) reads
  • Unmappable reads
  • Multimapping reads
  • Uniquely mapping reads
  • Final number of reads after removing duplicates with Picard

These metrics tell us a few things. For example, certain types of experiments are expected to have more multimapping reads (DNA methylation pull-downs), and the percentage of reads removed as duplicates really goes up when we're scraping the bottom of the tube, i.e. when very little template makes it into library prep. Of course, interpretation of these numbers really depends on the biological experiment going on.
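
The five counts above are easier to compare across lanes once they are turned into percentages. The following is a minimal Python sketch of that bookkeeping, not the pipeline described in this answer. It assumes that unmappable, multimapping and uniquely mapping reads add up to the raw read count, and that duplicate removal is applied to the uniquely mapping reads only; the function name `mapping_summary` and the dictionary keys are made up for the example.

```python
def mapping_summary(raw, unmapped, multimapped, unique, after_dedup):
    """Turn per-lane read counts into percentages for quick comparison."""
    if raw <= 0 or unique <= 0:
        raise ValueError("raw and unique read counts must be positive")
    # Assumption: the three mapping classes partition the raw reads.
    if unmapped + multimapped + unique != raw:
        raise ValueError("mapping classes do not add up to the raw read count")
    if after_dedup > unique:
        raise ValueError("cannot have more reads after deduplication than before")
    return {
        "pct_unmapped": 100 * unmapped / raw,
        "pct_multimapped": 100 * multimapped / raw,
        "pct_unique": 100 * unique / raw,
        # Assumption: duplicates were removed from the uniquely mapping reads.
        "pct_duplicates_removed": 100 * (unique - after_dedup) / unique,
    }


# Made-up counts, purely for illustration.
summary = mapping_summary(raw=1000, unmapped=100, multimapped=200,
                          unique=700, after_dedup=630)
```

Tracking these percentages per lane over time makes an unusual run stand out, although what counts as unusual still depends on the type of experiment.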


I'm working with viral samples and I've been running up against this duplicate read problem. Do you know of a reference where this combination of low DNA concentration and high duplicate read count is discussed? Many thanks.

13.5 years ago
Paige ▴ 40

I definitely agree with the above post. Another very useful metric is the library complexity, which can be generated by running Picard's MarkDuplicates tool.


I agree that library complexity is an important metric. However, simply counting duplicates is perhaps an overly simplistic assessment of library complexity. At the very least the number of duplicates should be relative to the total number of reads in the library...
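
To make the point about normalising concrete, here is a minimal Python sketch that reports duplicates as a fraction of all reads and also estimates the number of distinct molecules in the library. The estimate solves the Lander-Waterman style relation C/X = 1 - exp(-N/X) for X, where N is the number of reads examined and C the number of distinct reads observed. As far as I know this is the relation behind Picard's library size estimate, but treat that as an assumption; the function names here are made up and this is not Picard's code.

```python
import math


def duplicate_fraction(total, unique):
    """Duplicates as a fraction of all reads examined."""
    return (total - unique) / total


def estimate_library_size(total, unique):
    """Solve unique/X = 1 - exp(-total/X) for X by bisection."""
    if not 0 < unique <= total:
        raise ValueError("need 0 < unique <= total")
    if unique == total:
        return math.inf  # no duplicates seen: size cannot be bounded

    def f(x):
        return unique / x - 1 + math.exp(-total / x)

    lo = hi = float(unique)  # f(lo) > 0 at X = unique
    while f(hi) > 0:         # expand until the root is bracketed
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Made-up counts, purely for illustration.
frac = duplicate_fraction(total=1000, unique=800)
size = estimate_library_size(total=1000, unique=800)
```

Two libraries with the same raw duplicate count can have very different estimated sizes once the total number of reads is taken into account, which is the reason for normalising in the first place.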
