Question

What Are The Best Qc Tools For Exome Seq, Rna Seq, Single/Paired End, Etc?

3

Entering edit mode

11.1 years ago

bw. ▴ 260

What tools are people using for QC'ing the initial stages of NGS data analysis?

I'm currently using FastQC and picard - but have been thinking of ways to improve on this (which led to the feature list below). I would appreciate tool recommendations and/or comments on features you like or don't like or feel are missing in existing tools.

Thanks, -Ben

Ps. List of desired features:

FastQC-style
- user-friendly web interface
- very easy to download, install, run
Picard-style
- has stats for both single and paired-end data, and for both unaligned (eg. %GC content) and aligned (eg. insert size, coverage)
- outputs data tables that are easy to read into R, etc.
Extensibility
- would be nice to have a simple way to add custom graphs and stats without having to recompile. Though Picard and FastQC are open source and modular, they are written in Java which adds complexity and reduces the potential developer base relative to a language like python. Perhaps a better options could be a toolkit that consists of language-agnostic sub-executables that, for example, write out image and data-table files to disk and pass some message to the parent process that allows it to combine all the data pieces into a coherent webpage.
Other
- would be good if the tool gave suggestions on typical causes for problems like excessive duplication or kmer spikes, instead of just reporting graphs/stats/warnings.
- the tool could simultaneously take .fastq and one or more processed bam files from a single sample, and then show both unaliged and aligned stats in one interface, as well as how the stats change as a result of the different processing steps in the NGS pipeline.

sequencing qualitycontrol pipeline • 5.3k views

ADD COMMENT • link updated 11.1 years ago by Michele Busby ★ 2.2k • written 11.1 years ago by bw. ▴ 260

score 1 · Answer 1 · 2013-11-05

1

Entering edit mode

11.1 years ago

Sean Davis 27k

Sounds like you might want to learn some R or python, both of which can ingest BAM files and have graphics and stats capabilities. In R, the Rsamtools, ShortRead, and GenomicRanges packages are of interest. In python, look at pysam, pandas/numpy, matplotlib, and ggplot2.

ADD COMMENT • link 11.1 years ago by Sean Davis 27k

score 1 · Answer 2 · 2013-11-05

Personally, I've been working on CG-Pipeline and its raw read metrics and assembly metrics scripts. I think that for command line, these two scripts are very user-friendly. I update the source code very regularly and so there might be some newer features in my SVN branch.
https://sourceforge.net/projects/cg-pipeline/
https://sourceforge.net/p/cg-pipeline/code/HEAD/tree/cg_pipeline/branches/lkatz/

It doesn't give you graphs, but it does output tables. Below is the help that the scripts generate when you run them without options. Example usage is given, and complete command line options are shown. My scripts are more thorough and so they are slower, but that is offset by being multithreaded and by the --fast option in the readMetrics script.

And then, for what to do about bad reads, I have made several read cleaning scripts in CG-Pipeline. I describe them here:
https://sourceforge.net/apps/mediawiki/cg-pipeline/index.php?title=In_depth#lower-level_assembly_scripts

Basically, you can remove duplicate reads; trim and remove based on quality scores; and/or trim based on ACGT skew. Please let me know if you have any questions.

@login CGP]$ run_assembly_readMetrics.pl
run_assembly_readMetrics.pl: PipelineRunner::main: Prints useful assembly statistics
  Usage: run_assembly_readMetrics.pl reads.fasta
         run_assembly_readMetrics.pl reads.fasta | column -t
    A reads file can be fasta, sff, or fastq
    The quality file for a fasta file reads.fasta is assumed to be reads.fasta.qual
  --fast for fast mode: samples 1% of the reads and extrapolates
  -n 1 to specify the number of cpus (default: all cpus)
  --qual_offset 33
    Set the quality score offset (usually it's 33, so the default is 33)
  --minLength 1
    Set the minimum read length used for calculations
  -e 4000000 expected genome size, in bp
  --hist to generate a histogram of the reads


@login CGP]$ run_assembly_metrics.pl
run_assembly_metrics.pl: PipelineRunner::main: Prints useful assembly statistics
  Usage: run_assembly_metrics.pl assembly1.fasta [-e expectedGenomeLength] [-m minContigLength]
  OPTIONS
  -e genome length
    helps with N50 calculation but not necessary
  -m size
    only consider contigs of size m in the calculations
  -number
    Output only the number of the metric(s) you supplied and not the header. Works well with -s
  --numcpus 1 The number of threads to use
  -s stat (in contrast, see -a)
    only print out the value for this stat (or these stats) only.
    Possible values: N50, genomeLength, longestContig, numContigs, avgContigLength, assemblyScore, minContigLength, expectedGenomeLength, kmer21
    Default output: genomeLength, N50, numContigs, assemblyScore
  -a to output all stats. To select only some stats, use -s
  The assembly score is calculated as a log of (N50/numContigs * percentOfGenomeCovered)
    The percent of the genome covered is 1 if you do not supply an expected genome length.
    The percent of the genome covered counts against you if you assemble higher than the expected genome size.
  EXAMPLES
    run_assembly_metrics.pl *.fasta | sort -k 5 -n | column -t # pretty output and sorted
    run_assembly_metrics.pl *.fasta -a | column -t             # get all stats
    N50=`run_assembly_metrics.pl assembly.fasta -n -s N50`     # set a bash variable equal to an assembly's N50

score 1 · Answer 3 · 2013-11-05

1

Entering edit mode

11.1 years ago

Michele Busby ★ 2.2k

For RNA Seq you can look at David Deluca's RNASeq QC: http://www.broadinstitute.org/cancer/cga/rna-seqc

ADD COMMENT • link 11.1 years ago by Michele Busby ★ 2.2k