3.4 years ago
William
What are your favorite QC tools in 2021 for pre- and post-alignment and variant calling?
Especially if you are dealing with many large whole-genome-sequenced samples.
My setup is currently:
FASTQ:
- FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- fastp: https://github.com/OpenGene/fastp
BAM:
- Samtools stats: http://www.htslib.org/doc/samtools-stats.html
- QualiMap: http://qualimap.conesalab.org/
VCF:
- BCFtools stats: http://samtools.github.io/bcftools/bcftools.html#stats
Summary QC report over many samples and FASTQ/BAM/VCF reports:
- MultiQC: https://multiqc.info/
This does a reasonable job. But a few things could be improved:
- MultiQC works well for small to medium sample sets, but not for 500+ samples. The report becomes difficult to interpret, and it is difficult to select samples with issues (e.g. sequencing issues) and drill down to them. All you get is a bee-swarm plot, with no way to find out which samples have strange QC values.
- Samtools stats does not report coverage by itself. You need to calculate it yourself by dividing bases mapped by genome size.
- QualiMap makes nice reports, but its (Java) CPU and memory usage is unreasonable, especially for large genomes and many samples.
- QualiMap reports are difficult to summarize over many samples in MultiQC.
- BCFtools stats can't output the desired per-sample stats in one pass. One pass per sample gives the best stats that can be loaded into MultiQC, but that is slow: either a 100GB+ BCF file is read many hundreds of times, or the multi-sample file first has to be split into single-sample BCFs.
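For reference, the coverage calculation from the Samtools stats point above (bases mapped divided by genome size) can be sketched in a few lines of Python. This is only a rough sketch: it assumes the `SN` summary line `bases mapped (cigar):` from `samtools stats` output, and the function name `mean_coverage` and the toy genome size are made up for illustration.

```python
def mean_coverage(stats_text: str, genome_size: int) -> float:
    """Estimate mean depth as mapped bases divided by genome size."""
    for line in stats_text.splitlines():
        # Summary-number lines are tab-separated: SN, key, value, optional comment
        fields = line.split("\t")
        if (
            len(fields) >= 3
            and fields[0] == "SN"
            and fields[1] == "bases mapped (cigar):"
        ):
            return int(fields[2]) / genome_size
    raise ValueError("no 'bases mapped (cigar):' line found in stats output")


# Toy input (hypothetical numbers), shaped like the SN section of samtools stats
example_stats = (
    "SN\traw total sequences:\t1000\n"
    "SN\tbases mapped (cigar):\t90000\t# more accurate\n"
)
example_genome_size = 3000  # hypothetical; use the real reference length in practice

if __name__ == "__main__":
    print(mean_coverage(example_stats, example_genome_size))
```

In practice the text would come from the file written by `samtools stats sample.bam > sample.stats`, and the genome size could be taken as the sum of the sequence lengths of the reference. Note that this gives an average over the whole genome and says nothing about how evenly the coverage is distributed.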
So I am wondering what other people use: which tools, and how do you summarize the QC results over many samples while still being able to drill down to samples with issues?
It sounds like your use case is an outlier. Few people have hundreds of samples so the programs you mention above likely work reasonably well for most.
How about:
It is possible to increase the MultiQC sample number limits for interactive plots and tables by creating a modified ~/.multiqc_config.yaml in your home dir; see the example at https://github.com/ewels/MultiQC/blob/44f28ef0726bc65fd965aa99d5a19f7745c749c4/test/config_example.yaml The limit for the interactive table was 500, so I was just above it. After increasing the sample limit to 5000, the interactive table works with 500+ samples, but the interactive plots are very slow, and so is the HTML report as a whole.
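Since the interactive report gets slow at this size, one possible way to still drill down to problem samples is to screen the tab-separated table that MultiQC writes next to the report (`multiqc_data/multiqc_general_stats.txt`, with a `Sample` column first). The Python below is a rough sketch, not a tested pipeline: the function name `flag_outliers`, the column name `mean_coverage`, and the cutoff of 5 median absolute deviations are assumptions for illustration, and the real column names depend on which MultiQC modules ran.

```python
import csv
import io
import statistics


def flag_outliers(tsv_text, column, n_mads=5.0):
    """Return (sample, value) pairs lying more than n_mads MADs from the median."""
    values = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        raw = row.get(column)
        if raw not in (None, ""):  # skip samples with no value for this metric
            values[row["Sample"]] = float(raw)
    if not values:
        return []
    median = statistics.median(values.values())
    # Median absolute deviation: a spread estimate that outliers barely affect
    mad = statistics.median(abs(v - median) for v in values.values())
    if mad == 0:
        # No spread at all: anything that differs from the median stands out
        return sorted((s, v) for s, v in values.items() if v != median)
    return sorted(
        (s, v) for s, v in values.items() if abs(v - median) / mad > n_mads
    )


# Toy table (hypothetical values) in the same tab-separated layout
example_table = (
    "Sample\tmean_coverage\n"
    "s1\t30\n"
    "s2\t31\n"
    "s3\t29\n"
    "s4\t30\n"
    "s5\t5\n"
)

if __name__ == "__main__":
    print(flag_outliers(example_table, "mean_coverage"))
```

The flagged sample names can then be used to open only the per-sample reports of interest, or to build a much smaller MultiQC report restricted to those samples.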