Hi, this tool may save you time: it does filtering and QC on fastq data automatically.
Note: the following introduction is out of date and the newer AfterQC is much more powerful; please check the GitHub page for updates.
AfterQC
project on github: https://github.com/OpenGene/AfterQC
sample report: http://opengene.org/AfterQC/report.html
Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data
AfterQC goes through all fastq files in a folder and outputs three folders: good, bad and QC, which contain the good reads, the bad reads and the QC results of each fastq file/pair.
Currently it supports processing data from HiSeq 2000/2500/3000/4000, X10, X5, NextSeq 500/550, MiniSeq...
Features:
AfterQC does the following tasks automatically:
- Filters reads with too low quality, too short length or too many N
- Filters reads with abnormal PolyA/PolyT/PolyC/PolyG sequences
- Does per-base quality control and plots the figures
- Trims reads at front and tail, according to QC results
- For paired-end sequencing data, AfterQC automatically corrects low-quality wrong bases in the overlapped area of read1/read2
- Detects and eliminates bubble artifacts caused by fluid dynamics issues in the sequencer
- Single molecule barcode sequencing support: if all reads have a single molecule barcode (see duplex sequencing), AfterQC shifts the barcodes from the reads to the fastq query names
- Supports both single-end and paired-end sequencing data
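The first two filtering rules in the list above can be illustrated with a minimal sketch. This is not AfterQC's own code: the function names and every threshold (`min_len`, `min_mean_qual`, `max_n`, `max_poly_run`) are hypothetical values chosen for illustration, and the quality string is assumed to use Phred+33 encoding.

```python
from itertools import groupby


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


def is_good_read(seq, qual, min_len=35, min_mean_qual=20,
                 max_n=5, max_poly_run=35):
    """Return True if the read passes all four illustrative checks.

    All thresholds are hypothetical, not AfterQC defaults.
    """
    if len(seq) < min_len:                       # too short
        return False
    if seq.upper().count("N") > max_n:           # too many N
        return False
    if mean_quality(qual) < min_mean_qual:       # too low quality
        return False
    if longest_homopolymer(seq) > max_poly_run:  # abnormal PolyX
        return False
    return True
```

A read that fails any check would go to the 'bad' folder, and the rest to 'good'.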
Dependency:
AfterQC uses the editdistance module; run the following before using AfterQC:
pip install editdistance
WARNING: If you haven't installed the editdistance module, AfterQC will use a Python implementation of edit distance, but it will be extremely slow.
Simple usage:
1, Prepare your fastq files in a folder
2, For single-end sequencing, the filenames in the folder should match *R1*
For paired-end sequencing, the filenames in the folder should match *R1* and *R2*
cd /path/to/fastq/folder
python path/to/AfterQC/after.py
Two folders will be generated automatically: the folder 'good' stores the good reads and the folder 'bad' stores the bad reads.
AfterQC will print some statistics when it is done, such as how many reads are good, how many are bad, and how many were corrected.
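To make the naming convention in step 2 concrete, here is a rough sketch of how files in a folder could be matched into read1/read2 pairs by name. It only illustrates the *R1* / *R2* convention; the function `pair_fastq_files` is hypothetical and is not how AfterQC itself necessarily discovers files.

```python
def pair_fastq_files(filenames):
    """Pair each *R1* file with its *R2* mate, if one exists.

    Returns a list of (read1, read2) tuples; read2 is None for
    single-end data. Purely illustrative, based on filenames only.
    """
    names = set(filenames)
    pairs = []
    for name in sorted(names):
        if "R1" not in name:
            continue  # only R1 files start a pair
        mate = name.replace("R1", "R2", 1)
        pairs.append((name, mate if mate in names else None))
    return pairs
```

For example, a folder holding `s1_R1.fq`, `s1_R2.fq` and `s2_R1.fq` would give one paired-end sample and one single-end sample.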
Quality Control only
If you only want to get quality control statistics, run:
python after.py --qc_only
Understand the report
AfterQC will generate a QC folder, which contains lots of figures.
- For paired-end sequencing data, both read1 and read2 figures will be in the same folder, which is named after read1's filename. R1 means read1, R2 means read2.
- For single-end sequencing data, the figures are still labelled R1.
- prefilter means before filtering, postfilter means after filtering.
- For paired-end sequencing data, AfterQC will do an overlap analysis. read1 and read2 will overlap when read1_length + read2_length > DNA_template_length.
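The overlap condition can be worked through with a small sketch. It assumes the template length is already known and is at least as long as each read, which a real tool would have to infer by aligning the two reads; the function names are hypothetical and this is not AfterQC's implementation.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_length(read1_len, read2_len, template_len):
    """Number of template bases covered by both reads (0 if none)."""
    start = max(0, template_len - read2_len)  # where read2 begins
    end = min(read1_len, template_len)        # where read1 ends
    return max(0, end - start)


def overlap_mismatches(read1, read2, template_len):
    """Count disagreeing bases in the overlapped area.

    Assumes template_len >= len(read1) and template_len >= len(read2).
    read2 is given as sequenced, so it is reverse-complemented first.
    """
    r2 = reverse_complement(read2)
    start = template_len - len(read2)
    return sum(1 for pos in range(start, len(read1))
               if read1[pos] != r2[pos - start])
```

With two 150 bp reads and a 200 bp template, `overlap_length(150, 150, 200)` gives 100, since 150 + 150 > 200; with a 400 bp template the sum is smaller than the template and the result is 0.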
Hello,
I've got a few questions about the calcs in AfterQC. In the AfterQC paper, you note that "AfterQC can detect the mismatches in the overlapping regions. For those reads with very long overlap (i.e. overlap_len>50)".
In the estimated seq error field in the html report, are only overlaps greater than 50bp considered? And are the errors in these overlaps the only component that goes into the seq error rate calculation?
If only overlaps greater than 50bp go into the calculation, could you please let me know where I should change the source to modify that number (my guess is complete_compare_require in util.py)?
Thanks very much for the software!
Please don't post new questions in the answer section. New Questions need to be asked separately. This post will be moved to a comment.