If you are processing long-read data (Nanopore, PacBio, Cyclone, etc.), you can try fastplong, which is based on the widely used tool fastp, with many optimizations specific to long-read FASTQ data.
simple usage
fastplong -i in.fq -o out.fq
Both input and output can be gzip compressed. By default, the HTML report is saved to `fastplong.html` (can be specified with the `-h` option), and the JSON report is saved to `fastplong.json` (can be specified with the `-j` option).
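As a small illustration of the report options, the wrapper below names both reports after the sample instead of using the defaults. The function name `run_qc` and the `.fq.gz` / `.clean.fq.gz` naming scheme are assumptions made for this sketch; only `-i`, `-o`, `-h` and `-j` come from fastplong itself.

```shell
# Hypothetical helper (not part of fastplong): run QC for one sample and
# name the HTML/JSON reports after it instead of fastplong.html/fastplong.json.
run_qc() {
  sample="$1"
  fastplong -i "${sample}.fq.gz" -o "${sample}.clean.fq.gz" \
    -h "${sample}.html" -j "${sample}.json"
}
```

Calling `run_qc sample1` would read `sample1.fq.gz` and write `sample1.clean.fq.gz`, `sample1.html` and `sample1.json` in the current directory.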
examples of report
`fastplong` creates reports in both HTML and JSON format.
- HTML report: https://opengene.org/fastplong/fastplong.html
- JSON report: https://opengene.org/fastplong/fastplong.json
get fastplong
install with Bioconda
conda install -c bioconda fastplong
download the latest prebuilt binary for Linux users
This binary was compiled on CentOS, and tested on CentOS/Ubuntu
# download the latest build
wget http://opengene.org/fastplong/fastplong
chmod a+x ./fastplong
or compile from source
`fastplong` depends on `libdeflate` and `isa-l` for fast decompression and compression of zipped data.
# get source (you can also use browser to download from master or releases)
git clone https://github.com/OpenGene/fastplong.git
# build
cd fastplong
make -j
# Install
sudo make install
input and output
Specify input by `-i` or `--in`, and specify output by `-o` or `--out`.
- if you don't specify the output file names, no output files will be written, but the QC will still be done for both data before and after filtering.
- the output will be gzip-compressed if its file name ends with `.gz`
## output to STDOUT
`fastplong` supports streaming the passing-filter reads to STDOUT, so that they can be passed to other compressors like `bzip2`, or to aligners like `minimap2` or `bowtie2`.
- specify `--stdout` to enable this mode to stream output to STDOUT
## input from STDIN
- specify `--stdin` if you want to read the STDIN for processing.
## store the reads that fail the filters
- give `--failed_out` to specify the file name to store the failed reads.
- if one read failed and is written to `--failed_out`, its `failure reason` will be appended to its read name. For example, `failed_quality_filter`, `failed_too_short` etc.
## process only part of the data
If you don't want to process all the data, you can specify `--reads_to_process` to limit the reads to be processed. This is useful if you want to have a fast preview of the data quality, or you want to create a subset of the filtered data.
## do not overwrite existing files
You can enable the option `--dont_overwrite` to protect existing files from being overwritten by `fastplong`. In this case, `fastplong` will report an error and quit if it finds that any of the output files (read, json report, html report) already exists.
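The options above can be combined for a quick preview run. The sketch below is only an illustration: the function names, the `preview.*` file names and the default of 10000 reads are assumptions, and the tally helper assumes an uncompressed `--failed_out` file with 4-line FASTQ records whose appended reasons look like `failed_<something>` (as in the two examples given above).

```shell
# Hypothetical helper: preview the first N reads, keep the rejected reads,
# and refuse to overwrite files left over from an earlier run.
preview_qc() {
  input="$1"
  n="${2:-10000}"
  fastplong -i "$input" -o preview.fq.gz --failed_out=preview.failed.fq \
    --reads_to_process="$n" --dont_overwrite
}

# Hypothetical helper: count failure reasons found on the header lines
# (every 4th line starting at line 1) of a --failed_out file.
count_fail_reasons() {
  awk 'NR % 4 == 1 && match($0, /failed_[a-z_]+/) { n[substr($0, RSTART, RLENGTH)]++ }
       END { for (r in n) print r "\t" n[r] }' "$1" | sort
}
```

After `preview_qc in.fq 5000`, running `count_fail_reasons preview.failed.fq` prints one tab-separated line per failure reason with its read count.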
filtering
Multiple filters have been implemented.
quality filter
Quality filtering is enabled by default, but you can disable it by `-Q` or `--disable_quality_filtering`. Currently it supports filtering by limiting the N base number (`-n, --n_base_limit`), and the percentage of unqualified bases.
To filter reads by their percentage of unqualified bases, two options should be provided:
- `-q, --qualified_quality_phred` the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified.
- `-u, --unqualified_percent_limit` how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40%
You can also filter reads by their average quality score:
- `-m, --mean_qual` if one read's average quality score is lower than the `--mean_qual` value, then this read is discarded. Default 0 means no requirement (int [=0])
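To see how the `-q` / `-u` pair relates to actual reads, the percentage of unqualified bases can be computed per read with `awk`. This is a minimal sketch, not fastplong's own code: the function name is hypothetical, it assumes 4-line FASTQ records with Phred+33 quality encoding, and it only reports the percentage (whether a read sitting exactly at the limit passes is not stated above).

```shell
# Hypothetical helper: print "<read name>\t<percent of bases below Q>" per read.
# $1 = uncompressed FASTQ file, $2 = qualified phred threshold (like -q, default 15).
unqualified_percent() {
  awk -v q="${2:-15}" '
    # Lookup table: printable ASCII character -> Phred score (Phred+33 assumed).
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i - 33 }
    NR % 4 == 1 { name = substr($1, 2) }        # header line without the leading "@"
    NR % 4 == 0 {                               # quality line
      n = length($0); low = 0
      for (i = 1; i <= n; i++) if (ord[substr($0, i, 1)] < q) low++
      printf "%s\t%.1f\n", name, (n ? 100 * low / n : 0)
    }' "$1"
}
```

Reads whose reported percentage is well above the `-u` value are the ones the quality filter is expected to discard.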
length filter
Length filtering is enabled by default, but you can disable it by `-L` or `--disable_length_filtering`. The minimum length requirement is specified with `-l` or `--length_required`.
You can specify `--length_limit` to discard the reads longer than `length_limit`. The default value 0 means no limitation.
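Before choosing `-l` and `--length_limit` values, it can help to check how many reads a given range would keep. The sketch below applies the rule as described above (keep reads at least the minimum length and, unless the limit is 0, no longer than the limit); the function name is hypothetical and it assumes an uncompressed FASTQ file with 4-line records.

```shell
# Hypothetical helper: print "<reads kept>/<total reads>" for a length range.
# $1 = FASTQ file, $2 = minimum length (like -l), $3 = maximum length
# (like --length_limit, 0 or omitted = no upper limit).
count_length_pass() {
  awk -v min="${2:-0}" -v max="${3:-0}" '
    NR % 4 == 2 {                               # sequence line
      n = length($0); total++
      if (n >= min && (max == 0 || n <= max)) pass++
    }
    END { printf "%d/%d\n", pass, total }' "$1"
}
```

For example, `count_length_pass reads.fq 1000 50000` reports how many reads fall between 1000 and 50000 bases.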
Other filter
New filters are being implemented. If you have a new idea or new request, please file an issue.
adapters
`fastplong` trims adapters at both the read start and the read end. Adapter trimming is enabled by default, but you can disable it by `-A` or `--disable_adapter_trimming`.
fastplong -i in.fq -o out.fq -s AAGGATTCATTCCCACGGTAACAC -e GTGTTACCGTGGGAATGAATCCTT
If the adapter sequences are known, it's recommended to specify `-s, --start_adapter` for the read start adapter sequence, and `-e, --end_adapter` for the read end adapter sequence as well.
If `--end_adapter` is not specified but `--start_adapter` is specified, then fastplong will use the reverse complement sequence of `start_adapter` as `end_adapter`.
You can also specify `-a, --adapter_fasta` to give a FASTA file to tell `fastplong` to trim multiple adapters in this FASTA file. Here is a sample of such an adapter FASTA file:
>Adapter 1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>Adapter 2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>polyA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
The adapter sequence in the FASTA file should be at least 6bp long, otherwise it will be skipped. You can give any sequence you want to trim, not only regular sequencing adapters (e.g. polyA).
If none of these adapter options (`start_adapter`, `end_adapter` and `adapter_fasta`) is specified, `fastplong` will try to detect the read start and read end adapters automatically. The detected adapter sequences may be a bit shorter or longer than the real ones. There is also a certain probability of misidentification, especially when most reads don't have adapters (the result will not be too bad in this case).
fastplong calculates edit distance when detecting adapters. You can specify `-d, --distance_threshold` to adjust the mismatch tolerance of adapter comparison. The default value is 0.25, which means allowing a 25% mismatch ratio (e.g. an edit distance of up to 10 for a 40bp adapter). Consider increasing this value when the data is very noisy (high error rate), and decreasing it when the data is of high quality (low error rate).
To make trimming cleaner, fastplong will trim a few more bases adjacent to the adapters. This can be specified with `--trimming_extension`, with a default value of 10.
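The reverse-complement default can be checked by hand: the `-e` sequence in the example command above is exactly the reverse complement of its `-s` sequence. The helper below is a sketch for plain A/C/G/T sequences (the function name is an assumption, and IUPAC ambiguity codes are not handled); the last line simply reproduces the 25% of 40bp arithmetic, without claiming how fastplong rounds for other lengths.

```shell
# Hypothetical helper: reverse complement of a DNA sequence (A/C/G/T only).
revcomp() {
  printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca' |
    awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }'
}

# The start adapter from the example command gives the end adapter used there.
revcomp AAGGATTCATTCCCACGGTAACAC   # prints GTGTTACCGTGGGAATGAATCCTT

# Edit-distance budget for the documented example: 0.25 * 40bp.
awk 'BEGIN { printf "%d\n", 0.25 * 40 }'   # prints 10
```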
per read cutting by quality score
`fastplong` supports per read sliding window cutting by evaluating the mean quality scores in the sliding window. `fastplong` supports 2 different operations, and you can enable one or both:
- `-5, --cut_front` move a sliding window from front (5') to tail, drop the bases in the window if its mean quality is below cut_mean_quality, stop otherwise. Default is disabled. The leading N bases are also trimmed. Use `cut_front_window_size` to set the window size, and `cut_front_mean_quality` to set the mean quality threshold. If the window size is 1, this is similar to the Trimmomatic `LEADING` method.
- `-3, --cut_tail` move a sliding window from tail (3') to front, drop the bases in the window if its mean quality is below cut_mean_quality, stop otherwise. Default is disabled. The trailing N bases are also trimmed. Use `cut_tail_window_size` to set the window size, and `cut_tail_mean_quality` to set the mean quality threshold. If the window size is 1, this is similar to the Trimmomatic `TRAILING` method.
If you don't set the window size and mean quality threshold for these functions individually, `fastplong` will use the values from `-W, --cut_window_size` and `-M, --cut_mean_quality`.
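A possible invocation that enables both operations with shared settings is sketched below. The function name is hypothetical, and the window size of 20 and mean quality of 15 are illustrative values only, not documented defaults or recommendations.

```shell
# Hypothetical helper: trim low-quality bases from both read ends,
# using one window size (-W) and one mean quality threshold (-M) for both.
trim_low_quality_ends() {
  fastplong -i "$1" -o "$2" -5 -3 -W 20 -M 15
}
```

For example, `trim_low_quality_ends in.fq out.fq.gz` would cut from the 5' and 3' ends with the same window settings.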
global trimming
`fastplong` supports global trimming, which means trimming all reads at the front or the tail. This function is useful since sometimes you want to drop some cycles of a sequencing run.
For example, the last cycle usually has low quality, and it can be dropped with the `-t 1` or `--trim_tail=1` option.
- The front/tail trimming settings are given with `-f, --trim_front` and `-t, --trim_tail`.
Thanks for making a version of `fastp` for long reads.
While running some nanopore fastq data through the tool I noticed that the Q scores reported by `fastplong` appear to be significantly different than those reported by `PycoQC` (which parses the `sequencing summary` report produced during the run). In fact they appear to be 2x (in the normal range of Q0-45 with `fastplong` compared to Q0-25 with `PycoQC`). Any idea why?
That is strange.
The `fastplong` quality score statistics module is directly imported from `fastp`, and should work well.
Can you take a look at the FASTQ file? It can be easily found which one is correct.
Or can you share a piece of the data here?
I think the discrepancy is likely arising from the fact that onboard basecalls were done using FAST-calling to save time and `pycoQC` is looking at those stats. We redo the calls offline with `dorado` to get high accuracy, which is what `fastplong` looked at. I will need to run `fastplong` with onboard fastq files to verify.
Checked the onboard call Q scores against the external call Q scores. The Q scores are different so that explains original observation.
The download page is currently not available, is that possible?
(my wget command just hangs on 'awaiting response' )
The download worked albeit with a significant delay (think dial-up modem speed). Perhaps a temp glitch with the opengene host.
right, will give it another go. thanks
was finally able to download it via windows (after ignoring lots of secure remarks)
the cmdline wget kept on failing
You can install it with bioconda, or compile it from source.
bioconda .... no thanks :-)
I'll give the compile a try.
(though I was already able to get the pre-compiled one, cfr above)
Hey, just a curiosity question. seems like you don't prefer bioconda.
May I ask why is it like that ? I use bioconda so may be you might have noticed something which i didn't feel but will be helpful to me too.
it's just too much overhead ... (too much things I don't need at that moment, too much storage usage ... )