Several tools exist for trimming poly-A/T tails from reads, but most of them seem to look only for a continuous stretch of the same nucleotide and stop trimming as soon as a different nucleotide is encountered. I'm using prinseq-lite.pl for trimming, and here are two graphs from FastQC, before and after tail trimming:
As can be seen from the second graph, there is still a significant oligo-A enrichment between positions 50 and 60-75. Manual examination shows that the poly-A tails simply contain single-nucleotide errors, which cause trimming to stop.
Does anyone know of a FASTQ read-trimming tool/script (preferably multi-threaded) which looks at A/T content in a sliding window, and stops trimming when this content falls below a certain threshold? I feel most of the remaining oligo-As would be removed using a window length of around 10 and a threshold of around 0.89.
The purpose of this is to preserve as many RNA-Seq reads as possible to use for patching gaps in the prokaryotic genome assembly.
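To make the idea concrete, here is a minimal single-threaded Python sketch of the sliding-window trimming described above. It is only an illustration under my own assumptions, not the behaviour of prinseq-lite.pl or any other existing tool: the function names `trim_poly_tail` and `trim_fastq_record` are hypothetical, only the 3' poly-A tail is handled, and the window of 10 and threshold of 0.89 are the values suggested in the question.

```python
def trim_poly_tail(seq, base="A", window=10, threshold=0.89):
    """Return a cut index so that seq[:cut] is the read without its 3' tail.

    A window slides from the 3' end towards the 5' end. Trimming continues
    while the fraction of `base` in the window stays at or above `threshold`,
    so isolated sequencing errors inside the tail do not stop it.
    """
    n = len(seq)
    if n < window:
        return n  # too short to evaluate a full window; leave untouched

    cut = n
    start = n - window
    hits = seq[start:n].count(base)  # number of `base` in the current window
    while start >= 0 and hits >= threshold * window:
        cut = start
        start -= 1
        if start >= 0:
            # Shift the window one position left: add the new base, drop the old one.
            hits += (seq[start] == base) - (seq[start + window] == base)

    # The last accepted window may begin with non-tail bases; give them back.
    while cut < n and seq[cut] != base:
        cut += 1
    return cut


def trim_fastq_record(seq, qual, **kwargs):
    """Trim sequence and quality string of one FASTQ record consistently."""
    cut = trim_poly_tail(seq.upper(), **kwargs)
    return seq[:cut], qual[:cut]


# A tail with a single-nucleotide error (the G) is still removed completely.
read = "GCTGCTGCTC" + "AAAAAGAAAAAAAA"
print(read[:trim_poly_tail(read)])  # GCTGCTGCTC
```

With a window of 10, a threshold of 0.89 effectively means that at most one non-A base is tolerated per window. A 5' poly-T tail could be handled the same way by running the function on the reversed read with `base="T"`, and since each read is processed independently, the work could be spread over several processes.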
AFAIK,
I wish I were wrong about the items above :)
Indeed, FastQC only examines the first 10k or 100k reads. It does evaluate read duplication, though. Assuming the top of the FASTQ file is a "random sample", this partial analysis should be quite close to the results from the full file.
You are correct, the low complexity filters are used for filtering, but the -trim_tail_left and -trim_tail_right options will do poly-A/T trimming. In the order of processing, quality trimming and poly-A/T trimming will be done, then low complexity and duplicate filtering.
Right, trim_tail_left and trim_tail_right are the options I've used to get the second FastQC graph (the one after tail trimming).
Assuming there's no existing solution, I've written a multi-threaded script to perform the desired sliding-window trimming. I'll test it tomorrow morning at work and will then share.