An Existing Tool For Trimming Poly-A/T Tails Containing Single-Nucleotide Errors, Using A Sliding Window Approach?
11.0 years ago
Chronos ▴ 620

Several tools can trim homopolymer tails from reads, but most of them seem to look only for an uninterrupted stretch of the same nucleotide and stop trimming at the first different nucleotide. I'm using prinseq-lite.pl for trimming; here are two FastQC graphs, before and after tail trimming:

[Figure: FastQC graph before tail trimming] [Figure: FastQC graph after tail trimming]

As the second graph shows, there is still a significant oligo-A enrichment between positions 50 and 60-75. Manual examination shows that the poly-A tails contain single-nucleotide errors, which cause trimming to stop.
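To make the failure mode concrete, here is a minimal sketch of exact-run tail trimming. The function name `trim_exact_tail` and its `min_len` parameter are illustrative assumptions and are not taken from prinseq-lite; the point is only that a single mismatch inside the tail ends the trim early.

```python
def trim_exact_tail(seq, base="A", min_len=5):
    """Strip an uninterrupted run of `base` from the 3' end (hypothetical helper)."""
    end = len(seq)
    # Walk left while the run of `base` is unbroken.
    while end > 0 and seq[end - 1] == base:
        end -= 1
    run = len(seq) - end
    # Only trim when the run is long enough to count as a tail.
    return seq[:end] if run >= min_len else seq


# A 17-base tail with one G error in the middle.
read = "ACGTGCTAGC" + "A" * 8 + "G" + "A" * 8
print(trim_exact_tail(read))
```

Only the last eight As are removed here; the G and the eight As before it stay attached to the read, which is the kind of residual oligo-A signal described above.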

Does anyone know of a FASTQ read-trimming tool/script (preferably multi-threaded) which looks for A/T content in a sliding window, and stops trimming when this content falls below a certain threshold? I feel most of the remaining oligo-As would be removed using window length around 10 and threshold around 0.89.
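The idea can be sketched in a few lines of Python. This is only one possible reading of the request, not an existing tool: the name `trim_polya_window` and the parameters `window` and `min_frac` are assumptions chosen to match the window length of 10 and threshold of 0.89 mentioned above. The window starts at the 3' end and moves left while its fraction of the tail base stays at or above the threshold; leading non-tail bases of the last passing window are then given back so that genuine sequence is not removed.

```python
def trim_polya_window(seq, base="A", window=10, min_frac=0.89):
    """Remove a 3' tail of `base`, tolerating occasional errors (illustrative sketch)."""
    cut = len(seq)             # index where the tail starts; nothing trimmed yet
    start = len(seq) - window  # left edge of the 3'-most window
    # Move the window left while it is still rich enough in `base`.
    while start >= 0 and seq[start:start + window].count(base) >= min_frac * window:
        cut = start
        start -= 1
    # The last passing window may begin with non-tail bases; keep those.
    while cut < len(seq) and seq[cut] != base:
        cut += 1
    return seq[:cut]


read = "ACGTGCTAGC" + "A" * 8 + "G" + "A" * 8
print(trim_polya_window(read))
```

Reads shorter than the window are returned unchanged, and for FASTQ input the quality string would need to be cut to the same length as the trimmed sequence. Passing `base="T"` handles poly-T in the same way.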

The purpose of this is to preserve as many RNA-Seq reads as possible to use for patching gaps in the prokaryotic genome assembly.

fastq trimming • 9.0k views
11.0 years ago
SES 8.6k

You probably need to set the window size appropriately with prinseq because the default is 1, which means it stops at the first base where the trimming rules fail.

-trim_qual_window <integer>
        The sliding window size used to calculate quality score by type.
        To stop at the first base that fails the rule defined, use a
        window size of 1. [default: 1]

If you modify the window size and set the threshold:

-lc_method dust -lc_threshold 89

you can likely get closer to what you want. Alternatively, you can create a custom filter, for example:

-custom_params "AT 50%"

There are also options specifically for trimming poly-A/T regions:

-trim_tail_left <integer>
        Trim poly-A/T tail with a minimum length of trim_tail_left at
        the 5'-end.

-trim_tail_right <integer>
        Trim poly-A/T tail with a minimum length of trim_tail_right at
        the 3'-end.

It's hard to give an exact command for this type of task. Likely, you'll need to experiment with these options to find what you need. Also, try using the prinseq graphs to look at duplication across your reads. I'm not sure those FastQC summaries take all your reads into account, and using another method may be informative (I find those prinseq summaries to be very helpful).
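As a starting point for that experimentation, the sketch below assembles a prinseq-lite command line from Python so the tail length can be varied between runs. The helper `prinseq_tail_trim_cmd`, the file names, and the values 5 and 20 are placeholders, not recommendations; the options used (`-fastq`, `-trim_tail_left`, `-trim_tail_right`, `-min_len`, `-out_good`, `-out_bad`) should be checked against the help text of the installed prinseq-lite version.

```python
import shlex


def prinseq_tail_trim_cmd(fastq, out_prefix, tail_len=5, min_len=20):
    """Build an argument list for a poly-A/T trimming run (hypothetical helper)."""
    return [
        "prinseq-lite.pl",
        "-fastq", fastq,
        "-trim_tail_left", str(tail_len),   # minimum tail length at the 5'-end
        "-trim_tail_right", str(tail_len),  # minimum tail length at the 3'-end
        "-min_len", str(min_len),           # drop reads that end up too short
        "-out_good", out_prefix,
        "-out_bad", "null",                 # do not write the rejected reads
    ]


cmd = prinseq_tail_trim_cmd("reads.fastq", "reads_trimmed")
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.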


AFAIK,

  • trim_qual_window is only effective for quality trimming, not for poly-A/T trimming;
  • low complexity filters are only used to filter away (remove) reads, not to trim them;
  • custom_params is indeed an interesting option, but it is also listed in the filtering section, so I think it will discard reads instead of trimming them.

I wish I were wrong about the items above :)

Indeed, FastQC only examines the first 10k or 100k reads. It does evaluate read duplication, though. Assuming the top of the FASTQ file is a "random sample", this partial analysis should be quite close to the results from the full file.
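If the top of the file cannot be assumed to be a random sample, one option is to draw a uniform sample of reads and run the QC on that instead. The sketch below uses reservoir sampling and assumes plain four-line FASTQ records (no wrapped sequence lines); the function name `sample_fastq_records` and the fixed seed are illustrative choices.

```python
import random


def sample_fastq_records(lines, k, seed=0):
    """Reservoir-sample k four-line FASTQ records from an iterable of lines."""
    rng = random.Random(seed)
    reservoir = []
    record = []
    seen = 0
    for line in lines:
        record.append(line)
        if len(record) == 4:          # header, sequence, '+', qualities
            seen += 1
            if len(reservoir) < k:
                reservoir.append(record)
            else:
                # Keep the new record with probability k / seen.
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = record
            record = []
    return reservoir


# Small in-memory demonstration with 100 synthetic records.
demo = []
for i in range(100):
    demo += [f"@read{i}", "ACGT", "+", "IIII"]
picked = sample_fastq_records(demo, k=10)
print(len(picked))
```

For a real file, the same function can be fed an open file handle; it needs only one pass and holds just `k` records in memory.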


You are correct, the low complexity filters are used for filtering, but the -trim_tail_left and -trim_tail_right options will do poly-A/T trimming. In the order of processing, quality trimming and poly-A/T trimming will be done, then low complexity and duplicate filtering.


Right, trim_tail_left and trim_tail_right are the options I've used to get the second FastQC graph (the one after tail trimming).

Assuming there's no existing solution, I've written a multi-threaded script to perform the desired sliding-window trimming. I'll test it tomorrow morning at work and will then share.

11.0 years ago
Chronos ▴ 620

I've written the desired trimmer: https://bitbucket.org/qmentis/bioinformatics-scripts/src/566213e3da064e793857a7e06d9e77a43ec7ee28/sliding-window-polyA-trimmer.py?at=master

On my PC, it processes 1 million reads in a little under 20 seconds (10 million reads in 3 minutes).

Notes/bugs/features:

  • it hasn't been thoroughly tested ("seems to work"), so use at your own risk;
  • it only trims tails at the right-hand (3') end;
  • there is no support for paired reads;
  • read order is only preserved when running with --cpus 1;
  • it may trim 1 extra bp of the non-polyA sequence (with default options; custom options may trim up to cutoff bp);
  • it does not really scale well to multiple cores/CPUs.
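On the read-order and scaling notes: one common way to keep output in input order while working in parallel is to use an order-preserving map. The sketch below is not how the linked script works; `strip_tail` is a stand-in trimmer and `trim_in_order` is a hypothetical wrapper. It relies on `Executor.map` from `concurrent.futures`, which returns results in the order of its inputs regardless of which worker finishes first.

```python
from concurrent.futures import ThreadPoolExecutor


def strip_tail(seq):
    """Stand-in trimmer (assumption): remove an exact run of A from the 3' end."""
    return seq.rstrip("A")


def trim_in_order(seqs, executor, chunksize=1000):
    """Trim reads in parallel; results come back in input order."""
    return list(executor.map(strip_tail, seqs, chunksize=chunksize))


reads = ["ACGTAAAA", "GGCCAA", "TTGA", "CCCC"]
with ThreadPoolExecutor(max_workers=2) as pool:
    print(trim_in_order(reads, pool))
```

Threads are used here only so the example runs anywhere. For CPU-bound trimming in CPython a `ProcessPoolExecutor` is the more likely choice, and there `chunksize` matters: sending reads to workers in batches rather than one at a time reduces inter-process overhead, which is often what limits scaling in scripts of this kind.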

Here's the FastQC graph after applying this script (window = 10, cutoff = 2) to the results of prinseq-lite (2nd graph from the question).

[Figure: FastQC graph after trimming with the script]

Here's the relative enrichment of the 'AAAAA' 5-mer before (1st line) and after (2nd line) the sliding window poly-A trimming:

Stage            Sequence  Count    Obs/Exp Overall  Obs/Exp Max  Max Obs/Exp Position
Before trimming  AAAAA     2814030  18.359743        89.11679     50-54
After trimming   AAAAA     1383040  9.538634         74.362335    50-54
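The AAAAA count roughly halves (2814030 to 1383040). For checking this kind of enrichment outside FastQC, here is a rough sketch of an observed/expected ratio for a single k-mer. It is not FastQC's own calculation: the expectation is simply the product of overall base frequencies times the number of k-mer positions, and the function name `kmer_obs_exp` is an assumption.

```python
from collections import Counter


def kmer_obs_exp(seqs, kmer):
    """Observed/expected ratio of `kmer`, expectation from overall base frequencies."""
    k = len(kmer)
    base_counts = Counter()
    observed = 0
    positions = 0
    for s in seqs:
        base_counts.update(s)
        n = len(s) - k + 1            # number of k-mer start positions in this read
        if n <= 0:
            continue
        positions += n
        observed += sum(1 for i in range(n) if s[i:i + k] == kmer)
    total = sum(base_counts.values())
    if total == 0:
        return float("nan")
    prob = 1.0
    for b in kmer:
        prob *= base_counts[b] / total
    expected = prob * positions
    return observed / expected if expected else float("nan")


before = ["ACGTACGT" + "A" * 8]
after = ["ACGTACGT"]
print(kmer_obs_exp(before, "AAAAA"), kmer_obs_exp(after, "AAAAA"))
```

Running it on the reads before and after trimming gives a quick sanity check that the enrichment has moved in the expected direction, without relying on a subsample of the file.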
