Hi All,
I would appreciate some feedback on a curious issue I am running into.
I started to analyze what Fastx was doing during adapter clipping (fastx_clipper v0.0.13.2, run from the command line). I've spent most of the day comparing the original reads to the clipped reads. Most of the reads, before clipping, don't align well with the adapters, which means fastx_clipper is cutting out data in a weird way and may be introducing biases. I would think this is important to avoid when studying expression data. It also appears that the clipper only removes adapter sequence that begins at the adapter's 5' end; a random chunk from elsewhere in the adapter sitting at the 3' end of the actual sequence does not get clipped off.
I should also mention that I am aware that Fastx only clips from the 3' end of the sequence; the problems I am reporting are not due to confusion in that regard.
Galaxy uses this package for adapter clipping. This worries me because many people are using this tool.
Has anyone else on this forum had these sorts of problems? More importantly, can anyone recommend a tool to trim off adapters that they have extensively vetted?
I think it would help if you also included an example of pre/post clipping of reads that show the inconsistency. That being said I would not be surprised if the fastx toolkit had certain limitations. That's why there are so many alternatives.
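One way to produce that kind of pre/post example is sketched below in Python. It assumes 3'-only clipping, so the clipped read is a prefix of the original; the function names and the example reads are made up for illustration, and the adapter string is only a placeholder. It reports what was removed and how well that removed tail agrees with the 5' end of the adapter.

```python
def removed_tail(original, clipped):
    """Return the bases that were clipped off the 3' end of a read."""
    if not original.startswith(clipped):
        raise ValueError("clipped read is not a 5' prefix of the original")
    return original[len(clipped):]


def adapter_identity(tail, adapter):
    """Fraction of compared positions where the tail equals the adapter's 5' end."""
    overlap = min(len(tail), len(adapter))
    if overlap == 0:
        return 0.0
    # zip() stops at the shorter sequence, so only the overlap is compared.
    matches = sum(1 for x, y in zip(tail, adapter) if x == y)
    return matches / overlap


# Placeholder adapter and invented reads, for illustration only.
adapter = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG"
insert = "TTGACCATGCAGGTCA"

true_hit = insert + adapter[:12]       # read that really ends in adapter
false_hit = insert + "GATTACAGGCAT"    # read whose tail is not adapter

for original in (true_hit, false_hit):
    tail = removed_tail(original, insert)
    print(original, tail, round(adapter_identity(tail, adapter), 2))
```

A low identity for many clipped reads would be the kind of concrete evidence that makes the problem easy for others to reproduce.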
The adapters I was trying to clip were the standard universal adapter and the indexed adapters. I just received a response from Gordon; see below:
"The fastx-clipper was designed to work with short reads (e.g 36nt or 50nt), and be very-sensitive (and somewhat less specific) - it will not perform well with longer reads (your 101nt FASTA files).
I'd recommend trying other clipping programs (e.g. "cutadapt")."
Without some examples, it's hard to get an idea of what problems you are having. However, one problem that once caught me off guard involved adapters whose 5' end was rather homogeneous, which I was clipping with the default fastx_clipper parameters.
A simplified explanation of what bit me:
Imagine that the adapters were AAAAAANNNNNNNNN or so. Now, if a read was A-rich in the middle (but not due to adapter contamination), fastx_clipper would find a strong match of the 5' end of the adapter to this region and clip the rest of the read off.
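The effect can be sketched with a toy example in Python. To be clear, this is not fastx_clipper's actual algorithm, and the stricter variant is not cutadapt's either; both functions, their parameters, the adapter, and the reads are invented for illustration. The first clipper only checks a short seed from the adapter's 5' end, while the second requires the whole available overlap to agree within an error rate.

```python
def mismatches(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def clip_seed_only(read, adapter, seed_len=6, max_mismatches=0):
    """Clip at the leftmost place where the adapter's 5' seed matches."""
    seed = adapter[:seed_len]
    for start in range(len(read) - seed_len + 1):
        if mismatches(read[start:start + seed_len], seed) <= max_mismatches:
            return read[:start]
    return read


def clip_full_overlap(read, adapter, min_overlap=6, max_error_rate=0.1):
    """Clip only if the whole read/adapter overlap matches within the error rate."""
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        errors = mismatches(read[start:start + overlap], adapter[:overlap])
        if errors <= max_error_rate * overlap:
            return read[:start]
    return read


# Made-up adapter with a homogeneous 5' end, as in the description above.
adapter = "AAAAAACGTCGTAGC"
genuine = "ACGTTGCAAAAAAAGGCTTACGGATCCGTA"    # A-rich middle, no adapter
contaminated = "ACGTTGCCTAGGAT" + adapter[:10]  # really ends in adapter

print(clip_seed_only(genuine, adapter))          # ACGTTGC (most of the read is lost)
print(clip_full_overlap(genuine, adapter))       # read is left intact
print(clip_full_overlap(contaminated, adapter))  # ACGTTGCCTAGGAT
```

In this toy setup the seed-only clipper throws away 23 of 30 genuine bases, while the full-overlap check still removes the real adapter from the contaminated read.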
I had initially blamed the funky results I was getting on the protocol. By chance, I later processed the same data using cutadapt (again, with the default parameters) and noticed that the artifact I had seen in my results was gone.
As already mentioned, it's hard to understand what's going on without more concrete examples from you, but I thought I'd provide this example here on the off chance that it might help.
I recently compared different programs for removing Illumina adaptors from 75 bp and 150 bp reads, and Trimmomatic performed the best for me. I measured performance with FASTQC and also considered the ability of the programs to run in parallel. I compared fastx, Trimmomatic and NGSQCToolkit_v2.3.
This is how I run it:
java -jar ~/tools/Trimmomatic-0.27/trimmomatic-0.27.jar SE -threads 1 -phred33 <input> <outputfile> ILLUMINACLIP:adaptor.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:30
And this is my adaptor file (FASTA format):
>adaptor
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
Trimmomatic also provides some files with the Illumina adaptors. In my case, my adaptor file was good enough to remove the adaptor contamination, which was small (less than 1%).
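For a rough sanity check of a contamination figure like that, independent of FASTQC, something along these lines could be used. It is a minimal Python sketch: the function names, the probe length `k`, and the tiny inline FASTQ records are all made up, and it only counts exact matches to the first `k` bases of the adaptor, so it will miss adaptor fragments with sequencing errors.

```python
def read_fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def adapter_fraction(sequences, adapter, k=12):
    """Fraction of reads containing an exact copy of the adapter's first k bases."""
    probe = adapter[:k]
    total = hits = 0
    for seq in sequences:
        total += 1
        if probe in seq:
            hits += 1
    return hits / total if total else 0.0


adaptor = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG"

# Four invented records; only the second one carries adaptor sequence.
fastq_text = """@r1
ACGTTGCCTAGGATCCGTAA
+
IIIIIIIIIIIIIIIIIIII
@r2
ACGTTGCCGATCGGAAGAGC
+
IIIIIIIIIIIIIIIIIIII
@r3
TTGACCATGCAGGTCATTGA
+
IIIIIIIIIIIIIIIIIIII
@r4
GGCTTACGGATCCGTAACGT
+
IIIIIIIIIIIIIIIIIIII
"""

fraction = adapter_fraction(read_fastq_sequences(fastq_text.splitlines()), adaptor)
print(f"{fraction:.2%} of reads contain the adaptor probe")
```

Running the same count on the input and output files gives a quick before/after comparison; for a real file, pass an open file handle instead of `fastq_text.splitlines()`.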