Hi all,
Usually get on well with Trimmomatic but am having issues removing adapters from some paired-end RNA-seq data. Here's my process so far:
1) Put together a PE adapter fasta for the adapters described by the sequencing company; ran trimmomatic; no reads dropped.
2) Ran Trim_galore! to automatically detect adapters. It detected the Illumina sequence AGATCGGAAGAGC
in ~30% of reads.
3) I grepped for this sequence in the raw FASTQs (and used flag -B 1
to also retrieve the metadata line). Sure enough, I could see this sequence in the grepped fastq entries, plus 20 bp following that were different for Read 1's vs Read 2's. (Images 1 and 3 respectively, see highlighted regions for an example.)
4) I searched the Illumina adapters PDF for these sequences, and they turn out to be a perfect match for the "IDT for Illumina TruSeq DNA and RNA UD Indexes" (see image 2). I then confirmed with the sequencing company that this was what they had in fact used - yes, the ones they initially sent me were wrong. Great, back to trimmomatic then!
5) Ran trimmomatic looking for these adapters (which I had confirmed with my own eyes were present in at least a handful of reads - hopefully you can confirm this for yourself using the screenshots provided). Zero reads dropped. I tried every combination of the adapter sequences I could think of, including complements with non-complements, read 2 adapters taking the place of read 1 adapters, etc, etc. Still zero reads dropped. (See below for a list of all pairs of adapter sequences I submitted to trimmomatic.)
What am I doing wrong? I will likely just accept the results of trim_galore and move on, but this is driving me crazy as I can't see any reason why it isn't working.
ADAPTER SEQUENCES TRIALLED: (formatting being mucked around by BioStars, the "Prefix.../1" was in fact on a separate line to the nucleotides and there were angle brackets (greater-than-signs) at the start of each "Prefix..." line.)
>PrefixIllumina1/1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>PrefixIllumina1/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>PrefixIllumina2/1
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>PrefixIllumina2/2
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>PrefixIllumina3/1
TCTAGCCTTCTCGTGTGCAGACTTGAGGTCAGT
>PrefixIllumina3/2
TCTAGCCTTCTCGCAGCACATCCCTTTCTCACA
>PrefixIllumina4/1
TCTAGCCTTCTCGCAGCACATCCCTTTCTCACA
>PrefixIllumina4/2
TCTAGCCTTCTCGTGTGCAGACTTGAGGTCAGT
>PrefixIlluminaOne/1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>PrefixIlluminaOne/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>PrefixIlluminaTwo/1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>PrefixIlluminaTwo/2
TCTAGCCTTCTCGCAGCACATCCCTTTCTCACA
>PrefixIlluminaThree/1
TCTAGCCTTCTCGTGTGCAGACTTGAGGTCAGT
>PrefixIlluminaThree/2
TCTAGCCTTCTCGCAGCACATCCCTTTCTCACA
>PrefixIlluminaFour/1
TCTAGCCTTCTCGTGTGCAGACTTGAGGTCAGT
>PrefixIlluminaFour/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT`
If you are willing to try an alternate program I suggest using
bbduk
from BBMap suite orfastp
. A guide to use bbduk is available.Thank you GenoMax, I may well give it a try if this banging-head-against-wall issue persists!
Simply use the
adapters.fa
file provided in the bbmap inresource
directory as source for adapters.Hi all,
Thanks so much for all your help. I've been doing some more testing and have figured out a few things / learned them from your posts: 1) Originally trimmomatic wasn't reporting any dropped reads because none should have been dropped: they were trimmed, but were still >0 bp so were retained 2) However, even when I included a MINLEN:150 function in trimmomatic (which should make dropped reads = trimmed reads), I was still able to grep out many tens of thousands of reads with obvious adapter contamination. 3) GenoMax was able to remove these using bbduk, while Istvan was able to remove these using a single simple adapter file in trimmomatic 4) I have since also been able to remove these using trimmomatic in palindromic mode. I read in the trimmomatic manual that "adapter sequences are 'in silico ligated' onto the start of the reads" for palindrome mode. As a result, one has to provide trimmomatic with the reverse complement of the adapter contamination sequences that one can see in the raw fastq. This is backed up by comparing the adapter fastas on trimmomatic's GitHub page with those found in the Illumina adapters PDF - they are indeed reverse complements of one another. 5) So, I now have trimmomatic palindromic trimming working. For thoroughness, I also ran
trim_galore
,fastp
, andbbduk
... I assumed they would all come out with vaguely similar answers now. Absolutely not...So one problem down and another has cropped up. What could be causing these drastic differences?
Supplied or detected adapters for each software:
Important details:
/
in the adapter name!In general, for standard Illumina adapters usually, it is sufficient to list the beginning of the adapter sequence, it will trim the rest accordingly:
if you use this adapter then all tools should produce similar results (when invoked to operated in a similar fashion)
Hi Istvan,
Thanks for your time on all parts of this query. Your advice above is really appreciated, I'll make sure in particular to avoid mixing different adapter sequences when comparing tools.
To make this a fair comparison you need to supply the same target sequence for all software so you may want to re-do all results using the trimmomatic oligos.
BBduk (and other programs) have many options that can affect end-result e.g. if you have paired end data you need to use
tbo tpe
options to remove contaminants as small as one bp. There are default values for all parameters that may not always be appropriate but are going to be applied when you run the program. This can affect end results.I have a feeling you are making this more complicated that it really is.
As you may have realized by now, there is a core sequence that is common for all illumina multiplexed adapters. Once trimming programs find this sequence they will trim all sequence to the 3'-end of that core.
With bbduk use the following command:
OR
if you want to provide adapter sequences
Hi GenoMax,
You're right, I'm realising there are a lot of other parameters and filtering steps 'under the hood' that are obfuscating and invalidating the results of my tests.
With your suggested code here I got a much more sensible answer from bbduk. It seems the k=N parameter has quite a large effect on the result!
Still unsure whether I'll go for fastp or bbduk in the end, but I'm a little disillusioned with trimmomatic now!
Provide complete command line you are using.
Trimmomatic not called using usual Java command as it has been set up on my institute's cluster to be called like a bash command. Have previously used this installation of trimmomatic and has worked a charm. It's containerised in Singularity so cannot have been tampered with since I last used it. ("illumina_adapters2.fa" contained the first 8 adapters (4 pairs) listed above, I also ran this command again using a second file called "illumina_adapters3.fa" that contained the remaining adapter pairs listed above.)
Sorry for the awful formatting... I can't seem to get code to wrap nicely on here.
Use the
101010
button after highlighting text you want to format as code. I have done it for you now.This was the output from trimmomatic: