Trimmomatic: Trimming only custom adapter sequences
1
6
Entering edit mode
10.3 years ago

Hi,

I am attempting to trim Illumina RNA-seq data (paired-end) I downloaded from NCBI in SRA format. I have converted the .sra file into two .fastq using fastQ-dump. I then FastQC'd both files, which indicated there were adapters/primers present:

#Sequence       Count   Percentage      Possible Source
CGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCG      1108938 10.945848187304692      Illumina Paired End PCR Primer 2 (100% over 31bp)
CAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACC      85099   0.839975485456754       Illumina Paired End PCR Primer 2 (100% over 36bp)
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGT      29724   0.2933927699469625      Illumina Paired End PCR Primer 2 (100% over 50bp)
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGT      23964   0.23653829696571824     Illumina Paired End PCR Primer 2 (98% over 50bp)
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAG      10698   0.10559533888079009     Illumina Paired End PCR Primer 2 (97% over 36bp)

I have been told to use Trimmomatic, but I am struggling getting it to work. Could someone please guide as to how to only remove the above sequences (no quality trimming etc)? i.e Only run the ILLUMINACLIP process and how to specify a custom list of adapters/primers to be trimmed.

Also, will Trimmomatic work on 454 data? If not, what would be a suitable alternative?

This was my previous attempt:

java -jar /exports/software/trimmomatic/trimmomatic-0.32/trimmomatic-0.32.jar PE \
  -threads 24 \
  -phred33 \
  -trimlog trimlog.txt \
  SRR1505105_1.fastq SRR1505105_2.fastq \
  output1P.fa output1U.fa output2P.fa output2U.fa \
  ILLUMINACLIP:~/lewis_adapters.fa

This did not work and produced the following error:

TrimmomaticPE: Started with arguments: -threads 24 -phred33 -trimlog trimlog.txt SRR1505105_1.fastq SRR1505105_2.fastq output1P.fa output1U.fa output2P.fa output2U.fa ILLUMINACLIP:~/lstevens_adaptors.fa
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at org.usadellab.trimmomatic.trim.IlluminaClippingTrimmer.makeIlluminaClippingTrimmer(IlluminaClippingTrimmer.java:53)
    at org.usadellab.trimmomatic.trim.TrimmerFactory.makeTrimmer(TrimmerFactory.java:27)
    at org.usadellab.trimmomatic.TrimmomaticPE.run(TrimmomaticPE.java:495)
    at org.usadellab.trimmomatic.Trimmomatic.main(Trimmomatic.java:35)

Many thanks,
Lewis

next-gen sequencing RNA-Seq • 29k views
ADD COMMENT
0
Entering edit mode

Hello to anyone new to bioinformatics who is reading this to try and solve a problem trimming with trimmomatic. I wish I had read this post a week ago and saved myself many headaches. Please save yourself time and go use trim-galore (which uses cutadapt as part of the process) or just cutadapt (or another alternative like bbtrim, which is supposed to work well too). I did finally get trimmomatic to work (many hours and finally help from an IT person). It was a big waste because it runs VERY slowly (even on a high powered compute system) and it did not give consistent output. A lot of the adapters were not removed when I got the reads and ran them on fastqc afterwards.

I then ran everything using trim-galore (installed with miniconda 3 from the bioconda channel) and got VERY fast and very clean reads that were also trimmed properly for quality and already run through fastqc (you can choose that option easily). I also found out it is easy to add an additional unique adapter sequence to trim out at the same time, and I ran trim-galore on a different dataset (produced by a flavor of radseq) and was able to get rid of the regular illumina adapters and the unique adapter.

ADD REPLY
12
Entering edit mode
10.3 years ago
Dan D 7.4k

If you've already made a custom FASTA file for your adapters, can you post it, or a portion of it?

Another question: do you get any output at all from the command you supplied?

Your command looks OK as far as I can tell. Perhaps it doesn't like your supplied FASTA?

From the documentation:

ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>

  • fastaWithAdaptersEtc: specifies the path to a fasta file containing all the adapters, PCR sequences etc. The naming of the various sequences within this file determines how they are used. See below.
    • seedMismatches: specifies the maximum mismatch count which will still allow a full match to be performed
  • palindromeClipThreshold: specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
  • simpleClipThreshold: specifies how accurate the match between any adapter etc. sequence must be against a read.

and here's the relevant portion from the "see below" reference:

To make a custom version of fasta, you must first understand how it will be used. Trimmomatic uses two strategies for adapter trimming: Palindrome and Simple

With 'simple' trimming, each adapter sequence is tested against the reads, and if a sufficiently accurate match is detected, the read is clipped appropriately.

'Palindrome' trimming is specifically designed for the case of 'reading through' a short fragment into the adapter sequence on the other end. In this approach, the appropriate adapter sequences are 'in silico ligated' onto the start of the reads, and the combined adapter+read sequences, forward and reverse are aligned. If they align in a manner which indicates 'read-through', the forward read is clipped and the reverse read dropped (since it contains no new data).

Naming of the sequences indicates how they should be used. For 'Palindrome' clipping, the sequence names should both start with 'Prefix', and end in '/1' for the forward adapter and '/2' for the reverse adapter. All other sequences are checked using 'simple' mode. Sequences with names ending in '/1' or '/2' will be checked only against the forward or reverse read. Sequences not ending in '/1' or '/2' will be checked against both the forward and reverse read. If you want to check for the reverse-complement of a specific sequence, you need to specifically include the reverse-complemented form of the sequence as well, with another name.

Personally, I recommend cutadapt, followed by quality trim with sickle. If you hit a wall with trimmomatic, perhaps cutadapt is worth a look?

EDIT: to answer your latter question, I don't see why trimmomatic wouldn't work with 454 data in FASTQ format, as long as you correctly specify your adapters.

ADD COMMENT
0
Entering edit mode

Hi Deedee,

It was my indeed my supplied FASTA that it didn't like. I managed to get it to work.

Thanks a lot for the in-depth answer! I have since spoke to some colleagues who also recommended cutadapt. Looks like that is the better option.

Thanks,
Lewis

ADD REPLY
0
Entering edit mode

Thanks for the follow-up. I'm glad to hear you got it sorted out.

ADD REPLY
0
Entering edit mode

Can you tell me more how did you fix your FASTA?

ADD REPLY
0
Entering edit mode

Hi,

I think that I am having an issue with my FASTA file too. Running illluminaclip on my data seems to do nothing. What did you do to fix your FASTA file? What was wrong with it?

ADD REPLY
1
Entering edit mode

Hi,

Could you post your trimmomatic command line? Turns out it was not my FASTA file.


Creating the FASTA file should be fairly straightforward. To trim two sequences from unpaired (or unspecifically from paired) data:

>seq1
ACTATATGCCTGACTAGCATC
>seq2
ATCTATTGTGCAGCAGCA

To trim the first sequence from the forward reads only and the second from the reverse only, change this to:

>seq1/1
ACTATATGCCTGACTAGCATC
>seq2/2
ATCTATTGTGCAGCAGCA

The issue with my command line above is because I did not specify parameters after the supplied FASTA.

ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>

e.g. ILLUMINACLIP:adaptor_file.fasta:2:30:10

  • seedMismatches: specifies the maximum mismatch count which will still allow a full match to be performed
  • palindromeClipThreshold: specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
  • simpleClipThreshold: specifies how accurate the match between any adapter etc. sequence must be against a read.

The thresholds used are a simplified log-likelihood approach. Each matching base adds just over 0.6, while each mismatch reduces the alignment score by Q/10. Therefore, a perfect match of a 12 base sequence will score just over 7, while 25 bases are needed to score 15. As such we recommend values between 7 - 15 for this parameter. For palindromic matches, a longer alignment is possible - therefore this threshold can be higher, in the range of 30. The 'seed mismatch' parameter is used to make alignments more efficient, specifying the maximum base mismatch count in the 'seed' (16 bases). Typical values here are 1 or 2.


Hope this helps. Trimmomatic seems to be one of these programs that requires about 10 tries before it runs successfully. The values (2:30:10) are given as examples on the trimmomatic website and have worked well for my data. Slight tweaking might be required depending on your needs.

ADD REPLY
0
Entering edit mode

Following a discussion with my colleagues who have tested these trimming tools against data for which they know the truth, they advise using trimmomatic for quality trimming and cutadapt for adapter trimming. Trimmomatic seems to not fully remove adapters in some cases. I'd advise trying all combinations of approaches on your datasets to work out which is best.

ADD REPLY

Login before adding your answer.

Traffic: 2808 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6