Additional sequences in adapter file cause trimming problem?
1
0
Entering edit mode
7.4 years ago
Anand Rao ▴ 640

I have a simple question: Will inclusion of adapter sequences NOT used in library prep have detrimental effects, if included in the adapter reference sequence file, during adapter trimming?

My question is based on the following observations of using 2 separate adapter sequences files, using Trimmomatic. The 2 different adapter sequence file details are as follows: file 1 - adapters.fa - see http://textuploader.com/d6qwq file 2 - PC_adapters.fa - see http://textuploader.com/d6qwg

and everything else remaining the same in terms of the common Trimmomatic syntax as shown below:

java -jar /home/aksrao/Trimmomatic/Trimmomatic-0.36/trimmomatic-0.36.jar PE  -threads 8 -phred33 -trimlog EthFoc-11_PCadapters_Trimmomatic2.log EthFoc-11.S285_L007.1.txt  EthFoc-11.S285_L007.2.txt -baseout EthFoc11_PC_Trimmomatic.fq ILLUMINACLIP:/home/aksrao/Trimmomatic/Trimmomatic-0.36/adapters/adapters.fa:1:30:10 LEADING:30 TRAILING:30 SLIDINGWINDOW:24:30 MINLEN:72

versus

java -jar /home/aksrao/Trimmomatic/Trimmomatic-0.36/trimmomatic-0.36.jar PE  -threads 8 -phred33 -trimlog EthFoc-11_PC_Trimmomatic.log EthFoc-11.S285_L007.1.txt  EthFoc-11.S285_L007.2.txt -baseout EthFoc11_PC_Trimmomatic.fq ILLUMINACLIP:/home/aksrao/Trimmomatic/Trimmomatic-0.36/adapters/PC_adapters.fa:1:30:10 LEADING:30 TRAILING:30 SLIDINGWINDOW:24:30 MINLEN:72

you can see, the Trimmomatic results are quite different.

"ILLUMINACLIP: Using 2 prefix pairs, 17 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences Input Read Pairs: 16596362 Both Surviving: 10711318 (64.54%) Forward Only Surviving: 4456245 (26.85%) Reverse Only Surviving: 148485 (0.89%) Dropped: 1280314 (7.71%) TrimmomaticPE: Completed successfully"

versus

"ILLUMINACLIP: Using 0 prefix pairs, 2 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences Input Read Pairs: 16596362 Both Surviving: 14158405 (85.31%) Forward Only Surviving: 1010700 (6.09%) Reverse Only Surviving: 247697 (1.49%) Dropped: 1179560 (7.11%) TrimmomaticPE: Completed successfully"

Now, moving on to comparison between these 2 adapter files, but using bbduk, from BBtools (sorry genomax, forgot to include this earlier). First, using file 1 - adapters.fa

java -Djava.library.path=/share/apps/bbmap-36-67/jni/ -ea -Xmx26322m -Xms26322m -cp /share/apps/bbmap-36-67/current/ jgi.BBDukF in1=EthFoc-11.S285_L007.1.txt in2=EthFoc-11.S285_L007.2.txt out1=EthFoc11_bbduk_2adapters_1.fq out2=EthFoc11_bbduk_2adapters_2.fq outm1=EthFoc11_bbduk_2adapters_1_MATCH.fq outm2=EthFoc11_bbduk_2adapters_2_MATCH.fq ref=adapters.fa ktrim=r k=21 rcomp=f mink=9 hdist=1 hdist2=0 tpe tbo ftm=5 qtrim=rl trimq=20 maq=20 minlen=41 stats=bbduk_TileFilter_AdaptTrim_QCtrim_QCfilter_stats.txt

Executing jgi.BBDukF [in1=EthFoc-11.S285_L007.1.txt, in2=EthFoc-11.S285_L007.2.txt, out1=EthFoc11_bbduk_2adapters_1.fq, out2=EthFoc11_bbduk_2adapters_2.fq, outm1=EthFoc11_bbduk_2adapters_1_MATCH.fq, outm2=EthFoc11_bbduk_2adapters_2_MATCH.fq, ref=adapters.fa, ktrim=r, k=21, rcomp=f, mink=9, hdist=1, hdist2=0, tpe, tbo, ftm=5, qtrim=rl, trimq=20, maq=20, minlen=41, stats=bbduk_TileFilter_AdaptTrim_QCtrim_QCfilter_stats.txt]

BBDuk version 36.67
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=26450m, free=25898m, used=552m

Added 17687 kmers; time:    0.116 seconds.
Memory: max=26450m, free=24656m, used=1794m

Input is being processed as paired
Started output streams: 0.013 seconds.
Processing time:        254.755 seconds.

Input:                      33192724 reads      4978908600 bases.
QTrimmed:                   8373860 reads (25.23%)  421643105 bases (8.47%)
FTrimmed:                   0 reads (0.00%)     0 bases (0.00%)
KTrimmed:                   8993356 reads (27.09%)  460954056 bases (9.26%)
Trimmed by overlap:         1040082 reads (3.13%)   4894880 bases (0.10%)
Low quality discards:       0 reads (0.00%)     0 bases (0.00%)
Total Removed:              2110362 reads (6.36%)   887492041 bases (17.83%)
Result:                     31082362 reads (93.64%)     4091416559 bases (82.17%)

Time:               254.922 seconds.
Reads Processed:      33192k    130.21k reads/sec
Bases Processed:       4978m    19.53m bases/sec

and file1=adapter.fa run's corresponsing stats file (you can see there are matches to adapter sequences, other than PCR_primer1_rc and PE2_rc, that were NOT used for library prep):

#File   EthFoc-11.S285_L007.1.txt   EthFoc-11.S285_L007.2.txt
#Total  33192724
#Matched    8843979 26.64433%
#Name   Reads   ReadsPct
PCR_Primer1_rc  4729074 14.24732%
PE2_rc  4091410 12.32623%
PCR_Primer2_rc  14043   0.04231%
FlowCell1   5741    0.01730%
TruSeq2_PE_f    622 0.00187%
PE1_rc  598 0.00180%
TruSeq2_SE  521 0.00157%
Trans1_rc   344 0.00104%
Trans1  338 0.00102%
FlowCell2   234 0.00070%
PrefixPE/1  225 0.00068%
PrefixPE/2  218 0.00066%
PrefixPE/2  159 0.00048%
TruSeq2_PE_r    135 0.00041%
Trans2  132 0.00040%
PrefixPE/1  98  0.00030%
Trans2_rc   87  0.00026%

Next, using file 2 - PC_adapters.fa

java -Djava.library.path=/share/apps/bbmap-36-67/jni/ -ea -Xmx26246m -Xms26246m -cp /share/apps/bbmap-36-67/current/ jgi.BBDukF in1=EthFoc-11.S285_L007.1.txt in2=EthFoc-11.S285_L007.2.txt out1=EthFoc11_bbduk_ALLadapters_1.fq out2=EthFoc11_bbduk_ALLadapters_2.fq outm1=EthFoc11_bbduk_ALLadapters_1_MATCH.fq outm2=EthFoc11_bbduk_ALLadapters_2_MATCH.fq ref=PC_adapters.fa ktrim=r k=21 rcomp=f mink=9 hdist=1 hdist2=0 tpe tbo ftm=5 qtrim=rl trimq=20 maq=20 minlen=41 stats=bbduk_TileFilter_AdaptTrim_QCtrim_QCfilter_stats2.txt overwrite=TRUE
Executing jgi.BBDukF [in1=EthFoc-11.S285_L007.1.txt, in2=EthFoc-11.S285_L007.2.txt, out1=EthFoc11_bbduk_ALLadapters_1.fq, out2=EthFoc11_bbduk_ALLadapters_2.fq, outm1=EthFoc11_bbduk_ALLadapters_1_MATCH.fq, outm2=EthFoc11_bbduk_ALLadapters_2_MATCH.fq, ref=PC_adapters.fa, ktrim=r, k=21, rcomp=f, mink=9, hdist=1, hdist2=0, tpe, tbo, ftm=5, qtrim=rl, trimq=20, maq=20, minlen=41, stats=bbduk_TileFilter_AdaptTrim_QCtrim_QCfilter_stats2.txt, overwrite=TRUE]

BBDuk version 36.67
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=26374m, free=25824m, used=550m

Added 3371 kmers; time:     0.144 seconds.
Memory: max=26374m, free=24585m, used=1789m

Input is being processed as paired
Started output streams: 1.648 seconds.
Processing time:        94.316 seconds.

Input:                      33192724 reads      4978908600 bases.
QTrimmed:                   8375144 reads (25.23%)  421831220 bases (8.47%)
FTrimmed:                   0 reads (0.00%)     0 bases (0.00%)
KTrimmed:                   8976040 reads (27.04%)  460682762 bases (9.25%)
Trimmed by overlap:         1040268 reads (3.13%)   4897764 bases (0.10%)
Low quality discards:       0 reads (0.00%)     0 bases (0.00%)
Total Removed:              2110256 reads (6.36%)   887411746 bases (17.82%)
Result:                     31082468 reads (93.64%)     4091496854 bases (82.18%)

Time:               96.121 seconds.
Reads Processed:      33192k    345.32k reads/sec
Bases Processed:       4978m    51.80m bases/sec

and this run's stats summary, using only 2 adapter sequences.

#File   EthFoc-11.S285_L007.1.txt   EthFoc-11.S285_L007.2.txt
#Total  33192724
#Matched    8820559 26.57377%
#Name   Reads   ReadsPct
TruSeq_Universal_Adapter_rc 4729103 14.24741%
TruSeq_Indexed_Primer   4091456 12.32636%

Which brings me back to the question whether casual inclusion of additional adapter sequences in the reference file, that were not used during library prep, can lead to spurious cases of trimming that will result in output file that is an artifact where both correct trimming AND additional, unnecessary, unwanted trimming has occurred?

FYI, I am looking at HiSeq 4000 generated data, from haploid fungal spores, generated using the Kapa Hyper Plus Library Prep Kit - https://www.kapabiosystems.com/assets/KAPA_Hyper_Plus_Kit_TDS_021715.pdf.

Thanks!

adapter trimming trimmomatic bbtools • 3.4k views
ADD COMMENT
0
Entering edit mode

I can't say anything about trimmomatic since I don't use it but with bbduk we use a single file with most commonly used commercial kit adapter sequences that @Brian includes. Have you tried using bbduk? With tbe tpo flags you can remove even single base contaminants from adapters in PE reads.

ADD REPLY
0
Entering edit mode

genomax - thanks for pointing that out, original post updated with bbduk results as well. You may respond to my comment below, if you wish. Thanks!

ADD REPLY
1
Entering edit mode

Have you checked to see how many optical duplicates you have in this dataset considering this is a 4K data (clumpify is your program there from BBMap). It is possible that some of these reads may be removed (if you eliminate optical dups) and then the trimming results would look more palatable.

ADD REPLY
0
Entering edit mode

I have to read up to make sure I understand "optical duplicates", did not know there was such a thing until you posted - thanks for raising this point.

If I understand what you saying, you recommend a quick clumpify run to empirically determine if my HiSeq4000 PE data has a lot of optical duplicates, right? Which is known to happen more with the new HiSeq3000/4000 platforms, than with earlier Illumina platforms - right?

So then, I'd use bbmerge.sh on my R1 & R2, then run clumpify to see what it reports, yes? I would not need to use bbsplitpairs.sh yet, correct?

ADD REPLY
0
Entering edit mode

Yes to questions in second para. Read about the optical dups and index misassignments here, here and here. It may be best to run clumpify on the original files. Is there a reason you are merging the two reads? Does ALLPATH needs the reads to be merged?

ADD REPLY
0
Entering edit mode

Thanks for those links!

May be merge is not the right technical, but interleaved. Gotta check if bbmerge does interleaving. This step appears necessary because clumpify.sh --help says

Note - Paired reads should be interleaved for both input and output, not split into two files.

ADD REPLY
1
Entering edit mode

reformat.sh does the interleaving but you do not need to have reads interleaved for clumpify. That was an old requirement.

ADD REPLY
0
Entering edit mode

As Genomax said, that was an old requirement, so please download the latest version. As for BBMerge, optical duplicate detection should be done on the raw (or adapter-trimmed) files, not merged reads (which generally works but will make the process unnecessarily complicated with a merged and unmerged dataset).

ADD REPLY
5
Entering edit mode
7.4 years ago

You will get optimal trimming results by only including adapter sequences that are actually being used. But often you don't know the sequences being used, so if you want to develop a universally applicable pipeline to use on various libraries, it's much easier to use all possible adapters. The difference in specificity (with BBDuk) tends to be very slight between just using the correct pair of adapters and all common adapters, as long as the adapter sequences (and K) are sufficiently long.

ADD COMMENT
0
Entering edit mode

Your description of behavior of bbduk matches my empirical observations using bbduk, now added to my updated post. Would you concur such is indeed the case?

I prefer correctness over ease, unless the tradeoff is horrible. In my case, based on my bbduk results, I am very inclined to use ONLY 2 primers TruSeq_Universal_Adapter_rc = PCR_Primer1_rc and TruSeq_Indexed_Primer = PE2_rc for my trimming steps.

And based on my Trimmomatic results, it appears that difference of 65% vs. 85% in "both surviving reads" is massive.

So, I feel gratuitous use of adapter sequence in the reference file while adapter trimming is likely to be injurious to downstream processes. This is not a criticism of your advice to cast a wider net at first, but I feel a tighter filter is warranted once the identities of these adapter sequences are 'discovered'.

What do you opine?

ADD REPLY
1
Entering edit mode

Your data matches my experience; there's a very slight over-trimming incurred by including unrelated adapter sequences.

I would certainly recommend, as best practice, that you only use the specific adapter sequences actually in your library, if you know them. The difference is only 0.01% of bases in this case, but we don't know the distribution so it could theoretically impact downstream analysis (though I suspect that would be extremely rare). A more substantial advantage is that with a smaller list of adapter sequences, you can more safely increase sensitivity (e.g. by increasing hdist to 2 or reducing K) when you need to for a low-quality library.

The biggest problem with this approach is automating the process of correctly choosing only the actual adapter sequences in a robust fashion, but as long as you are performing the steps manually, that's immaterial.

I have no idea what's going on with Trimmomatic, though. I wonder if it is reducing the length of all adapter sequences to match the shortest one in the list, or something similar, such that including a single short adapter in the list causes problems.

ADD REPLY
0
Entering edit mode

Yeah, I am not sure about the reason(s) behind Trimmomatic 85% vs 65% retention with different adapter sequences.

As such, I am happy with the logical consistency of results from bbduk, the ease with which I can understand the results, and alacrity of the software author, i.e. you, in helping out. So apart from recreating results by my colleague, I think I wont be using Trimmomatic, since I now much prefer bbduk over other trimming tools... Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 2072 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6