I have a simple question: Will inclusion of adapter sequences NOT used in library prep have detrimental effects, if included in the adapter reference sequence file, during adapter trimming?
My question is based on the following observations of using 2 separate adapter sequences files, using Trimmomatic. The 2 different adapter sequence file details are as follows: file 1 - adapters.fa - see http://textuploader.com/d6qwq file 2 - PC_adapters.fa - see http://textuploader.com/d6qwg
and everything else remaining the same in terms of the common Trimmomatic syntax as shown below:
java -jar /home/aksrao/Trimmomatic/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 8 -phred33 -trimlog EthFoc-11_PCadapters_Trimmomatic2.log EthFoc-11.S285_L007.1.txt EthFoc-11.S285_L007.2.txt -baseout EthFoc11_PC_Trimmomatic.fq ILLUMINACLIP:/home/aksrao/Trimmomatic/Trimmomatic-0.36/adapters/adapters.fa:1:30:10 LEADING:30 TRAILING:30 SLIDINGWINDOW:24:30 MINLEN:72
versus
java -jar /home/aksrao/Trimmomatic/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 8 -phred33 -trimlog EthFoc-11_PC_Trimmomatic.log EthFoc-11.S285_L007.1.txt EthFoc-11.S285_L007.2.txt -baseout EthFoc11_PC_Trimmomatic.fq ILLUMINACLIP:/home/aksrao/Trimmomatic/Trimmomatic-0.36/adapters/PC_adapters.fa:1:30:10 LEADING:30 TRAILING:30 SLIDINGWINDOW:24:30 MINLEN:72
you can see, the Trimmomatic results are quite different.
"ILLUMINACLIP: Using 2 prefix pairs, 17 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences Input Read Pairs: 16596362 Both Surviving: 10711318 (64.54%) Forward Only Surviving: 4456245 (26.85%) Reverse Only Surviving: 148485 (0.89%) Dropped: 1280314 (7.71%) TrimmomaticPE: Completed successfully"
versus
"ILLUMINACLIP: Using 0 prefix pairs, 2 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences Input Read Pairs: 16596362 Both Surviving: 14158405 (85.31%) Forward Only Surviving: 1010700 (6.09%) Reverse Only Surviving: 247697 (1.49%) Dropped: 1179560 (7.11%) TrimmomaticPE: Completed successfully"
Now, moving on to comparison between these 2 adapter files, but using bbduk, from BBtools (sorry genomax, forgot to include this earlier). First, using file 1 - adapters.fa
java -Djava.library.path=/share/apps/bbmap-36-67/jni/ -ea -Xmx26322m -Xms26322m -cp /share/apps/bbmap-36-67/current/ jgi.BBDukF in1=EthFoc-11.S285_L007.1.txt in2=EthFoc-11.S285_L007.2.txt out1=EthFoc11_bbduk_2adapters_1.fq out2=EthFoc11_bbduk_2adapters_2.fq outm1=EthFoc11_bbduk_2adapters_1_MATCH.fq outm2=EthFoc11_bbduk_2adapters_2_MATCH.fq ref=adapters.fa ktrim=r k=21 rcomp=f mink=9 hdist=1 hdist2=0 tpe tbo ftm=5 qtrim=rl trimq=20 maq=20 minlen=41 stats=bbduk_TileFilter_AdaptTrim_QCtrim_QCfilter_stats.txt
Executing jgi.BBDukF [in1=EthFoc-11.S285_L007.1.txt, in2=EthFoc-11.S285_L007.2.txt, out1=EthFoc11_bbduk_2adapters_1.fq, out2=EthFoc11_bbduk_2adapters_2.fq, outm1=EthFoc11_bbduk_2adapters_1_MATCH.fq, outm2=EthFoc11_bbduk_2adapters_2_MATCH.fq, ref=adapters.fa, ktrim=r, k=21, rcomp=f, mink=9, hdist=1, hdist2=0, tpe, tbo, ftm=5, qtrim=rl, trimq=20, maq=20, minlen=41, stats=bbduk_TileFilter_AdaptTrim_QCtrim_QCfilter_stats.txt]
BBDuk version 36.67
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=26450m, free=25898m, used=552m
Added 17687 kmers; time: 0.116 seconds.
Memory: max=26450m, free=24656m, used=1794m
Input is being processed as paired
Started output streams: 0.013 seconds.
Processing time: 254.755 seconds.
Input: 33192724 reads 4978908600 bases.
QTrimmed: 8373860 reads (25.23%) 421643105 bases (8.47%)
FTrimmed: 0 reads (0.00%) 0 bases (0.00%)
KTrimmed: 8993356 reads (27.09%) 460954056 bases (9.26%)
Trimmed by overlap: 1040082 reads (3.13%) 4894880 bases (0.10%)
Low quality discards: 0 reads (0.00%) 0 bases (0.00%)
Total Removed: 2110362 reads (6.36%) 887492041 bases (17.83%)
Result: 31082362 reads (93.64%) 4091416559 bases (82.17%)
Time: 254.922 seconds.
Reads Processed: 33192k 130.21k reads/sec
Bases Processed: 4978m 19.53m bases/sec
and file1=adapter.fa run's corresponsing stats file (you can see there are matches to adapter sequences, other than PCR_primer1_rc and PE2_rc, that were NOT used for library prep):
#File EthFoc-11.S285_L007.1.txt EthFoc-11.S285_L007.2.txt
#Total 33192724
#Matched 8843979 26.64433%
#Name Reads ReadsPct
PCR_Primer1_rc 4729074 14.24732%
PE2_rc 4091410 12.32623%
PCR_Primer2_rc 14043 0.04231%
FlowCell1 5741 0.01730%
TruSeq2_PE_f 622 0.00187%
PE1_rc 598 0.00180%
TruSeq2_SE 521 0.00157%
Trans1_rc 344 0.00104%
Trans1 338 0.00102%
FlowCell2 234 0.00070%
PrefixPE/1 225 0.00068%
PrefixPE/2 218 0.00066%
PrefixPE/2 159 0.00048%
TruSeq2_PE_r 135 0.00041%
Trans2 132 0.00040%
PrefixPE/1 98 0.00030%
Trans2_rc 87 0.00026%
Next, using file 2 - PC_adapters.fa
java -Djava.library.path=/share/apps/bbmap-36-67/jni/ -ea -Xmx26246m -Xms26246m -cp /share/apps/bbmap-36-67/current/ jgi.BBDukF in1=EthFoc-11.S285_L007.1.txt in2=EthFoc-11.S285_L007.2.txt out1=EthFoc11_bbduk_ALLadapters_1.fq out2=EthFoc11_bbduk_ALLadapters_2.fq outm1=EthFoc11_bbduk_ALLadapters_1_MATCH.fq outm2=EthFoc11_bbduk_ALLadapters_2_MATCH.fq ref=PC_adapters.fa ktrim=r k=21 rcomp=f mink=9 hdist=1 hdist2=0 tpe tbo ftm=5 qtrim=rl trimq=20 maq=20 minlen=41 stats=bbduk_TileFilter_AdaptTrim_QCtrim_QCfilter_stats2.txt overwrite=TRUE
Executing jgi.BBDukF [in1=EthFoc-11.S285_L007.1.txt, in2=EthFoc-11.S285_L007.2.txt, out1=EthFoc11_bbduk_ALLadapters_1.fq, out2=EthFoc11_bbduk_ALLadapters_2.fq, outm1=EthFoc11_bbduk_ALLadapters_1_MATCH.fq, outm2=EthFoc11_bbduk_ALLadapters_2_MATCH.fq, ref=PC_adapters.fa, ktrim=r, k=21, rcomp=f, mink=9, hdist=1, hdist2=0, tpe, tbo, ftm=5, qtrim=rl, trimq=20, maq=20, minlen=41, stats=bbduk_TileFilter_AdaptTrim_QCtrim_QCfilter_stats2.txt, overwrite=TRUE]
BBDuk version 36.67
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=26374m, free=25824m, used=550m
Added 3371 kmers; time: 0.144 seconds.
Memory: max=26374m, free=24585m, used=1789m
Input is being processed as paired
Started output streams: 1.648 seconds.
Processing time: 94.316 seconds.
Input: 33192724 reads 4978908600 bases.
QTrimmed: 8375144 reads (25.23%) 421831220 bases (8.47%)
FTrimmed: 0 reads (0.00%) 0 bases (0.00%)
KTrimmed: 8976040 reads (27.04%) 460682762 bases (9.25%)
Trimmed by overlap: 1040268 reads (3.13%) 4897764 bases (0.10%)
Low quality discards: 0 reads (0.00%) 0 bases (0.00%)
Total Removed: 2110256 reads (6.36%) 887411746 bases (17.82%)
Result: 31082468 reads (93.64%) 4091496854 bases (82.18%)
Time: 96.121 seconds.
Reads Processed: 33192k 345.32k reads/sec
Bases Processed: 4978m 51.80m bases/sec
and this run's stats summary, using only 2 adapter sequences.
#File EthFoc-11.S285_L007.1.txt EthFoc-11.S285_L007.2.txt
#Total 33192724
#Matched 8820559 26.57377%
#Name Reads ReadsPct
TruSeq_Universal_Adapter_rc 4729103 14.24741%
TruSeq_Indexed_Primer 4091456 12.32636%
Which brings me back to the question whether casual inclusion of additional adapter sequences in the reference file, that were not used during library prep, can lead to spurious cases of trimming that will result in output file that is an artifact where both correct trimming AND additional, unnecessary, unwanted trimming has occurred?
FYI, I am looking at HiSeq 4000 generated data, from haploid fungal spores, generated using the Kapa Hyper Plus Library Prep Kit - https://www.kapabiosystems.com/assets/KAPA_Hyper_Plus_Kit_TDS_021715.pdf.
Thanks!
I can't say anything about trimmomatic since I don't use it but with bbduk we use a single file with most commonly used commercial kit adapter sequences that @Brian includes. Have you tried using bbduk? With
tbe tpo
flags you can remove even single base contaminants from adapters in PE reads.genomax - thanks for pointing that out, original post updated with bbduk results as well. You may respond to my comment below, if you wish. Thanks!
Have you checked to see how many optical duplicates you have in this dataset considering this is a 4K data (
clumpify
is your program there from BBMap). It is possible that some of these reads may be removed (if you eliminate optical dups) and then the trimming results would look more palatable.I have to read up to make sure I understand "optical duplicates", did not know there was such a thing until you posted - thanks for raising this point.
If I understand what you saying, you recommend a quick clumpify run to empirically determine if my HiSeq4000 PE data has a lot of optical duplicates, right? Which is known to happen more with the new HiSeq3000/4000 platforms, than with earlier Illumina platforms - right?
So then, I'd use bbmerge.sh on my R1 & R2, then run clumpify to see what it reports, yes? I would not need to use bbsplitpairs.sh yet, correct?
Yes to questions in second para. Read about the optical dups and index misassignments here, here and here. It may be best to run clumpify on the original files. Is there a reason you are merging the two reads? Does ALLPATH needs the reads to be merged?
Thanks for those links!
May be merge is not the right technical, but interleaved. Gotta check if bbmerge does interleaving. This step appears necessary because clumpify.sh --help says
reformat.sh
does the interleaving but you do not need to have reads interleaved forclumpify
. That was an old requirement.As Genomax said, that was an old requirement, so please download the latest version. As for BBMerge, optical duplicate detection should be done on the raw (or adapter-trimmed) files, not merged reads (which generally works but will make the process unnecessarily complicated with a merged and unmerged dataset).