Forum: Trimming adapter sequences - is it necessary?
8.1 years ago
Lars ★ 1.1k


Removal of adapter sequences, in a process called read trimming or clipping, is one of the first steps in analyzing NGS data. With more than 30 published adapter trimming tools, there is a large choice of appropriate software. Yet there is a debate about whether this step really is as important as the number of tools suggests, or whether it can be skipped for many NGS applications.



Always better to be safe than sorry later on.


It all depends on what you want to achieve with your data.


Absolutely! But that's what they actually write in the linked text. Not always needed, but sometimes necessary. Nice and simple explanation of adapter contamination. Thanks to the writer.


This used to be a popular thing to do, but most people doing resequencing (WES/WGS) no longer trim sequences by quality or adapter. See http://gatkforums.broadinstitute.org/gatk/discussion/2957/read-trimming


Sorry for digging up such an old debate, but I was thinking that performing a local alignment might replace read trimming, be it quality trimming or adapter trimming. Does that make sense? Thank you for any input.


Most NGS aligners are "local" in the strict-5'/loose-3' sense.
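In practice, such an aligner soft-clips a non-matching 3' tail rather than forcing it to align, and the clip shows up in the SAM CIGAR string. A small illustrative Python helper (hypothetical, not from any aligner's codebase) that reads the clip lengths off a CIGAR:

```python
import re

def soft_clips(cigar):
    """Return (5'-clip, 3'-clip) lengths from a SAM CIGAR string."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    left = int(ops[0][0]) if ops and ops[0][1] == "S" else 0
    right = int(ops[-1][0]) if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right

# A 150 bp read whose last 31 bases are untrimmed adapter might be
# reported by a local aligner as 119 aligned bases plus a 31 bp 3' clip:
print(soft_clips("119M31S"))  # → (0, 31)
```

So a local aligner can effectively "trim" the adapter at alignment time, but the adapter bases are still carried along in the output.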


Bowtie2, for example, has an additional option to perform local alignments like this (it is not the default).

I think Subread-align works this way as well, but I don't know about other mapping programs.

8.1 years ago

Let's put the question in the opposite way: What is the advantage of NOT doing adapter trimming?

Some thoughts:

  • Save computational time and disk space. True, but if you can pipe the output of the trimmer to the next step, typically the alignment, then this point fades (see the small how-to: Trim & align paired-end reads in a single pass).

  • You trim more than necessary. In this case, just set the trimmer to be more conservative. If your genome instead contains sequences identical to the adapter, then you have a problem anyway.

  • One more step means more things that can go wrong, more dependencies, etc. I think this is a valid point.

Personally I'm quite happy to trim my sequences...


Good point. But if you work with HiSeq X Ten datasets, the clipping itself takes several hours, with or without piping. On the other hand, in that case a loss of 2% of your reads would mean several million reads. I would also always clip the adapter sequences.


clipping itself needs several hours, with and without piping

Mmmh... I'm not sure about this. I haven't really timed it, but if you can pipe trimming into alignment, I would bet the limiting step is the alignment, so the overhead of trimming is small.


Things are relative - not so long ago alignments may have taken many days - but aligners such as HISAT2 are incredibly fast - trimming could be the limiting factor. There is a lot less progress when it comes to QC tools.

Trimming is altering data - that is always a dangerous proposition. It is easy to set one parameter incorrectly (trim at a shorter overlap, among other things) and that all of a sudden makes the data enriched for some regions.

I would recommend trimming only when one identifies the presence of adapters as a problem.


Trimming with BBDuk is very fast, and allows piped output to an aligner... it's hard for me to imagine a scenario in which an aligner runs substantially faster than the trimmer on any genome bigger than a few dozen kb - and on a genome that small, time won't really be a concern anyway. And, for that matter, adapter-trimming with BBDuk is faster than identifying that you may have adapter contamination with, say, FastQC. Besides which, the presence of adapters can make many processes run slower - that includes alignment, error-correction, assembly, variant-calling, and various others. Even at low levels, possibly below your detection thresholds, not trimming adapters can cause bias - for example, under-representing short transcripts because of poor mapping due to mismatches in the adapter sequence.
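To make the read-through idea concrete: when the insert is shorter than the read length, the sequencer reads past the fragment into the adapter, so the adapter appears at the 3' end. A minimal Python sketch of 3'-adapter trimming by exact prefix matching (purely illustrative; BBDuk's actual kmer-based algorithm with hdist/mink/tbo/tpe is more sophisticated and mismatch-tolerant):

```python
# Illustrative sketch only: trim a 3' adapter by exact prefix matching.
ADAPTER = "ACGTTTTGTGCACGTTCGTGTACCAGTTGGT"  # the adapter used in this thread

def trim_adapter(read, adapter, min_overlap=11):
    """Trim at the leftmost position where the read's remaining 3' tail
    exactly matches the start of the adapter (or contains all of it)."""
    for i in range(len(read) - min_overlap + 1):
        tail = read[i:]
        if tail.startswith(adapter) or adapter.startswith(tail):
            return read[:i]
    return read

insert_seq = "TGCA" * 30              # toy 120 bp genomic insert
read = (insert_seq + ADAPTER)[:150]   # 150 bp read: insert + 30 bp of adapter
trimmed = trim_adapter(read, ADAPTER) # → back to the 120 bp insert
```

Real trimmers additionally tolerate sequencing errors in the adapter, which is exactly why low-quality 3' bases make trimming harder.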

To give some actual numbers, in kind of a worst-case scenario when mapping to PhiX, a 5386 bp reference -

First, I generated 4 million read pairs:

bbmap.sh ref=phix174_ill.ref.fa.gz
randomreads.sh reads=4m paired mininsert=100 maxinsert=400 len=150 out=stdout.fq fragadapter1=ACGTTTTGTGCACGTTCGTGTACCAGTTGGT fragadapter2=ACGTTTTGTGCACGTTCGTGTACCAGTTGGT illuminanames bell | reformat.sh in=stdin.fq out=r#.fq int ow

Then I trimmed them:

time bbduk.sh -Xmx1g in=r#.fq literal=ACGTTTTGTGCACGTTCGTGTACCAGTTGGT ktrim=r hdist=1 k=23 mink=11 tbo tpe out=trimmed#.fq ow

BBDuk version 36.32
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=1029m, free=1010m, used=19m

Added 1842 kmers; time:         0.032 seconds.
Memory: max=1029m, free=972m, used=57m

Input is being processed as paired
Started output streams: 0.260 seconds.
Processing time:                7.747 seconds.

Input:                          8000000 reads           1200000000 bases.
KTrimmed:                       98596 reads (1.23%)     2375384 bases (0.20%)
Trimmed by overlap:             69670 reads (0.87%)     359090 bases (0.03%)
Total Removed:                  0 reads (0.00%)         2734474 bases (0.23%)
Result:                         8000000 reads (100.00%)         1197265526 bases (99.77%)

Time:                           8.043 seconds.
Reads Processed:       8000k    994.68k reads/sec
Bases Processed:       1200m    149.20m bases/sec

real    0m8.339s
user    3m41.535s
sys     0m2.934s

Then I mapped both sets of reads:

time bbmap.sh -Xmx31g in=r#.fq

real    0m25.215s
user    10m19.473s
sys     0m5.501s


time bbmap.sh -Xmx31g in=trimmed#.fq out=mapped2.sam ow

real    0m15.913s
user    5m10.692s
sys     0m5.910s

The alignment rate was the same (100% read 1, 99.3% read 2), but the adapter-trimmed reads were actually faster (and, of course, had a lower observed error rate of 7.7% versus 9.7% since the adapters were gone). In fact, I did not expect it, but the adapter-trimmed reads were faster by more than the amount of time it took to do adapter-trimming - and on large or repetitive genomes, this difference would increase proportionally. This is despite only ~2% of the reads being trimmed and losing only 0.23% of the total bases. It's mainly because BBMap is a global aligner, though, and fills in more of the dynamic-programming alignment matrix when a read does not match the reference well; on a local aligner designed for quantification rather than variant detection, the results might be very different.

P.S. To reiterate, note that on a large and repetitive genome such as human, adapter-trimming would take the same amount of time but alignment would be dozens of times slower.


Thanks for checking that, Brian. We use bbduk a lot and it's really a pretty fast clipper.


Thanks for taking the time and backing everything up with real data.


Ran into the character limit, but here's an example of error-correction.

Raw:

tadpole.sh in=r#.fq ecc

Unique Kmers:                   996335

Input:                          8000000 reads           1200000000 bases.
Output:                         8000000 reads           1200000000 bases.
Errors detected:                1959038
Errors corrected:               683679  (683679 reassemble)
Reads with errors detected:     748397  (9.35%)
Reads fully corrected:          630039  (84.19% of detected)
Reads partly corrected:         5945    (0.79% of detected)
Rollbacks:                      322     (0.04% of detected)
Extend/error-correct time:      3.942 seconds.

Total Time:                     8.277 seconds.

Trimmed:

tadpole.sh in=trimmed#.fq ecc
Unique Kmers:                   635323

Input:                          8000000 reads           1197265526 bases.
Output:                         8000000 reads           1197265526 bases.
Errors detected:                650676
Errors corrected:               648971  (648971 reassemble)
Reads with errors detected:     617883  (7.72%)
Reads fully corrected:          616842  (99.83% of detected)
Reads partly corrected:         31      (0.01% of detected)
Rollbacks:                      32      (0.01% of detected)
Extend/error-correct time:      3.750 seconds.

Total Time:                     8.120 seconds.

In this case the time did not change much (though it does, on a large genome) but note the number of unique kmers dropped from 996335 to 635323 as a result of trimming. The memory required for assembly and error correction is often a linear function of the number of unique kmers (depending on how kmers are stored). Also, the number of fully-corrected reads changed from 84.19% of reads detected as having errors to 99.83% of reads detected as having errors, which is a huge increase (since adapter sequence will generally NOT get corrected to genomic sequence, but sequencing errors will). Having nearly 100% of reads error-free is often nice.
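The kmer-inflation effect is easy to reproduce in miniature. Below is a self-contained Python sketch (a toy random genome and a naive read simulator, not RandomReads or Tadpole, and no sequencing errors) showing that leaving adapter on short-insert reads adds many unique kmers that a trimmed dataset never sees:

```python
import random

K = 31
ADAPTER = "ACGTTTTGTGCACGTTCGTGTACCAGTTGGT"

def kmers(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def unique_kmers(n_reads, read_len=150, with_adapter=False, seed=1):
    """Count unique kmers over simulated reads from a PhiX-sized toy genome."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(5386))
    uniq = set()
    for _ in range(n_reads):
        insert_len = rng.randint(100, 400)
        start = rng.randint(0, len(genome) - insert_len)
        fragment = genome[start:start + insert_len]
        if with_adapter:
            fragment += ADAPTER  # untrimmed read-through into the adapter
        uniq |= kmers(fragment[:read_len])
    return len(uniq)

clean = unique_kmers(20000)                     # as if adapter-trimmed
dirty = unique_kmers(20000, with_adapter=True)  # adapters left in
# Same genome, same coverage: only the adapter junctions differ,
# yet 'dirty' accumulates far more unique kmers than 'clean'.
```

With the same seed, both runs draw identical inserts, so the entire difference in unique kmers comes from adapter-containing windows on reads whose insert is shorter than the read length.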


That is a dramatic change in kmers - I am trying to get a grip on the magnitude of it all. If that is in fact how it works I'll be a champion of trimming at all times.

Here it seems trimming adapters from just 2% of reads (and just 0.23% of total sequenced bases) drops the kmer numbers from 996K to 635K - that is a gigantic change. Would it be possible to isolate just the trimmed sequences and run tadpole.sh on just that subset?


Sure:

Filtering to get the reads that were trimmed (anything shorter than the original 150 bp read length):

reformat.sh in=trimmed#.fq out=short.fq maxlen=149

Input:                          8000000 reads           1197265526 bases
Short Read Discards:            7831734 reads (97.90%)  1174760100 bases (98.12%)
Output:                         168266 reads (2.10%)    22505426 bases (1.88%)

Filtering by name to get the raw reads with and without adapters:

filterbyname.sh in=r#.fq names=short.fq out=readsWithAdapters.fq include=t
filterbyname.sh in=r#.fq names=short.fq out=readsWithoutAdapters.fq include=f

Error-correction:

tadpole.sh in=readsWithAdapters.fq ecc

Unique Kmers:                   565379
Input:                          168266 reads            25239900 bases.
Output:                         168266 reads            25239900 bases.
Errors detected:                1156957
Errors corrected:               46654   (46654 reassemble)
Reads with errors detected:     134348  (79.84%)
Reads fully corrected:          24621   (18.33% of detected)
Reads partly corrected:         5964    (4.44% of detected)
Rollbacks:                      327     (0.24% of detected)
Extend/error-correct time:      0.606 seconds.

tadpole.sh in=readsWithoutAdapters.fq ecc

Unique Kmers:                   633463
Input:                          7831734 reads           1174760100 bases.
Output:                         7831734 reads           1174760100 bases.
Errors detected:                639162
Errors corrected:               637469  (637469 reassemble)
Reads with errors detected:     606851  (7.75%)
Reads fully corrected:          605814  (99.83% of detected)
Reads partly corrected:         31      (0.01% of detected)
Rollbacks:                      32      (0.01% of detected)
Extend/error-correct time:      5.280 seconds.

Note that they do not sum to 996335 because most of the unique kmers come from errors and are shared due to high coverage.

Also, bear in mind that the effect of adapters on kmer count is magnified as well by the high coverage. Adapter-trimming does not usually yield nearly that big of a relative difference when the coverage is ~100x.


Thanks for putting up with all the requests. There is something weird in this data though - but that in itself is not surprising IMO. Good data are all alike - bad data can be weird in all kinds of unique ways.

Here the first 168K reads produce over 1.1 million errors (a 4.5% error rate) and almost as many kmers as the other 7.8 million reads. That is really unexpected, since the latter is 46 times more data. This cannot be explained merely by the presence of an adapter. After all, having an adapter left over on these sequences should actually reduce the number of kmers - we'd have the same sequence at the end forming the same kmers.

What it looks like is that the same sequences with adapters also have extremely high rates of miscalls. Errors are indeed a source of lots of kmers. Is that because of the adapters or maybe these are so bad that they start to look like adapters because of the high rates of miscalls? It is bad sequence either way but it does open up some questions.


It's possible that the source of misunderstanding is that there was super-high coverage. I was trying to generate the worst-case scenario in which adapter-trimming would be least effective, speed-wise, prior to alignment. I chose PhiX because it is a tiny genome, so it should require the most time spent on trimming (which is proportional to the data volume) compared to alignment (which is related to reference size). I was surprised when I found this to not be the case.

When you say "What it looks like is that the same sequences with adapters also have extremely high rates of miscalls" - it's important to understand that sequences with adapters that are not trimmed will result in miscalls when mapping, assembling, etc. It is sequence presented as being genomic, with (often) high quality, that is in fact not high-quality genomic sequence, but just synthetic sequence chimerically fused to genomic sequence. The high-quality calls are real - it is, in fact, valid sequence - it's just synthetic, rather than coming from the genome.

Tadpole, by default, uses 31-mers. When you consider 31-mers, if there is a single junction between adapter and genomic sequence, an error-free read will still generate up to 31 erroneous kmers. Reads with genomic-adapter junctions (meaning reads whose insert size is less than the read length) can spawn a lot of these. The number depends on the insert size versus the read length, but if the insert size is at least 31 bp less than the read length, even a perfectly accurate read can spawn 31 error kmers.
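The arithmetic behind the figure of 31 can be checked in a few lines of Python (toy random genome, illustrative only): a window of length K overlaps the junction in K-1 = 30 positions, and with a 31 bp adapter there is also one pure-adapter window, so an error-free short-insert read yields 31 kmers that never occur in the genome.

```python
import random

K = 31
ADAPTER = "ACGTTTTGTGCACGTTCGTGTACCAGTTGGT"  # 31 bp, as in the examples above

rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(5386))  # toy PhiX-sized genome
genomic = {genome[i:i + K] for i in range(len(genome) - K + 1)}

insert = genome[1000:1119]        # 119 bp insert, shorter than the 150 bp read
read = (insert + ADAPTER)[:150]   # error-free read with adapter read-through

bad = [read[i:i + K] for i in range(len(read) - K + 1)
       if read[i:i + K] not in genomic]
# 30 kmers span the genomic-adapter junction, plus 1 pure-adapter kmer:
print(len(bad))  # → 31
```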

BBTools do not assign a higher error rate to short-insert reads than any other reads. Reads are assigned quality scores, and errors are determined based on those scores. This includes adapter sequence, so adapters are not exact - they are also randomly given errors based on quality scores. Since RandomReads by default models Illumina's error profile of decreased accuracy with each cycle, bases at the end always have the highest error rate, which makes adapters harder to trim.
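That error model is essentially the standard Phred relation, p = 10^(-Q/10). A hedged Python sketch of quality-driven error injection (my own illustration of the idea, not RandomReads' actual code):

```python
import random

def add_errors(read, quals, rng):
    """Flip each base with probability 10**(-Q/10) given its quality score."""
    out = []
    for base, q in zip(read, quals):
        if rng.random() < 10 ** (-q / 10):
            out.append(rng.choice([b for b in "ACGT" if b != base]))  # miscall
        else:
            out.append(base)
    return "".join(out)

# Declining qualities mimic Illumina's loss of accuracy with cycle number,
# so 3' bases (where any leftover adapter sits) carry the most errors.
rng = random.Random(0)
read = "ACGT" * 25
quals = [40] * 50 + [15] * 50   # high-quality 5' half, noisy 3' half
noisy = add_errors(read, quals, rng)
```

Under this model, adapter bases at the end of a read get the highest error rates, which is exactly why exact-match trimmers miss them.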

Essentially, high coverage inflates the effect of adapter-trimming on unique kmer counts. I thought I made that very clear in my posts, but I probably should have made it much clearer that this situation has artificially high coverage, and I should have generated a completely different dataset for testing kmer frequencies, using a different organism and a different expected worst-case scenario. I apologize for failing to do that.


Sounds good. That does indeed clarify things.

Even in real life I do run into situations where people generate data with overly high coverage over small genomes (their goal is to make the results more reliable) - but high coverage in turn can introduce a whole slew of other errors. It looks like error correction and trimming become even more important in that situation.

This will be a helpful thread to refer to in the future.


Most current variant calling models do not work well with >1000X coverage. I believe dependent errors are to blame. And because of error dependencies, error correction should be used with caution. Error correctors usually need to discard some information to fit their profiles into memory; this information may be important for identifying dependent errors. Variant callers only hold a small fraction of the data in memory, so they can take full advantage of all the information, if modeled properly.


Talking about error correction, I tried bbmap when I wrote the bfc paper. At that time (>1.5 years ago), bbmap did not compete well with other error correctors. In the end, I removed the bbmap results from the paper, but you can still find the numbers in the tex file and the command lines in README.md in that directory. I think you need to evaluate the bbmap error corrector together with others, if you have not done that yet. Note that generally, we do NOT correct sequencing errors before variant calling. Variant callers are much better at separating variants from errors. K-mer based correctors occasionally introduce new errors that fool variant callers. I have seen this multiple times.

On adapter trimming, I fully agree with you that it should be done on the fly. I rarely use published adapter trimmers because most of them are way too slow.


I've been working on Tadpole quite a bit over the last few months on metagenomes and single cells, using bfc as a benchmark for comparison. It does quite well now; the resulting Spades assemblies are similar for metagenomes, but Tadpole does substantially better with single cells. Tadpole's faster, though.


Thanks. I now see tadpole is a new tool and my old evaluation did not apply at all. Good to know! Yes, single-cell and metagenomics require special considerations.


Hi Istvan - thanks for chipping in. Yes, good point; I wasn't thinking about "newer" aligners like HISAT or mappers like kallisto.

About your second point, though, I'm not sure I entirely agree. In my experience, mostly with cutadapt, trimming sequences with little or no adapter contamination returns >99% of the sequenced bases back (bases, not reads). If the final results depend on that <1% that looks like adapter, then there is a problem anyway. I think you have to set the trimming parameters really wrong, or misunderstand them, to tangibly bias the results.

Yes, in principle I agree that altering the data is always risky, and other things being equal it's better to avoid it (I'm thinking about some discussions here on Biostars about batch correction for gene expression giving unexpected results). But when does adapter presence become a problem... 1%, 10%, or never? When in doubt, I prefer to trim, especially with bisulfite data.


That's a very good point, Istvan. I'm absolutely with you. We actually saw high fluctuations in called variants when using different clipping algorithms and the exact same downstream workflow.


David,

We actually saw high fluctuations in called variants when using different clipping algorithms

I reiterate my comment above: if the results depend on that little percentage of differential trimming, then the analysis is fragile and the results should be taken with care, regardless of trimming. If different trimming algorithms give very different results with similar settings, then maybe one of them is doing something strange or they are being used improperly. (Once I quickly compared cutadapt with the output of the trimmer embedded in the Illumina pipeline and the results were virtually identical; sorry, I don't have the data at hand!)


We were also surprised by this observation and right now we're about to study the cause. Unfortunately, I cannot share any example, since it's clinical data. But if we find out the problem and it is due to some software issue, I'll share it.


Should be easy to test, no?


If you pipe reads directly into the aligner, the speed-limiting factor will always (with good current software) be the alignment and never the trimming/clipping. When running cutadapt piped into bowtie2, cutadapt runs at somewhat < 50% CPU when I use 16 threads for bowtie2, which is not a surprise as trimming is a trivial task compared to alignment.


As for point 1... I would say that adapter-trimming saves space (regardless of whether you use piping) and often saves computational time (see my example for alignment, but this is especially true for kmer-counting-related tasks such as assembly and for error-related tasks like error-correction and probably variant-calling, depending on the algorithm).

Though I agree with you overall, of course.


Brian, thank you for your comments - very interesting observations! Adapter trimming saving space is true if you replace the raw reads with the trimmed ones (and you have a copy of the raw reads elsewhere, of course), but there is still that moment where you have both raw and trimmed reads at the same time (I did have problems with disk quota because of this!).

8.1 years ago

Here is a publication with some benchmarking on adapter trimming: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4862148/

It nicely shows the effect of trimming on the mapping time, as discussed above.



That's slightly misleading in the context of this thread; to clarify, the linked article is about quality trimming, not adapter-trimming.


You are correct. But quality trimming is mostly done together with adapter trimming. Still, it's a different topic. Thank you for the clarification.

8.1 years ago

I would argue it is definitely necessary for mate pair reads that are used for an assembly. You will very likely see a big improvement in N50 after you fix this problem. Adapter contamination is high (it's not unusual to see ~30% of the reads with the biotinylated adaptors still present). I explain this in my blog http://sites.psu.edu/biomonika/2015/07/03/trim-now-or-never-pe-and-mp-adaptor-removal/ (section "Trimming MATE-PAIR reads" and below).

link to the Nextera Mate Pair Reads Protocol here: http://www.illumina.com/documents/products/technotes/technote_nextera_matepair_data_processing.pdf

8.1 years ago
John 13k

There are applications (e.g. small RNA sequencing) where adapter trimming is highly necessary. But there are also applications (transcriptome sequencing, whole genome sequencing, etc.) where adapter contamination can be expected to be so small (due to an appropriate size selection) that one could consider skipping the adapter removal and thereby save time and effort.

I also don't wash my dishes after a meal. I mean sure, there are some meals (e.g. cereal, pasta) where cleaning the dishes is highly necessary. But there are also meals (pizza, fajitas, etc.) where leftover food can be expected to be so small (due to appropriate hunger) that one could consider skipping the cleaning of the dishes and thereby save time and effort.

(also you must licence Highcharts if you're going to replace their name with ecSeq GmbH)


(also you must licence Highcharts if you're going to replace their name with ecSeq GmbH)

???


Since you (OP) are apparently part of ecSeq, John is taking the opportunity to point out that, at the bottom of the ecSeq page that you linked to, there is an illegal (as of 9/16/2016) usage of HighCharts.


Thanks John and Keith for your free of charge legal advice. It seems both of you are really eager to keep my company out of trouble. Nevertheless... trust me... you can relax. :)

Furthermore, Lars is unfortunately not on our payroll, but I know this intelligent guy quite well. If you don't want him to share our blog entries, you should tell him directly. Writing some random, unfitting stuff helps nobody.

But it seems that people here actually enjoy the discussion. I definitely do!


You can always pay me for legal advice if you're not OK with it being free ;-)

As for everything else you said, all I'm getting is that you really, really like Lars. That's great. I also like Lars. He writes well. I also like his diagrams.

I do not, however, agree that one should not mark adapters simply because it saves time and effort. That, despite being written by Lars, is bad advice. It sounds a lot like not washing the dishes after a meal because it takes too much time and effort. It is particularly bad advice to be giving new users who I presume are the target-audience for such blog posts.

To be honest this whole thread and many of its answers have been a bit surprising to me. I'm not usually one to take the moral high-ground because frankly I'm an asshole. Today someone asked me if I could spare some change and I said "No. I need a Starbucks.", then walked into Starbucks. I don't even like Starbucks. I just buy it to keep my hands warm while I take the train to work. When I got to work it was cold so I threw it in the bin. I'm a huge asshole.

Nevertheless, like many people on this forum, I work on behalf of the taxpayer - people who are giving a small percentage of their yearly income to us to work out these biological problems. When they give us their money, I presume they are not OK with us doing a half-ass'd job. I'm not even slightly surprised that leaving in adapters increases mapping time. I'm certainly not surprised that it affects variant calling. And of course for anything kmer related, some proportion of your data is going to be worse-than-noise if you leave adapters in. But you know the real reason we should mark adapters? Because for many of us, that's our job, and even if the tax-payer never looks at the data we produce, they deserve their CIGAR strings to have the adapter sequences soft-clipped and marked as adapters in the tags, rather than just sitting there in the data waiting for someone to trip over it.


Thanks for your input.
