Question

General question about batch effect, read trimming and what to do when the adapter trimming step is not working appropriately.

1

6.0 years ago

Mozart ▴ 330

Hello everyone, I have a huge dataset with a bunch of human samples to analyse. Of course, I ran into trouble because the samples come from different donors and when I PCA those samples, well...it's a bit dodgy. They cluster according to their condition but I am not sure how I am supposed to deal with this batch effect. A while ago I used the SVA package, but I wasn't happy with it.

A problem related to this is probably due to the fact that my samples are not trimmed appropriately. I have a lot of problems with the facility that generated these fastq files because sometimes they provide me trimmed samples, sometimes they don't (given the fact that this whole dataset comes from different batches/years). Thus, my questions:

1. Don't you think that all of my samples, to generate useful data, must be processed in exactly the same way (e.g. same SLIDINGWINDOW, LEADING, TRAILING, MINLEN)? I am quite confused about this.
2. What if, by any chance, I trim an already-trimmed file?
3. When I try to trim my samples, I don't manage to remove the adapter contamination. According to my beloved MultiQC report there's a huge Nextera transposase sequence contamination that Trimmomatic can't remove, even when selecting specific adapters...

Yours, M

RNA-Seq trimmomatic trimming adapters • 2.8k views

ADD COMMENT • link updated 6.0 years ago by swbarnes2 14k • written 6.0 years ago by Mozart ▴ 330

0

when I PCA those samples, well...it's a bit dodgy

How was PCA done and how was data normalization/regularization performed?

They cluster according to their condition

Isn't that expected, as this is the biological difference?

problem with the facility that generated these fastq files because sometimes they provide me trimmed samples

It is very uncommon for facilities to provide adapter-trimmed samples. Do you really mean trimmed, or demultiplexed?

As for the questions 1-3:

1. Yes, data should be uniformly processed, but re-trimming a dataset is probably not harmful, as there should be little effect if the adapter sequence is indeed no longer present.
2. See 1).
3. Did you provide the correct adapter sequence? See, for example, code on the web. If the sequence persists, something in your command is probably wrong. Can you share some command lines?

ADD REPLY • link 6.0 years ago by ATpoint 87k

0

As a small addition, generate a FastQC report for each sample before and after trimming. Afterwards, run MultiQC on the reports.

Then you'll see the differences in adapter content, read length, etc.
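As a rough cross-check alongside the FastQC/MultiQC reports, the before/after comparison can also be sketched directly on the FASTQ files. This is only a minimal sketch: the adapter string below is the commonly cited Nextera transposase sequence and is an assumption to confirm against your kit documentation, and `adapter_fraction` and `min_match` are hypothetical names introduced here, not part of any tool.

```python
import gzip
import os
import tempfile

# Assumption: commonly cited Nextera transposase adapter sequence;
# confirm it against the documentation of the library kit actually used.
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"


def adapter_fraction(fastq_path, adapter=NEXTERA_ADAPTER, min_match=12):
    """Return the fraction of reads containing the first `min_match` adapter bases."""
    probe = adapter[:min_match]
    # Handle both gzip-compressed and plain-text FASTQ files.
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    total = hits = 0
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # 2nd line of each 4-line record is the sequence
                total += 1
                if probe in line:
                    hits += 1
    return hits / total if total else 0.0


# Toy demonstration: two reads, one of which carries the adapter.
with tempfile.TemporaryDirectory() as tmp:
    toy = os.path.join(tmp, "toy.fastq")
    with open(toy, "w") as fh:
        fh.write("@read1\nACGTACGT" + NEXTERA_ADAPTER + "\n+\n" + "I" * 27 + "\n")
        fh.write("@read2\nACGTACGTACGTACGTACGT\n+\n" + "I" * 20 + "\n")
    print(adapter_fraction(toy))  # 0.5
```

Running it on the same sample before and after trimming should show the fraction drop towards zero if the trimmer was given the right adapter. It is an exact-substring check, so it undercounts adapters containing sequencing errors.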

ADD REPLY • link 6.0 years ago by michael.ante ★ 4.0k

0

Thanks ATpoint for your question. I am judging the PCA according to someone else's analysis. I haven't had the chance to get to that point yet. By the way, I guess there is very little variation amongst the different samples.

Anyway I solved the issue but, as you can see below, I am not sure if I have to use either paired or unpaired samples, after trimming.

ADD REPLY • link 6.0 years ago by Mozart ▴ 330

0

I have recently used Trimmomatic to remove nextera transposase sequence so it is probably just a matter of providing the correct sequence to use.

ADD REPLY • link 6.0 years ago by Kristoffer Vitting-Seerup ★ 4.1k

1

Agreed. The standard tools (I use cutadapt) all perform more or less equally well, and if trimming does not work it is 99.9% of the time a user-induced problem (wrong commands, wrong adapter sequences provided, etc.).

ADD REPLY • link 6.0 years ago by ATpoint 87k

0

So, for quality's sake, paired reads may be more reliable for the further steps.

Can anyone confirm this?

ADD REPLY • link 6.0 years ago by Mozart ▴ 330

1

6.0 years ago

colindaven 7.4k

Try alternative trimmers too. I use fastp and ea-utils fastq-mcf for tricky samples besides the standard Trimmomatic.

I also use multiple rounds of trimming to, e.g., remove adapters from some tricky short sequences such as miRNAs or amplicons.

Multiple rounds of FastQC and MultiQC are also necessary.

ADD COMMENT • link 6.0 years ago by colindaven 7.4k

1

6.0 years ago

Biogeek ▴ 480

I'd recommend using BBDuk from the BBTools suite by Brian Bushnell. It ships with an extensive adapters.fa file containing all publicly available adaptor sequences, just an idea. People very often use Trimmomatic without the correct adaptor sequence .fa file. Admittedly, I made that mistake once too before realising it. The performance of BBDuk is supposedly superior to Trimmomatic.

Once you've tried BBDuk, report back the QC results. The log output will also tell you the % of adaptor sequence detected and removed.
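For anyone wanting to try this, a BBDuk paired-end adapter-trimming call might be assembled roughly as below. This is a sketch, not a recommendation of exact settings: the parameter values follow the adapter-trimming example in the BBDuk usage guide as far as I recall it, all file paths are hypothetical, and the helper `bbduk_adapter_trim_command` is a name made up for illustration. The command is only built and printed, not executed.

```python
import shlex


def bbduk_adapter_trim_command(in1, in2, out1, out2, adapters_fasta):
    """Build (but do not run) a BBDuk paired-end adapter-trimming command."""
    return [
        "bbduk.sh",
        f"in1={in1}",
        f"in2={in2}",
        f"out1={out1}",
        f"out2={out2}",
        f"ref={adapters_fasta}",  # adapter FASTA, e.g. the adapters.fa shipped with BBTools
        "ktrim=r",   # trim the matched adapter and everything to its right (3' side)
        "k=23",      # k-mer length used to match reads against the reference
        "mink=11",   # allow shorter k-mers at the ends of reads
        "hdist=1",   # tolerate one mismatch per k-mer
        "tpe",       # trim both reads of a pair to the same length
        "tbo",       # also trim adapters detected via read-pair overlap
    ]


# Hypothetical file names, for illustration only.
cmd = bbduk_adapter_trim_command(
    "sample1_R1_001.fastq.gz",
    "sample1_R2_001.fastq.gz",
    "sample1_R1_bbduk.fastq.gz",
    "sample1_R2_bbduk.fastq.gz",
    "adapters.fa",
)
print(shlex.join(cmd))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where `bbduk.sh` is on the PATH.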

Best.

ADD COMMENT • link 6.0 years ago by Biogeek ▴ 480

0

6.0 years ago

Mozart ▴ 330

Thanks to all of you for the useful replies. Here is the code I am using:

java -jar /Users/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 -threads 4 \
  /Users/FASTQ/sample1_R1_001.fastq.gz /Users/FASTQ/sample1_R2_001.fastq.gz \
  /Users/FASTQ/sample1_R1_paired.fastq.gz /Users/FASTQ/sample1_R1_unpaired.fastq.gz \
  /Users/FASTQ/sample1_R2_paired.fastq.gz /Users/FASTQ/sample1_R2_unpaired.fastq.gz \
  ILLUMINACLIP:/Users/Trimmomatic-0.39/adapters/NexteraPE-PE.fa SLIDINGWINDOW:value LEADING:value TRAILING:value MINLEN:value

It seems to work now because, to be honest, I slightly changed the code. Looking at the QC report again, it seems I managed to remove the adapter contamination.
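Since the thread also asks whether every sample should get identical trimming settings, one way to guarantee that is to build the step list once and reuse it for each sample. The sketch below assumes the ILLUMINACLIP form `fasta:seed mismatches:palindrome clip threshold:simple clip threshold` and uses the example values from the Trimmomatic manual (2:30:10, LEADING:3, TRAILING:3, SLIDINGWINDOW:4:15, MINLEN:36) as defaults. These are not the values used in the post above, which were not given. The function names and the sample naming scheme are hypothetical, and the commands are only built, not run.

```python
def trimmomatic_steps(adapter_fasta, seed_mismatches=2, palindrome_threshold=30,
                      simple_threshold=10, leading=3, trailing=3,
                      window=4, window_quality=15, minlen=36):
    """Return one shared list of Trimmomatic steps (defaults: manual's example values)."""
    return [
        f"ILLUMINACLIP:{adapter_fasta}:{seed_mismatches}:"
        f"{palindrome_threshold}:{simple_threshold}",
        f"LEADING:{leading}",
        f"TRAILING:{trailing}",
        f"SLIDINGWINDOW:{window}:{window_quality}",
        f"MINLEN:{minlen}",
    ]


def trimmomatic_pe_command(jar, sample_prefix, steps, threads=4):
    """Build (but do not run) a paired-end Trimmomatic command for one sample."""
    # Output order expected in PE mode:
    # R1 paired, R1 unpaired, R2 paired, R2 unpaired.
    outputs = [
        f"{sample_prefix}_{read}_{kind}.fastq.gz"
        for read in ("R1", "R2")
        for kind in ("paired", "unpaired")
    ]
    return [
        "java", "-jar", jar, "PE", "-phred33", "-threads", str(threads),
        f"{sample_prefix}_R1_001.fastq.gz",
        f"{sample_prefix}_R2_001.fastq.gz",
        *outputs,
        *steps,
    ]


# The same steps are applied to every sample, so processing stays uniform.
steps = trimmomatic_steps("NexteraPE-PE.fa")
for sample in ("sample1", "sample2"):  # hypothetical sample names
    print(" ".join(trimmomatic_pe_command("trimmomatic-0.39.jar", sample, steps)))
```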

At the end of this process, I should use the paired files for the downstream analysis, right?

Thanks,

M

ADD COMMENT • link 6.0 years ago by Mozart ▴ 330

0

I have re-read the manual again and again. The paired output files are the trimmed FASTQ files containing only those read pairs in which both reads (one from each input FASTQ file) survived the processing.
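The paired/unpaired split described above can be illustrated with a toy sketch. It is not Trimmomatic's code, only the bookkeeping idea: a read pair goes to the paired outputs when both mates survive, and a surviving read whose mate was dropped goes to the corresponding unpaired output. The function name and read IDs are made up for the example.

```python
def split_by_pair_survival(r1_survivors, r2_survivors):
    """Classify read IDs by whether both mates, or only one, survived trimming."""
    r1, r2 = set(r1_survivors), set(r2_survivors)
    return {
        "paired": sorted(r1 & r2),        # both mates survived
        "r1_unpaired": sorted(r1 - r2),   # forward read survived, mate dropped
        "r2_unpaired": sorted(r2 - r1),   # reverse read survived, mate dropped
    }


# Toy example: read "a" lost its reverse mate, read "d" lost its forward mate.
result = split_by_pair_survival(["a", "b", "c"], ["b", "c", "d"])
print(result)
# {'paired': ['b', 'c'], 'r1_unpaired': ['a'], 'r2_unpaired': ['d']}
```

Paired-end aligners expect the two input files to list the same reads in the same order, which is why the two paired files are the ones to carry forward.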

ADD REPLY • link 6.0 years ago by Mozart ▴ 330

score 3 · Accepted Answer · 2019-04-30

3

6.0 years ago

swbarnes2 14k

A problem related to this is probably due to the fact that my samples are not trimmed appropriately.

I wouldn't be so sure. For instance, I know that STAR aligner is pretty robust to having wrong sequence on the ends of reads.

If your trimmer isn't trimming anything, maybe nothing needs trimming. If you have a big batch effect, that's likely real, and not an artifact you can fix.

ADD COMMENT • link 6.0 years ago by swbarnes2 14k

0

Thanks swbarnes2. I am now uncertain about the following step: I have re-read the Trimmomatic manual and, as you know, for paired-end analysis it generates 4 output files. Of these, should I use the paired output for the downstream analysis?

ADD REPLY • link 6.0 years ago by Mozart ▴ 330

0

Mozart, the paired files are what you use. I just want to confirm with you, though: are these the correct adaptor sequences for the kit your sequencing was performed with? It may be that the adaptor sequences were already removed by the facility, which is common. In that case, a very low % of reads will be trimmed.

Link to Nextera adaptor information

You can go to the link above or contact the sequencing facility and check whether the adaptors used are the same as those in the Trimmomatic NexteraPE-PE.fa file. Best to do this now, before you proceed with downstream analyses.

ADD REPLY • link 5.9 years ago by Biogeek ▴ 480

0

Thanks Biogeek. I have to double check this again. I understand that BCL data coming out of the sequencer (and then converted into FASTQ files) are subjected to an adapter trimming step, so I may have untrimmed samples with no contamination. Then, what if I don't have adapter contamination in my untrimmed samples and the sequence quality is OK for downstream analysis (i.e. alignment)? Should I perform the trimming step anyway?

ADD REPLY • link 5.9 years ago by Mozart ▴ 330