General question about batch effect, read trimming and what to do when the adapter trimming step is not working appropriately.
5.6 years ago
Mozart ▴ 330

Hello everyone, I have a huge dataset with a bunch of human samples to analyse. Of course, I ran into trouble because the samples come from different donors and when I PCA those samples, well...it's a bit dodgy. They cluster according to their condition but I am not sure how I am supposed to deal with this batch effect. Some time ago I used the SVA package, but I wasn't happy with the result.

A problem related to this is probably due to the fact that my samples are not trimmed appropriately. I have a recurring problem with the facility that generated these fastq files because sometimes they provide me trimmed samples, sometimes they don't (the whole dataset comes from different batches/years). Thus, my questions:

  1. Shouldn't all of my samples be processed in exactly the same way (e.g. same SLIDINGWINDOW, LEADING, TRAILING, MINLEN) to generate useful data? I am quite confused about this.
  2. What happens if I trim an already-trimmed file?
  3. When I try to trim my samples, I cannot remove the adapter contamination. According to my beloved MultiQC report, there is heavy Nextera transposase sequence contamination that Trimmomatic does not remove, even when I select specific adapters.

Yours, M

RNA-Seq trimmomatic trimming adapters • 2.5k views

when I PCA those samples, well...it's a bit dodgy

How was PCA done and how was data normalization/regularization performed?

They cluster according to their condition

Isn't that expected, as this is the biological difference?

problem with the facility that generated these fastq files because sometimes they provide me trimmed samples

It is very uncommon for facilities to provide adapter-trimmed samples. Do you really mean trimmed, or demultiplexed?

As for the questions 1-3:

  1. Yes, data should be processed uniformly, but re-trimming a dataset is probably not harmful: there should be little effect if the adapter sequence is indeed no longer present.
  2. See 1).
  3. Did you provide the correct adapter sequence? See, for example, the code available on the web. If the sequence persists, something in your command is probably wrong. Can you share some command lines?

As a small addition, generate a FastQC report for each sample before and after trimming. Afterwards, run MultiQC on the reports.

Then you'll see the differences in adapter content, read length, etc.
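
Alongside the FastQC and MultiQC reports, a quick command-line check can confirm whether trimming changed the read lengths at all. The sketch below is only a rough complement to those tools, not a replacement. `read_length_summary` is a hypothetical helper name, and the code assumes standard four-line FASTQ records (gzipped or plain).

```shell
# Print the number of reads and the min/mean/max read length of a FASTQ file.
# Assumption: every record spans exactly four lines (no wrapped sequences).
read_length_summary() {
    # gzip -cdf decompresses gzipped input and passes plain text through unchanged
    gzip -cdf < "$1" | awk 'NR % 4 == 2 {
        len = length($0); n++; sum += len
        if (n == 1 || len < min) min = len
        if (len > max) max = len
    } END {
        if (n == 0) { print "reads=0"; exit }
        printf "reads=%d min=%d mean=%.1f max=%d\n", n, min, sum / n, max
    }'
}
```

Running it on the same sample before and after trimming (for example on `sample1_R1_001.fastq.gz` and `sample1_R1_paired.fastq.gz`) should show a lower minimum and mean length if anything was removed.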


Thanks ATpoint for your question. I am judging the PCA according to someone else's analysis. I haven't had the chance to get to that point myself yet. By the way, I guess there is very little variation amongst the different samples.

Anyway, I solved the issue but, as you can see below, I am not sure whether I should use the paired or the unpaired output after trimming.


I have recently used Trimmomatic to remove the Nextera transposase sequence, so it is probably just a matter of providing the correct sequence.


Agreed. The standard tools (I use cutadapt) all perform more or less equally well, and if trimming does not work it is 99.9% of the time a user-induced problem (wrong commands, wrong adapter sequences provided, etc.).
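
One way to rule out the wrong-adapter-sequence case is to count how many reads actually contain the adapter before and after trimming. This is a minimal sketch, assuming four-line FASTQ records. The default search string `CTGTCTCTTATA` is the start of the Nextera transposase sequence (to my knowledge the same 12-mer FastQC searches for); check it against the kit your facility used. `count_adapter_reads` is a hypothetical helper name.

```shell
# Count reads whose sequence line contains an exact copy of the adapter fragment.
# Assumptions: four lines per FASTQ record; exact, forward-strand matches only,
# so this underestimates contamination when the adapter carries sequencing errors.
count_adapter_reads() {
    local fastq="$1"
    local adapter="${2:-CTGTCTCTTATA}"   # assumed Nextera transposase 12-mer
    gzip -cdf < "$fastq" |
        awk -v a="$adapter" 'NR % 4 == 2 && index($0, a) { n++ } END { print n + 0 }'
}
```

If the count is already close to zero in the untrimmed file, the trimmer has nothing to remove. If it stays high after trimming, the adapter file or the command is the likely culprit.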


So, for quality's sake, the paired reads may be more reliable for the downstream steps.

Can anyone confirm this?

5.6 years ago

A problem related to this is probably due to the fact that my samples are not trimmed appropriately.

I wouldn't be so sure. For instance, I know that the STAR aligner is pretty robust to extraneous sequence at the ends of reads, since it can soft-clip ends that do not align.

If your trimmer isn't trimming anything, maybe nothing needs trimming. If you have a big batch effect, that's likely real, and not an artifact you can fix.


Thanks swbarnes2. I am now uncertain about the following step: I have re-read the Trimmomatic manual and, as you know, paired-end mode generates 4 output files. Of these, should I use the paired output for the downstream analysis?


Mozart, the paired files are what you use. I just want to confirm with you, though: are these the adaptor sequences your sequencing was actually performed with? It may be that the adaptor sequences were already removed by the facility, which is common. In that case, a very low percentage of reads will be trimmed.

Link to Nextera adaptor information

You can follow the link above or contact the sequencing facility to check whether the adaptors used are the same as those in the Trimmomatic NexteraPE-PE.fa file. Best to do this now, before you proceed with downstream analyses.


Thanks Biogeek. I have to double check this again. I understood that BCL data coming out of the sequencer are subjected to an adapter trimming step when they are converted into FASTQ files, so I may have untrimmed samples with no contamination. Then, what if I don't have adapter contamination in my untrimmed samples and the sequence quality is OK for downstream analysis (i.e. alignment)? Should I perform the trimming step anyway?

5.6 years ago

Try alternative trimmers too. I use fastp and ea-utils fastq-mcf for tricky samples besides the standard Trimmomatic.

I also use multiple rounds of trimming, e.g. to remove adapters from tricky short sequences such as miRNAs or amplicons.

Multiple rounds of FastQC and MultiQC are also necessary.

5.6 years ago
Biogeek ▴ 470

I'd recommend using BBDuk from the BBTools suite by Brian Bushnell. It ships with an extensive adapters.fa file containing a wide range of publicly available adaptor sequences; just an idea. It is surprising how often people use Trimmomatic without the correct adaptor sequence .fa file. Admittedly, I made that mistake once myself before realising it. The performance of BBDuk is supposedly superior to Trimmomatic.

Once you've tried BBDuk, report back the QC results. The log output will also tell you what percentage of reads contained adaptor sequence and were trimmed.

Best.

5.6 years ago
Mozart ▴ 330

Thanks to all of you for the useful replies. Here is the code I am using:

java -jar /Users/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 -threads 4 \
  /Users/FASTQ/sample1_R1_001.fastq.gz /Users/FASTQ/sample1_R2_001.fastq.gz \
  /Users/FASTQ/sample1_R1_paired.fastq.gz /Users/FASTQ/sample1_R1_unpaired.fastq.gz \
  /Users/FASTQ/sample1_R2_paired.fastq.gz /Users/FASTQ/sample1_R2_unpaired.fastq.gz \
  ILLUMINACLIP:/Users/Trimmomatic-0.39/adapters/NexteraPE-PE.fa SLIDINGWINDOW:value LEADING:value TRAILING:value MINLEN:value
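
For reference, `ILLUMINACLIP` takes the adapter file followed by three colon-separated thresholds (seed mismatches, palindrome clip threshold, simple clip threshold). The sketch below prints one command per sample with identical settings, so they can be reviewed before anything is run. The numeric values are taken from the Trimmomatic usage example and are only stand-ins for the `value` placeholders, not the settings actually used in this thread; `trimmomatic_cmd` and `sample2` are hypothetical.

```shell
JAR=/Users/Trimmomatic-0.39/trimmomatic-0.39.jar
ADAPTERS=/Users/Trimmomatic-0.39/adapters/NexteraPE-PE.fa
# Assumed example settings; replace with the values chosen for the project.
STEPS="ILLUMINACLIP:${ADAPTERS}:2:30:10 SLIDINGWINDOW:4:15 LEADING:3 TRAILING:3 MINLEN:36"

# Print (not run) the paired-end Trimmomatic command for one sample.
# Arguments: FASTQ directory, sample name prefix.
trimmomatic_cmd() {
    local dir="$1" sample="$2"
    echo "java -jar ${JAR} PE -phred33 -threads 4" \
        "${dir}/${sample}_R1_001.fastq.gz ${dir}/${sample}_R2_001.fastq.gz" \
        "${dir}/${sample}_R1_paired.fastq.gz ${dir}/${sample}_R1_unpaired.fastq.gz" \
        "${dir}/${sample}_R2_paired.fastq.gz ${dir}/${sample}_R2_unpaired.fastq.gz" \
        "${STEPS}"
}

# Same steps for every sample, which keeps the processing uniform across batches.
for sample in sample1 sample2; do
    trimmomatic_cmd /Users/FASTQ "$sample"
done
```

Once the printed commands look right, the output of the loop can be piped into `bash` to execute them.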

To be honest, it seems to work now because I slightly changed the code. Looking at the QC report again, I appear to have removed the adapter contamination.

At the end of this process, I should use the paired files for the downstream analysis, right?

Thanks,

M


I have re-read the manual again and again. The paired output files are the trimmed FASTQ files containing the read pairs in which both reads (one in each FASTQ file) survived the processing.
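
Since both reads of a pair must survive to end up in the paired files, a simple sanity check before alignment is that the two paired outputs contain the same number of reads. This is a minimal sketch, assuming four-line FASTQ records; `count_reads` and `check_paired_outputs` are hypothetical helper names.

```shell
# Count reads in a gzipped or plain FASTQ file (assumes four lines per record).
count_reads() {
    gzip -cdf < "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Print both counts and succeed only if the two paired outputs match.
check_paired_outputs() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "R1=${n1} R2=${n2}"
    [ "$n1" -eq "$n2" ]
}
```

For the command in this thread that would be `check_paired_outputs /Users/FASTQ/sample1_R1_paired.fastq.gz /Users/FASTQ/sample1_R2_paired.fastq.gz`. Equal counts do not prove the read names are in the same order, but a mismatch is a clear sign that something went wrong.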
