Hello everyone, I have a huge dataset with a bunch of human samples to analyse. Of course, I run into trouble because the samples come from different donors, and when I run a PCA on those samples, well... it's a bit dodgy. They cluster according to their condition, but I am not sure how I am supposed to deal with this batch effect. A while ago I used the SVA package, but I wasn't happy with it.
A related problem is probably that my samples are not trimmed appropriately. I have a lot of problems with the facility that generated these FASTQ files because sometimes they provide me with trimmed samples and sometimes they don't (the whole dataset comes from different batches/years). Thus, my questions:
- Shouldn't all of my samples be processed in exactly the same way (e.g. same SLIDINGWINDOW, LEADING, TRAILING, MINLEN) in order to generate useful data? I am quite confused about this.
- What if, by any chance, I trim an already-trimmed file?
- When I try to trim my samples, I don't manage to remove the adapter contamination. According to my beloved MultiQC report there is a huge Nextera transposase sequence contamination that Trimmomatic can't remove, even when selecting specific adapters...
Yours, M
How was PCA done and how was data normalization/regularization performed?
Isn't clustering by condition expected, as this is the biological difference?
It is very uncommon for facilities to provide adapter-trimmed samples. Do you really mean trimmed, or demultiplexed?
As for the questions 1-3:
As a small addition, generate a FastQC report for each sample before and after trimming. Afterwards, run MultiQC on the reports.
Then you'll see the differences in adapter content, read length, etc.
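A rough sketch of how that before/after comparison could be scripted in Python, assuming `fastqc` and `multiqc` are installed and on the PATH. The helper name `qc_commands`, the file names and the `qc/` output folder are placeholders, not anything from this thread. The function only builds the command lines, so they can be inspected before being run.

```python
def qc_commands(raw_fastqs, trimmed_fastqs, out_root="qc"):
    """Build FastQC and MultiQC command lines (as argument lists).

    Hypothetical helper: file names and the out_root folder are placeholders.
    """
    commands = []
    for label, files in (("raw", raw_fastqs), ("trimmed", trimmed_fastqs)):
        out_dir = f"{out_root}/{label}"
        # FastQC writes one report per input file into the -o directory,
        # which has to exist before FastQC is started.
        commands.append(["fastqc", "-o", out_dir, *files])
    # MultiQC searches the given directory for reports and aggregates them.
    commands.append(["multiqc", out_root, "-o", f"{out_root}/multiqc"])
    return commands


if __name__ == "__main__":
    for cmd in qc_commands(["sample_R1.fastq.gz"], ["sample_R1.trimmed.fastq.gz"]):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the `qc/raw` and `qc/trimmed` folders have been created.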
Thanks ATpoint for your question. I am judging the PCA according to someone else's analysis; I haven't had the chance to get to that point myself yet. By the way, I guess there is very little variation amongst the different samples.
Anyway, I solved the issue but, as you can see below, I am not sure whether I should use the paired or the unpaired reads after trimming.
I have recently used Trimmomatic to remove the Nextera transposase sequence, so it is probably just a matter of providing the correct sequence to use.
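For illustration, a minimal Python sketch that assembles a paired-end Trimmomatic call using the `NexteraPE-PE.fa` adapter file that ships with Trimmomatic. The jar location, the output naming scheme and the helper name are assumptions, and the step settings are simply the example values from the Trimmomatic manual, not a recommendation for this dataset. Building the command in one function also makes it easy to apply identical settings to every sample.

```python
def trimmomatic_pe_command(r1, r2, prefix,
                           adapters="NexteraPE-PE.fa",
                           jar="trimmomatic.jar"):
    """Build a Trimmomatic PE command line as an argument list.

    Hypothetical helper: `jar`, `adapters` (usually a full path into
    Trimmomatic's adapters/ folder) and the output names are placeholders.
    """
    return [
        "java", "-jar", jar, "PE",
        r1, r2,
        # Output order expected by Trimmomatic PE:
        # forward paired, forward unpaired, reverse paired, reverse unpaired.
        f"{prefix}_R1.paired.fastq.gz", f"{prefix}_R1.unpaired.fastq.gz",
        f"{prefix}_R2.paired.fastq.gz", f"{prefix}_R2.unpaired.fastq.gz",
        # ILLUMINACLIP:<adapter fasta>:<seed mismatches>:
        #              <palindrome clip threshold>:<simple clip threshold>
        f"ILLUMINACLIP:{adapters}:2:30:10",
        "LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36",
    ]


if __name__ == "__main__":
    cmd = trimmomatic_pe_command("s1_R1.fastq.gz", "s1_R2.fastq.gz", "s1")
    print(" ".join(cmd))
```

If the adapter content in the report does not drop afterwards, the first things to check are the path to the adapter file and whether it really contains the sequence that FastQC flags.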
Agreed. The standard tools (I use cutadapt) all perform more or less equally well, and if it does not work it is 99.9% of the time a user-induced problem (= wrong commands, wrong adapter sequences provided, etc.).
So, for quality's sake, paired reads may show better reliability for the further steps.
Can anyone confirm this?
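One thing that can be checked directly on the paired output is that both files still list the same reads in the same order, which paired-end aligners generally assume. A minimal sketch in Python, assuming standard 4-line FASTQ records; the function names are made up for this example, and it only compares read IDs, nothing else.

```python
from itertools import zip_longest


def read_ids(lines):
    """Yield the read ID of each 4-line FASTQ record in an iterable of lines."""
    for i, line in enumerate(lines):
        if i % 4 == 0 and line.strip():
            # Drop the leading '@' and anything after the first whitespace.
            name = line[1:].split()[0]
            # Older Illumina headers mark the mates with /1 and /2.
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def mates_in_sync(r1_lines, r2_lines):
    """Return True if both files contain the same read IDs in the same order."""
    pairs = zip_longest(read_ids(r1_lines), read_ids(r2_lines))
    return all(id1 == id2 for id1, id2 in pairs)
```

For gzipped files, the two arguments can be handles opened with `gzip.open(path, "rt")`. The unpaired files hold reads whose mate was dropped during trimming, so they cannot pass this check by construction.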