How I know if it is a batch effect or are real differences between my controls and treatments?
2
2
Entering edit mode
7 months ago

Greetings.

I´m working with transcriptomics data of Aedes aegypti and I have a doubt about the probably bacth effect I´m having with my samples. The controls I´m using belong to an experiment performed the past year with a sequencer from a different city than The treatments that are an experiment performed this year. The idea is to identify differentially expressed genes, comparing the treatments between them and between the controls.

I have this Count Table:

Count table of RNAseq data. S1 to S3 referes to Controls (Insects withouth treatment), L1 to L9 (And P1 to P9) are treatments. L1. L2 and L3 are replicates of a treatment with lethal concentration 25, L4 to L6 is the same treatment with a Lethal concentration 75 but insects who died (Susceptible), L7 to L9 are the insects the resisted Lethal concentration 75 (Resistant) The same occurs to samples from P1 to P9 but with other treatment with different action mechanism Same as for first table

This is my experimental design:enter image description here

When I perform PCA on all my data I get this result: enter image description here

You can clearly see a difference between controls and samples. The thing is that we assumed there is a batch effect between samples and controls because of the huge difference between them and that we know there are from sequencers from different cities. So, I want to know if this is really a batch effect or is it a real difference between my treatmens and controls. If it is a batch effect, how can I be sure?

I´m not well versed on RNA seq, nor statistics. I´m just getting the hang out of it and english is not my first language so i´m sorry for any mistakes.

RNA-seq Transcriptomics ComBat-seq Batch-Effect • 1.8k views
ADD COMMENT
2
Entering edit mode

I'm going to assume "Susceptible" samples are the control, but you have not explained this. My understanding is S1-3 were sequenced together, and L1-9 and P-19 were sequenced together. I have a few questions:

  1. Was the RNA extraction and processing for all samples done at the same time, using the same kits?
  2. Were the sequencing platforms the same?
  3. Do all samples have a similar number of reads?
  4. If you make a presence/absence PCA, do you still see clear separation?

To me, it looks like your susceptible samples have a magnitude more reads than treatments, but this is only from seeing the first 10 transcripts, so I it's just a guess.

ADD REPLY
0
Entering edit mode

Greetings, and thank you for your answer.

Indeed. Susceptible S1-3 are the control. Sorry for not be clear about it.

  1. The process of all the samples were performed like this: -S1-3 were processed at the same time in 2023 with the purpose of being the control of another experiment. As these controls are insects withouth any treatment we thought it could be possible to use it in this new experiment. -L1-9 were proccessed at the same time. The same for P1-9. Processed in 2024.
  2. Yes, sequencing paltforms were the same. Illumina paired end.
  3. No. They do not have the same number of reads. The mean of mapped reads I got was 90 millions in controls while in the treatments is roughly 25 million reads mapped to the reference genome. We thought of this, however, as lab wanted to save the money for the controls they were not sent for sequencing again in 2024.
  4. Do you mean by having removed the controls?

Thank you

ADD REPLY
0
Entering edit mode

I would still try looking at presence/absence PCA, and randomly down-sampling read pairs to the mean number of reads in the treatments and repeating both PCAs.

It's likely you still won't be able to identify if this is real or batch-effect because the batch-effect would be nested in treatment vs controls. But everything I've highlighted is explained in more detail below.

ADD REPLY
8
Entering edit mode
7 months ago

Unfortunately, when you have a batch effect that maps exactly onto an experimental variable, as you have here, it is impossible to disentangle the batch effect from the experimental variable. I wouldn't go as far as to say this is definately a batch effect, rather it is practically, mathematically and philopshophically impossible to work out whether it is a batch effect or an experimental effect.

What can be done about it? There is no software tool that can really help you here. All that somthing like combat-seq can do is tell whether one set of samples is different to an other. Generally it identifies groups of samples that are coherently different from the other samples and are not samples of one condition or another, and removes the difference. You don't want to remove difference between different experimental conditions thought.

One thing you can do is try to make the data as similar as possible. If some samples have significantly more reads than others, subsample the larger files down to be comparable. If some samples are paired-end and some single-end, discard the read2s for the paired-end samples. If some reads are longer than others, trim everything to the length of the shortest reads. Quantify the samples using salmon with the switches --seqBias --gcBias --posBias.

Trouble is, its difficult to assess the success of this, because you actually want your susceptable samples to be distinct in a PCA plot from your other samples.

Probably you will need to do more lab work.

The ideal would be to just repeat the whole thing, but I can understand that that might not be an option. So here are two possibilities:

  • Go ahead with the data you have, select DE genes from your desired contrast. Now select, say, 10 genes at random (random is better than a biased manual selection) from your DE lists, and validate using qPCR. Or altherantively use qPCR to validate any indevidual genes you will hang biological conclusions on.

  • Resequence only part of the experiment. Your problem at the moment is that your batch is completely confounded with your condition. You can break this confound by sequencing a thrid batch that contains samples from both conditions. Say, two susceptable and two non-susceptatble samples. Batch correction techniques will then have something to get their teeth into.

ADD COMMENT
0
Entering edit mode

Thank you very much for your insights. Very well appreaciated.

Is clear to me that I need to break the confounding. My samples do have a bias of sequencing reads. Controls have way more reads than the treatments. Can you elaborate on

subsample the larger files down to be comparable

I think that could be a first step to diminish this bias.

ADD REPLY
0
Entering edit mode

If you are starting from fastq files, then the easiest way to do this is just with head:

So, to get the first 20M reads in a sample you would do:

$ zcat sample1_1.fastq.gz | head -n80000000 | gzip > sample1_1.subsample.fastq.gz
$ zcat sample1_2.fastq.gz | head -n80000000 | gzip > sample1_2.subsample.fastq.gz

Here I use 80000000 lines because each fastq record is 4 lines.

ADD REPLY
0
Entering edit mode

One probably should use a proper random sampling program like reformat.sh from BBMap or seqtk than simply taking initial 20M reads.

ADD REPLY
3
Entering edit mode
7 months ago

(+1 to Ian's answer, read that first) I'd like to point out a limitation that perhaps is under-appreciated in this and many RNAseq experiments, bulk or single-cell.

If you decide to have 2 or 3 replicates per condition you are implicitly making strong assumptions about the variability within and between conditions. Specifically, you assume that variability within condition is so low to be swamped by the between-condition effects. Alternatively, you are in the position to know whether a sample is acceptable for further analyses and you have a good sense of what reliable results should look like. For example, you have marker genes that indicate a given sample is behaving as expected and the output of differential expression can be well explained by background information. If you are willing to make these assumptions, it is unlikely that you will be overly worried by technical batch effects.

An extreme example to make the point: If you assign 3 random people to "control" and 3 to "treatment", a batch effect due to, say, sequencing platform is the least of your concerns. There are good chances that the three controls are all males and the treatments are all females, or three children vs three adults, or... you name it. The DGE machinery will do its job and give you differentially expressed genes, but unless you make the assumptions above the results will be dubious. Otherwise, increase your sample sizes and randomization will break these spurious links.

Incidentally, this is the reason why I'm not too outraged by experiments without replicates. If you are ok with 2 replicates you are already making assumption that also apply to single-replicate experiments. Two replicates enable the mathematical machinery but strong limitations still apply. (Anyway, aren't many scRNAseq experiments done without replication? And doesn't the integration step prevents following steps from accounting for sample variability? Disclaimer: I don't have much hands-on experience with scRNAseq).

It goes without saying that the more samples the better, 2 replicates are better (much better) than 1, and batch effects should be avoided or at least accounted for. However, I feel a bit uneasy when I hear categorical statements such that batch effects or no replication make an experiment useless. The risk with this message is that if you do have 2 or 3 replicates and no (known) batch effect you feel you are all good and go to over-interpret results.

ADD COMMENT

Login before adding your answer.

Traffic: 1366 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6