Hello everyone,
I am an absolute beginner at sequencing analysis and DESeq2, so please forgive the possibly mundane questions. I have tried looking up different methods, but couldn't find a fitting answer yet.
I am currently working with sequencing data from an Illumina sequencer. The data is divided into groups by genotype and timepoint, and I am comparing the groups one to one each time.
My first issue stems from the fact that the samples were run in two batches: Batch 1 was sequenced as single-end 75 bp, while Batch 2 was sequenced as paired-end 150 bp. I manually down-scaled the 150 bp paired-end data with Trimmomatic and BBMap to match my 20 million 75 bp single-end reads before using Salmon to generate the quant files.
Question 1: Will this be an issue later on, and if so, are there better alternatives?
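For reference, here is roughly what that down-scaling does, as an untested R sketch (the file names are placeholders; in practice I did this on the command line with Trimmomatic and BBMap): keep only R1, subsample to 20 million reads, and hard-trim to 75 bp.

library(ShortRead)

set.seed(1)
# Use only R1, which effectively turns the paired-end data into single-end
sampler <- FastqSampler("batch2_sample_R1.fastq.gz", n = 2e7)  # ~20 mio reads
reads <- yield(sampler)
close(sampler)

# Hard-trim to 75 bp to match Batch 1 (assumes the sampled reads are all >= 75 bp)
writeFastq(narrow(reads, start = 1, end = 75), "batch2_sample_SE75.fastq.gz", compress = TRUE)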
My second issue comes from the PCA results I get after running the comparisons with DESeq2 as
ddsTxi <- DESeqDataSetFromTximport(txi, colData = samples_subset, design = ~group)
dds <- DESeq(ddsTxi)   # fit the model; results(dds) is called later for the tables
vsd <- vst(dds, blind = TRUE)
PCA <- plotPCA(vsd, intgroup = "group")
The resulting plot shows a distinct batch effect between the two data sets. My first thought was to change the DESeq2 design to
design = ~group + batch
While the volcano plot, which I also generate from res <- results(dds), does change, the PCA plot is unfazed; presumably the design only enters the statistical testing, not the transformed counts that plotPCA uses. Another option I found was limma's removeBatchEffect(). I run it on the vsd object after DESeq2, and it clearly improves my PCA. The problem is that removeBatchEffect() only affects the vsd object passed to plotPCA; my normalized count table, my TPM tables, and the DESeq2 results output are all unaffected. I expect this to cause problems with downstream GSEA, since the correction never touches the data I actually export.
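For completeness, a minimal sketch of the limma step (I currently call removeBatchEffect() without the design argument; ?removeBatchEffect suggests passing one so the group differences are not removed along with the batch; the output file name is a placeholder):

library(DESeq2)
library(limma)

vsd <- vst(dds, blind = FALSE)
# Tell removeBatchEffect which effects to preserve (see ?removeBatchEffect);
# without design= part of the group signal can be regressed out as well
mm <- model.matrix(~ group, data = as.data.frame(colData(vsd)))
assay(vsd) <- removeBatchEffect(assay(vsd), batch = vsd$batch, design = mm)

# The corrected log2-scale matrix is what I could export, e.g. as GSEA input
write.csv(as.data.frame(assay(vsd)), "vst_batchCorrected.csv")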
Question 2: How can I apply the batch correction to my data so that I get consistent results downstream, i.e. corrected TPM, normalized-count, and DESeq2 result tables?
Thank you everyone for your time
Relevant Code:
txi <- tximport(files, type = "salmon", ignoreTxVersion = TRUE, tx2gene = tx2gene)
all_TPM <- txi$abundance
colnames(all_TPM) <- samples$tag
all_counts <- txi$counts
colnames(all_counts) <- samples$tag
all_Log2TPM <- log2(all_TPM+1)
ddsTxi <- DESeqDataSetFromTximport(txi, colData = samples_subset, design = ~group + batch)
ddsTxi$group <- factor(ddsTxi$group, levels = c(groupA, groupB))
dds <- DESeq(ddsTxi)
res <- results(dds)
vsd <- vst(dds, blind = FALSE)
assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$batch)
PCA <- plotPCA(vsd, intgroup=c('group'))
[...] #Setting up all stuff for PCA output
volcano <- EnhancedVolcano(res, lab = rownames(res), x = 'log2FoldChange', y = 'pvalue')
[...] #Setting up stuff for Volcano output
normCounts <- as.data.frame(counts(dds, normalized = TRUE))
[...] #Output normCount csv
resdf <- as.data.frame(res)
resdf$padj[is.na(resdf$padj)] <- 1
[...] #Output DESeq results csv
Is the batch only the sequencing mode, or were the libraries prepared at different timepoints/in different batches? If it's only the sequencing itself, then treating the paired-end samples as single-end (using only R1) and trimming everything to the same read length should eliminate any batch effect.
Sadly, the second batch is an entirely new library prep, as the samples were mismatched and had to be fully redone. So trimming the reads and treating them as single-end does not eliminate the batch effect, it only removes one layer of it.