Hello,
I have two RNA-seq data generated from Illumina Novaseq (same experimental design but different depth, 25M and 15M reads/sample for Run1 and Run2 respectively).
The dateset look like this:
Samples Condition Run
Sample_1 A R1
Sample_2 B R1
Sample_3 A R1
Sample_4 B R1
Sample_5 A R1
Sample_6 B R1
Sample_7 A R2
Sample_8 B R2
I want to do DE analysis using DESeq2. Since I have to analyze these samples together, I set a $run factor in my colData, and I try to use collapseReplicates() function to "collapse my technical replicates".
dds<-DESeqDataSetFromMatrix(count,coldata,design=~condition)
ddsColl <- collapseReplicates(dds, dds$sample, dds$run)
However, after merging these two dataset (R1&R2) by geneID, there are NAs in my count matrix due to different sequencing. For example, sample_5 & sample_6 are from run1 and sample_7 & sample_8 are from run 2:
gene.id sample_5 Sample_6 Sample_7 Sample_8
gene_1 2 6 2 0
gene_2 3 0 0 0
gene_3 2 3 NA NA
gene_4 NA NA 1 2
My question is: What should I do with these NAs? Is it appropriate to convert NAs into 0 (considering I will perform collapseReplicates() function)?
Please correct me if I am wrong in any point. Many thanks, Nicole
~run+condition
and set these NAs to zeros, how did these NAs even come up, that should not happen. What is the pipeline? Would be better to take the raw data from all samples and process 100% identically towards the quantification.Hi ATpoint, The NAs came from different seq depth, because only about 60% gene.id were both detected in these two dataset (I merged the two read count matrix).
That must be zeros then, not NAs. No, don't merge fastq files. Process data 100% identically towards quantification, and then model batch as covariate as in my comment. If this gives worse results compared to leaving out that additional run then consider to simply drop it. See how many DEGs you get and how the PCA looks, see e.g. DESeq2 vignette for code suggestions.
Thank you! I will try
Or should I merge the fastq data instead of read count matrix?
I'm not sure whether it is necessary to collapse your replicates, does these replicates are technical? Instead, I suggest you to perform a PCA to inspect the scattering of your samples and the effect of the sequencing run. You could use the
DESeq2::plotPCA()
function or employ thePCAtools
package.Thanks for your reply. I think these two RNA-seq runs are technical replication, I will use PCA plot to see whether there is a strong batch effect. But I am still wondering what should I do with these NAs after I merged these two reads matrix.