Combining 2 different depth RNA-seq data (DESeq2)
0
0
Entering edit mode
3.3 years ago

Hello,

I have two RNA-seq data generated from Illumina Novaseq (same experimental design but different depth, 25M and 15M reads/sample for Run1 and Run2 respectively).

The dateset look like this:

Samples         Condition        Run
Sample_1            A             R1
Sample_2            B             R1
Sample_3            A             R1
Sample_4            B             R1
Sample_5            A             R1
Sample_6            B             R1
Sample_7            A             R2
Sample_8            B             R2

I want to do DE analysis using DESeq2. Since I have to analyze these samples together, I set a $run factor in my colData, and I try to use collapseReplicates() function to "collapse my technical replicates".

dds<-DESeqDataSetFromMatrix(count,coldata,design=~condition)
ddsColl <- collapseReplicates(dds, dds$sample, dds$run)

However, after merging these two dataset (R1&R2) by geneID, there are NAs in my count matrix due to different sequencing. For example, sample_5 & sample_6 are from run1 and sample_7 & sample_8 are from run 2:

gene.id         sample_5   Sample_6    Sample_7   Sample_8
gene_1            2             6         2          0
gene_2            3             0         0          0
gene_3            2             3         NA         NA 
gene_4            NA            NA        1          2

My question is: What should I do with these NAs? Is it appropriate to convert NAs into 0 (considering I will perform collapseReplicates() function)?

Please correct me if I am wrong in any point. Many thanks, Nicole

RNA-seq DESeq2 • 2.6k views
ADD COMMENT
1
Entering edit mode

~run+condition and set these NAs to zeros, how did these NAs even come up, that should not happen. What is the pipeline? Would be better to take the raw data from all samples and process 100% identically towards the quantification.

ADD REPLY
0
Entering edit mode

Hi ATpoint, The NAs came from different seq depth, because only about 60% gene.id were both detected in these two dataset (I merged the two read count matrix).

ADD REPLY
1
Entering edit mode

That must be zeros then, not NAs. No, don't merge fastq files. Process data 100% identically towards quantification, and then model batch as covariate as in my comment. If this gives worse results compared to leaving out that additional run then consider to simply drop it. See how many DEGs you get and how the PCA looks, see e.g. DESeq2 vignette for code suggestions.

ADD REPLY
0
Entering edit mode

Thank you! I will try

ADD REPLY
0
Entering edit mode

Or should I merge the fastq data instead of read count matrix?

ADD REPLY
0
Entering edit mode

I'm not sure whether it is necessary to collapse your replicates, does these replicates are technical? Instead, I suggest you to perform a PCA to inspect the scattering of your samples and the effect of the sequencing run. You could use the DESeq2::plotPCA() function or employ the PCAtools package.

ADD REPLY
0
Entering edit mode

Thanks for your reply. I think these two RNA-seq runs are technical replication, I will use PCA plot to see whether there is a strong batch effect. But I am still wondering what should I do with these NAs after I merged these two reads matrix.

ADD REPLY

Login before adding your answer.

Traffic: 1634 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6