Batch correction for Nanopore RNAseq
Nico80 ▴ 80 · 17 months ago

I have 3 RNAseq experiments that were sequenced on a MinION and I am trying to do some differential expression analysis with them.

This is my pipeline:

  1. Basecalling with Guppy
  2. Cleaning using chopper
  3. Aligning to the genome (mouse) with minimap2
  4. Counting using featureCounts

At this point I wanted to import my counts table into R and run DESeq2 for DE analysis... however, when doing a PCA of the samples I get this:

[PCA by group]

This seems to indicate no major difference between treatments. While this might well be the case, we have done this experiment many times before and expect some very obvious differences (also, we have done short-read RNAseq and we do see those differences there).

I then did a PCA by replicate, and this is what I saw

[PCA by replicate]

Now there seems to be quite a substantial batch effect, so I am wondering if anyone has suggestions on how to proceed. Would something like ComBat-seq be a good option?
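In case it is useful, this is roughly what I had in mind; the file name, sample sheet and column names below are made up, just to illustrate the two options I am considering:

library(DESeq2)
library(sva)

# featureCounts table: one comment line, then Geneid/Chr/Start/End/Strand/Length
# followed by one count column per sample
fc <- read.delim("counts.txt", comment.char = "#", row.names = 1)
cts <- as.matrix(fc[, 6:ncol(fc)])     # drop the 5 annotation columns (Geneid is the rowname)

# made-up sample sheet: one row per library, with treatment and sequencing batch
coldata <- data.frame(
  row.names = colnames(cts),
  condition = factor(c("ctrl", "treat", "ctrl", "treat", "ctrl", "treat")),
  batch     = factor(c("rep1", "rep1", "rep2", "rep2", "rep3", "rep3"))
)

# Option 1: keep the raw counts and model the batch in the DESeq2 design
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata,
                              design = ~ batch + condition)
dds <- DESeq(dds)
res <- results(dds)

# Option 2: pre-correct the raw counts with ComBat-seq, then run DESeq2 on them
adjusted <- ComBat_seq(cts, batch = coldata$batch, group = coldata$condition)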

nanopore batch-correction

Any ideas as to why this may be the case from the experimental side of things? Were the samples prepared differently? Were they all sequenced with the same kind of flow cell/pore/chemistry, and on the same or different MinIONs?


Samples were prepared in the same way, but on different days. They were sequenced on the same MinION but with different flow cells (for some reason we can't understand, we cannot recover pores after washing them).

Were the flow cells using the same pore version?

I would recommend barcoding and running one library on multiple flow cells next time.

Yes, same pore version.

We barcoded the samples and ran each experiment on a different flow cell. Are you suggesting splitting all replicates across all flow cells instead?

LauferVA 4.5k · 17 months ago

Nico - PCA is a good start. What you are figuring out with a PCA plot is, roughly, whether there are clear differences among samples that might complicate an analysis, and also whether any samples look anomalous.

In both cases, what you need to determine in order to proceed is this: what technical or biological phenomenon is producing the difference?

One way to do this is to extensively annotate your data, then correlate your PC loadings against all the other potential covariates you have in your study. Consider the following code:

library(DESeq2)        # assay(), colData(), metadata()
library(matrixStats)   # rowVars()
library(dplyr)         # %>%, mutate_if()
library(tidyr)         # separate()

# Run a PCA on the top-variance genes and append the per-sample scores for every
# PC explaining > 1% of the variance to colData(DESeq_Object).
AddPcLoadingsToDESeqColData <- function(DESeq_Object, ntop = 1000) {
  rv <- rowVars(assay(DESeq_Object))
  select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]  # the ntop genes by variance
  pca <- prcomp(t(assay(DESeq_Object)[select, ]))                         # PCA on the selected genes
  percentVar <- pca$sdev^2 / sum(pca$sdev^2)                              # variance explained per component
  PC_df_info <- data.frame(PC1 = pca$x[, 1])
  i <- 2
  while (i <= length(percentVar) && percentVar[i] > 0.01) {               # keep PCs explaining > 1% of the variance
    PC_df_info[[paste0("PC", i)]] <- pca$x[, i]
    i <- i + 1
  }
  colData(DESeq_Object) <- cbind(colData(DESeq_Object), PC_df_info)
  metadata(DESeq_Object)$percentVar <- percentVar
  return(DESeq_Object)
}

# Compute Pearson's r for every pair of columns in a data frame (characters and
# factors are coerced to numeric codes) and return the pairs with |r| > 0.05,
# sorted by absolute correlation.
AllColumnCorrelations <- function(InputDF) {
  InputDF <- InputDF %>% mutate_if(is.character, as.factor)
  InputDF <- InputDF %>% mutate_if(is.factor, as.numeric)
  InputDF <- InputDF %>% mutate_if(is.integer, as.numeric)
  r_values <- combn(seq_along(InputDF), 2,
                    function(x) cor(InputDF[[x[1]]], InputDF[[x[2]]],
                                    method = "pearson", use = "complete.obs"))
  cov_names <- combn(names(InputDF), 2, paste0, collapse = "-")
  OutputDF <- data.frame(cov_names, r_values)
  OutputDF <- separate(data = OutputDF, col = cov_names, into = c("Cov1", "Cov2"), sep = "-")
  OutputDF$r_values <- round(OutputDF$r_values, 4)
  OutputDF <- OutputDF[order(-abs(OutputDF$r_values)), ]
  OutputDF <- OutputDF[abs(OutputDF$r_values) > 0.05, ]
  return(OutputDF)
}

Given a data frame such as colData(dds), the second function computes Pearson's r for all pairs of columns. While this is not appropriate in all cases, it is a good sanity check to investigate whether the covariates you intend to include in your model are collinear.

# first add the PCs to the metadata:
dds <- AddPcLoadingsToDESeqColData(dds)
# then scan for correlations between all pairs of covariates in colData(dds):
CollinearCovScan <- AllColumnCorrelations(as.data.frame(colData(dds)))
CollinearCovScan

This returns a data frame formatted as follows:

CollinearCovScan[grepl( 'COX', paste(CollinearCovScan$Cov1, ' - ', CollinearCovScan$Cov2),ignore.case=TRUE),]
                                    Cov1                Cov2 r_values
2175                        COX4I1RawCts        COX4I1Status  -0.7720
2174                        COX4I1RawCts       COX4I1NormCts   0.7062
2141                          sizeFactor        COX4I1RawCts   0.6737
2206                       COX4I1NormCts        COX4I1Status  -0.6067
2143                          sizeFactor        COX4I1Status  -0.5464
2208                       COX4I1NormCts                 PC2  -0.4897
2237                        COX4I1Status                 PC1  -0.4292
778                   tumor_pair_barcode        COX4I1RawCts   0.4118

Since you are attempting to get information out of the PCA, in your case you would be looking at whether the loadings for a given PC correlate significantly with a variable in your metadata.
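For a single pair you can also check this directly; a quick sketch, assuming your replicate/batch column in colData(dds) is called batch (adjust the name to your metadata):

# does PC1 separate the sequencing batches?
md <- as.data.frame(colData(dds))
cor.test(md$PC1, as.numeric(as.factor(md$batch)))   # batch is a hypothetical column name
boxplot(PC1 ~ batch, data = md)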

If they do, the approach depends on how much covariance they share, how important the variable is to your model, and so on. If including the PC in the model allows you to control variance that you suspect reflects a real (batch or other) effect that you otherwise would not be able to partial out, you could include it as a covariate. Alternatively, suppose you have two groups of PC1 loadings, low and high. If there is a problem with using that PC in the model, you might instead decide to stratify your model and do hypothesis testing on the two groups of samples separately.
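To make that concrete, here is a minimal sketch of both routes; it assumes the PC columns added by the function above are in colData(dds) and that your treatment variable is called group (again, adjust the names to your metadata):

# Route 1: include the suspect PC (or the known batch variable) as a covariate
design(dds) <- ~ PC1 + group
dds <- DESeq(dds)
res <- results(dds)                     # defaults to the last variable in the design (group)

# Route 2: stratify on PC1 and test within each stratum separately
keep_low <- colData(dds)$PC1 < median(colData(dds)$PC1)

dds_low <- dds[, keep_low]
design(dds_low) <- ~ group              # drop the PC once you have stratified on it
res_low <- results(DESeq(dds_low))

dds_high <- dds[, !keep_low]
design(dds_high) <- ~ group
res_high <- results(DESeq(dds_high))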

Beyond that, for datasets that you suspect are still behaving badly for one reason or another, I recommend trying something like GSVA. Note that non-trivial effort has been devoted to deriving latent variables that may account for batch effects; see, for instance, the GSVA package.

Finally, for your specific case, the problem is that the point belonging to group C is just WAY out there. Not only is it not near the other observations, it is not even near the other members of group C. As such, I would inspect it via distance matrices, and in other ways, to see whether it conforms to your expectations for this type of data (setting its PCs aside). If it is otherwise OK, perhaps you keep it. If, by contrast, singling out this sample helps you remember that it was actually the sample that someone may have contaminated, then you have a better reason to exclude it...
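As a quick way of looking at that, something along these lines; vst() and pheatmap are just one choice of transformation and visualisation, and the sample name is a placeholder:

library(DESeq2)
library(pheatmap)

vsd <- vst(dds, blind = TRUE)            # variance-stabilise before computing distances
sampleDists <- as.matrix(dist(t(assay(vsd))))

# how far is the suspect sample from every other sample?
sort(sampleDists["suspect_sample", ])    # placeholder name - use the actual sample ID

# and the full sample-to-sample distance matrix as a heatmap
pheatmap(sampleDists)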


Thanks for the answer, I'll definitely have a look at your suggestions and see if I can get something out of this.


Great! I want to add that you can probably get most of the practical value out of the very last paragraph.

Other than the blue dot in the upper right corner, all of the other samples have PC1 and PC2 loadings that approximate a line, which is commonly found when your data reflect some kind of spectrum, gradient, ongoing process, etc.

So I'd focus on that one sample. That should be pretty tractable to do.
