Consistency of Seurat SCTransform across computers/environments

4.9 years ago

Bastien Hervé 6.4k

Due to the Covid-19 situation I am working part time at the lab and part time at home. At the lab I work on an HP workstation running Linux, and at home on an MSI machine running Windows 10 with RStudio.

As input to the SCTransform function I use the same RDS object, same parameters, same R version, same Seurat version, but I get slightly different output results which lead to different UMAPs.

I have set SCTransform's seed.use parameter to the same value in both environments.

From the lab :

library(Seurat)
SCRNARR.list <- readRDS(file = "/mnt/raid1/Data/SCRNARR/SCRNARR_BeforeSCT.rds")
object.size(SCRNARR.list)
944243832 bytes
for (sample in 1:length(SCRNARR.list)) {
    SCRNARR.list[[sample]] <- SCTransform(SCRNARR.list[[sample]], seed.use=1447854, variable.features.n = 2000, vars.to.regress = "percent.mt", verbose = FALSE)
}
There were 50 or more warnings (use warnings() to see the first 50)
object.size(SCRNARR.list)
2131213616 bytes

From home :

library(Seurat)
SCRNARR.list <- readRDS(file = "D://KI/VM/Data/SCRNARR/SCRNARR_BeforeSCT.rds")
object.size(SCRNARR.list)
944243832 bytes
for (sample in 1:length(SCRNARR.list)) {
    SCRNARR.list[[sample]] <- SCTransform(SCRNARR.list[[sample]], seed.use=1447854, variable.features.n = 2000, vars.to.regress = "percent.mt", verbose = FALSE)
}
There were 50 or more warnings (use warnings() to see the first 50)
object.size(SCRNARR.list)
2131212976 bytes

I traced back to the first point where the two environments diverge: SCTransform is the first function whose output differs in object size, which then leads to different UMAP layouts.

The issue does not come from the randomness of UMAP itself either: at home I always get the same UMAP, and at the office I always get the same UMAP, but the office UMAPs and the home UMAPs always differ from each other.

Could it be related to the computers themselves, such as how each one handles floating-point arithmetic?
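To make the floating-point question concrete, here is a minimal base-R sketch showing that addition of doubles is not associative. This is only an illustration of the general mechanism, not a diagnosis of SCTransform: if two platforms (compiler, math library, BLAS) group the same operations differently, the last bits of a result may differ even with identical inputs.

```r
# Floating-point addition is not associative: regrouping changes the last bits.
a <- (0.1 + 0.2) + 0.3
b <- 0.1 + (0.2 + 0.3)

print(identical(a, b))            # exact, bit-for-bit comparison
print(isTRUE(all.equal(a, b)))    # comparison within a small tolerance
print(sprintf("%.17g", c(a, b)))  # both values at full double precision
```

The two results agree within tolerance but are not bit-identical, which is the kind of discrepancy that can later be amplified by iterative fitting or by ranking steps.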

Thanks !

Specifications :

From the lab :

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=sv_SE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=sv_SE.UTF-8    LC_MESSAGES=en_US.UTF-8  
 [7] LC_PAPER=sv_SE.UTF-8       LC_NAME=C                
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] Seurat_3.2.0

loaded via a namespace (and not attached):
 [1] httr_1.4.2            tidyr_1.1.1           jsonlite_1.7.0      
 [4] viridisLite_0.3.0     splines_4.0.2         leiden_0.3.3        
 [7] shiny_1.5.0           ggrepel_0.8.2         globals_0.12.5      
[10] pillar_1.4.6          lattice_0.20-41       glue_1.4.1          
[13] reticulate_1.16       digest_0.6.25         polyclip_1.10-0      
[16] RColorBrewer_1.1-2    promises_1.1.1        colorspace_1.4-1    
[19] cowplot_1.0.0         htmltools_0.5.0       httpuv_1.5.4        
[22] Matrix_1.2-18         plyr_1.8.6            pkgconfig_2.0.3      
[25] listenv_0.8.0         purrr_0.3.4           xtable_1.8-4        
[28] patchwork_1.0.1       scales_1.1.1          RANN_2.6.1          
[31] tensor_1.5            later_1.1.0.1         Rtsne_0.15          
[34] spatstat.utils_1.17-0 tibble_3.0.3          mgcv_1.8-31          
[37] generics_0.0.2        ggplot2_3.3.2         ellipsis_0.3.1      
[40] ROCR_1.0-11           pbapply_1.4-3         lazyeval_0.2.2      
[43] deldir_0.1-28         survival_3.1-12       magrittr_1.5        
[46] crayon_1.3.4          mime_0.9              future_1.18.0        
[49] nlme_3.1-147          MASS_7.3-51.6         ica_1.0-2            
[52] tools_4.0.2           fitdistrplus_1.1-1    data.table_1.13.0    
[55] lifecycle_0.2.0       stringr_1.4.0         plotly_4.9.2.1      
[58] munsell_0.5.0         cluster_2.1.0         irlba_2.3.3          
[61] compiler_4.0.2        rsvd_1.0.3            rlang_0.4.7          
[64] grid_4.0.2            ggridges_0.5.2        goftest_1.2-2        
[67] RcppAnnoy_0.0.16      rappdirs_0.3.1        htmlwidgets_1.5.1    
[70] igraph_1.2.5          miniUI_0.1.1.1        gtable_0.3.0        
[73] codetools_0.2-16      abind_1.4-5           reshape2_1.4.4      
[76] R6_2.4.1              gridExtra_2.3         zoo_1.8-8            
[79] dplyr_1.0.2           uwot_0.1.8            fastmap_1.0.1        
[82] future.apply_1.6.0    KernSmooth_2.23-17    ape_5.4-1            
[85] spatstat.data_1.4-3   stringi_1.4.6         spatstat_1.64-1      
[88] parallel_4.0.2        Rcpp_1.0.5            rpart_4.1-15        
[91] vctrs_0.3.2           sctransform_0.2.1     png_0.1-7            
[94] tidyselect_1.1.0      lmtest_0.9-37

From home :

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C                   LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] Seurat_3.2.0

loaded via a namespace (and not attached):
 [1] httr_1.4.2            tidyr_1.1.2           jsonlite_1.7.0        viridisLite_0.3.0    
 [5] splines_4.0.2         leiden_0.3.3          shiny_1.5.0           ggrepel_0.8.2        
 [9] globals_0.12.5        pillar_1.4.6          lattice_0.20-41       glue_1.4.2          
[13] reticulate_1.16       digest_0.6.25         polyclip_1.10-0       RColorBrewer_1.1-2  
[17] promises_1.1.1        colorspace_1.4-1      cowplot_1.0.0         htmltools_0.5.0      
[21] httpuv_1.5.4          Matrix_1.2-18         plyr_1.8.6            pkgconfig_2.0.3      
[25] listenv_0.8.0         purrr_0.3.4           xtable_1.8-4          patchwork_1.0.1      
[29] scales_1.1.1          RANN_2.6.1            tensor_1.5            later_1.1.0.1        
[33] Rtsne_0.15            spatstat.utils_1.17-0 tibble_3.0.3          mgcv_1.8-31          
[37] generics_0.0.2        ggplot2_3.3.2         ellipsis_0.3.1        ROCR_1.0-11          
[41] pbapply_1.4-3         lazyeval_0.2.2        deldir_0.1-28         survival_3.1-12      
[45] magrittr_1.5          crayon_1.3.4          mime_0.9              future_1.18.0        
[49] nlme_3.1-148          MASS_7.3-51.6         ica_1.0-2             tools_4.0.2          
[53] fitdistrplus_1.1-1    data.table_1.13.0     lifecycle_0.2.0       stringr_1.4.0        
[57] plotly_4.9.2.1        munsell_0.5.0         cluster_2.1.0         irlba_2.3.3          
[61] compiler_4.0.2        rsvd_1.0.3            rlang_0.4.7           grid_4.0.2          
[65] ggridges_0.5.2        rstudioapi_0.11       goftest_1.2-2         RcppAnnoy_0.0.16    
[69] rappdirs_0.3.1        htmlwidgets_1.5.1     igraph_1.2.5          miniUI_0.1.1.1      
[73] gtable_0.3.0          codetools_0.2-16      abind_1.4-5           reshape2_1.4.4      
[77] R6_2.4.1              gridExtra_2.3         zoo_1.8-8             dplyr_1.0.2          
[81] uwot_0.1.8            fastmap_1.0.1         future.apply_1.6.0    KernSmooth_2.23-17  
[85] ape_5.4-1             spatstat.data_1.4-3   stringi_1.4.6         spatstat_1.64-1      
[89] parallel_4.0.2        Rcpp_1.0.5            rpart_4.1-15          vctrs_0.3.2          
[93] sctransform_0.2.1     png_0.1-7             tidyselect_1.1.0      lmtest_0.9-37
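
The two listings above are not quite identical: tidyr, glue and nlme differ by a patch version, rstudioapi is only loaded at home, and the lab listing names the system reference BLAS/LAPACK libraries while the Windows listing shows none. Whether any of this matters here is unknown. A minimal sketch for spotting such differences programmatically, assuming the versions have been collected on each machine as named character vectors (the helper names `loaded_versions` and `diff_versions` are hypothetical):

```r
# Collect "package -> version" for everything loaded in the current session.
loaded_versions <- function(si = sessionInfo()) {
  pkgs <- c(si$otherPkgs, si$loadedOnly)
  vapply(pkgs, function(p) p$Version, character(1))
}

# Report shared packages whose versions differ between two named vectors.
diff_versions <- function(a, b) {
  shared <- intersect(names(a), names(b))
  changed <- shared[a[shared] != b[shared]]
  data.frame(package = changed,
             a = unname(a[changed]),
             b = unname(b[changed]),
             stringsAsFactors = FALSE)
}

# Toy input: a handful of versions copied from the two listings above.
lab  <- c(tidyr = "1.1.1", glue = "1.4.1", nlme = "3.1-147", sctransform = "0.2.1")
home <- c(tidyr = "1.1.2", glue = "1.4.2", nlme = "3.1-148", sctransform = "0.2.1")

res <- diff_versions(lab, home)
print(res)
```

On real machines one could save `loaded_versions()` to a file on each side and compare the two files with `diff_versions()`.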

Seurat • 4.7k views

ADD COMMENT • link 4.9 years ago by Bastien Hervé 6.4k

4.9 years ago

Bastien Hervé 6.4k

After one week on this matter I think it is time to know when to stop digging. Here is how far I got:

	options(digits=12)

	library(Seurat)
	library(sctransform)

	SCRNARR.list.HOME <- readRDS(file = "D://KI/VM/Data/SCRNARR/SCRNARR_JordaoObj_BeforeSCT.rds")
	SCRNARR.list.HOME3 <- SCRNARR.list.HOME[[3]]

	> head(SCRNARR.list.HOME3@meta.data,10)
	orig.ident nCount_RNA nFeature_RNA Tissu Number Plate Sample Replicate Age cellType percent.mt
	CRC-Ctrl_AAACCCACAGATACTC-1 CRC-Ctrl 4002 1764 SC 1 1 Ctrl 1 Unknown Unknown 10.56971514243
	CRC-Ctrl_AAACGAAAGAGACAAG-1 CRC-Ctrl 2597 1109 SC 1 1 Ctrl 1 Unknown Unknown 17.32768579130
	CRC-Ctrl_AAACGAAGTCAGGAGT-1 CRC-Ctrl 6553 2394 SC 1 1 Ctrl 1 Unknown Unknown 5.67678925683
	CRC-Ctrl_AAACGAATCTTCGGTC-1 CRC-Ctrl 11573 3499 SC 1 1 Ctrl 1 Unknown Unknown 11.40585846367
	CRC-Ctrl_AAACGCTTCAAACGTC-1 CRC-Ctrl 9235 2952 SC 1 1 Ctrl 1 Unknown Unknown 7.10341093665
	CRC-Ctrl_AAACGCTTCACCTGTC-1 CRC-Ctrl 8282 2766 SC 1 1 Ctrl 1 Unknown Unknown 6.18208162280
	CRC-Ctrl_AAAGGATAGTTACTCG-1 CRC-Ctrl 4381 1543 SC 1 1 Ctrl 1 Unknown Unknown 22.04976032869
	CRC-Ctrl_AAAGGATGTACTAACC-1 CRC-Ctrl 1089 659 SC 1 1 Ctrl 1 Unknown Unknown 3.94857667585
	CRC-Ctrl_AAAGGGCAGGGCCTCT-1 CRC-Ctrl 2890 1187 SC 1 1 Ctrl 1 Unknown Unknown 20.31141868512
	CRC-Ctrl_AAAGGGCCAGCAAGAC-1 CRC-Ctrl 10626 3093 SC 1 1 Ctrl 1 Unknown Unknown 7.61340109166

	HOME3 <- SCTransform(SCRNARR.list.HOME3, seed.use=1447854, variable.features.n = 2000, vars.to.regress = "percent.mt", verbose = TRUE)
	Warning in vst(umi = new("dgCMatrix", i = c(0L, 6L, 8L, 10L, 11L, 14L, 19L, :
	The 'show_progress' argument is deprecated as of v0.3. Use 'verbosity' instead.
	Calculating cell attributes from input UMI matrix: log_umi
	Variance stabilizing transformation of count matrix of size 14261 by 1415
	Model formula is y ~ log_umi
	Get Negative Binomial regression parameters per gene
	Using 2000 genes, 1415 cells
	|==============================================================================================================================| 100%
	Found 109 outliers - those will be ignored in fitting/regularization step

	Second step: Get residuals using fitted parameters for 14261 genes
	|==============================================================================================================================| 100%
	Computing corrected count matrix for 14261 genes
	|==============================================================================================================================| 100%
	Calculating gene attributes
	Wall clock passed: Time difference of 27.5540840626 secs
	Determine variable features
	Set 2000 variable features
	Place corrected count matrix in counts slot
	Regressing out percent.mt
	|==============================================================================================================================| 100%
	Centering data matrix
	|==============================================================================================================================| 100%
	Set default assay to SCT
	There were 50 or more warnings (use warnings() to see the first 50)


	> head(HOME3@assays$SCT@meta.features,10)
	sct.detection_rate sct.gmean sct.variance sct.residual_mean sct.residual_variance sct.variable
	Xkr4 0.04664310954064 0.03404338227947 0.0527136509714 0.00166863375169 0.895493795854 FALSE
	Sox17 0.00424028268551 0.00466047388371 0.0246680094562 0.03261793545970 3.078222991350 TRUE
	Mrpl15 0.21696113074205 0.18299618913749 0.3028303537067 -0.02188086140187 0.840942518347 FALSE
	Lypla1 0.09187279151943 0.07007620591838 0.1174974135475 -0.00329173686412 0.961279361659 FALSE
	Tcea1 0.42332155477032 0.49515519095694 1.2785102033676 -0.03470735655467 0.763613832523 FALSE
	Rgs20 0.01201413427562 0.00836233390640 0.0118781893333 0.03035045264758 1.415116394148 FALSE
	Atp6v1h 0.25017667844523 0.22202650479581 0.3996101578861 -0.01523608574277 0.888377211287 FALSE
	Rb1cc1 0.30530035335689 0.31316853317088 0.8240562572158 0.02399767955136 1.155274022331 FALSE
	4732440D04Rik 0.06501766784452 0.04901229712166 0.0824945896912 0.02841303554101 1.362626707355 FALSE
	St18 0.35971731448763 0.40934610049877 1.1449113109191 0.05118676291799 1.321781313407 FALSE

	options(digits=12)

	library(Seurat)
	library(sctransform)

	SCRNARR.list.LAB <- readRDS(file = "/mnt/raid1/Data/SCRNARR/SCRNARR_JordaoObj_BeforeSCT.rds")
	SCRNARR.list.LAB3 <- SCRNARR.list.LAB[[3]]

	head(SCRNARR.list.LAB3@meta.data,10)

	LAB3 <- SCTransform(SCRNARR.list.LAB3, seed.use=1447854, variable.features.n = 2000, vars.to.regress = "percent.mt", verbose = TRUE)


	> head(SCRNARR.list.LAB3@meta.data,10)
	orig.ident nCount_RNA nFeature_RNA Tissu Number
	CRC-Ctrl_AAACCCACAGATACTC-1 CRC-Ctrl 4002 1764 SC 1
	CRC-Ctrl_AAACGAAAGAGACAAG-1 CRC-Ctrl 2597 1109 SC 1
	CRC-Ctrl_AAACGAAGTCAGGAGT-1 CRC-Ctrl 6553 2394 SC 1
	CRC-Ctrl_AAACGAATCTTCGGTC-1 CRC-Ctrl 11573 3499 SC 1
	CRC-Ctrl_AAACGCTTCAAACGTC-1 CRC-Ctrl 9235 2952 SC 1
	CRC-Ctrl_AAACGCTTCACCTGTC-1 CRC-Ctrl 8282 2766 SC 1
	CRC-Ctrl_AAAGGATAGTTACTCG-1 CRC-Ctrl 4381 1543 SC 1
	CRC-Ctrl_AAAGGATGTACTAACC-1 CRC-Ctrl 1089 659 SC 1
	CRC-Ctrl_AAAGGGCAGGGCCTCT-1 CRC-Ctrl 2890 1187 SC 1
	CRC-Ctrl_AAAGGGCCAGCAAGAC-1 CRC-Ctrl 10626 3093 SC 1
	Plate Sample Replicate Age cellType
	CRC-Ctrl_AAACCCACAGATACTC-1 1 Ctrl 1 Unknown Unknown
	CRC-Ctrl_AAACGAAAGAGACAAG-1 1 Ctrl 1 Unknown Unknown
	CRC-Ctrl_AAACGAAGTCAGGAGT-1 1 Ctrl 1 Unknown Unknown
	CRC-Ctrl_AAACGAATCTTCGGTC-1 1 Ctrl 1 Unknown Unknown
	CRC-Ctrl_AAACGCTTCAAACGTC-1 1 Ctrl 1 Unknown Unknown
	CRC-Ctrl_AAACGCTTCACCTGTC-1 1 Ctrl 1 Unknown Unknown
	CRC-Ctrl_AAAGGATAGTTACTCG-1 1 Ctrl 1 Unknown Unknown
	CRC-Ctrl_AAAGGATGTACTAACC-1 1 Ctrl 1 Unknown Unknown
	CRC-Ctrl_AAAGGGCAGGGCCTCT-1 1 Ctrl 1 Unknown Unknown
	CRC-Ctrl_AAAGGGCCAGCAAGAC-1 1 Ctrl 1 Unknown Unknown
	percent.mt
	CRC-Ctrl_AAACCCACAGATACTC-1 10.56971514243
	CRC-Ctrl_AAACGAAAGAGACAAG-1 17.32768579130
	CRC-Ctrl_AAACGAAGTCAGGAGT-1 5.67678925683
	CRC-Ctrl_AAACGAATCTTCGGTC-1 11.40585846367
	CRC-Ctrl_AAACGCTTCAAACGTC-1 7.10341093665
	CRC-Ctrl_AAACGCTTCACCTGTC-1 6.18208162280
	CRC-Ctrl_AAAGGATAGTTACTCG-1 22.04976032869
	CRC-Ctrl_AAAGGATGTACTAACC-1 3.94857667585
	CRC-Ctrl_AAAGGGCAGGGCCTCT-1 20.31141868512
	CRC-Ctrl_AAAGGGCCAGCAAGAC-1 7.61340109166

	Warning in vst(umi = new("dgCMatrix", i = c(0L, 6L, 8L, 10L, 11L, 14L, 19L, :
	The 'show_progress' argument is deprecated as of v0.3. Use 'verbosity' instead.
	Calculating cell attributes from input UMI matrix: log_umi
	Variance stabilizing transformation of count matrix of size 14261 by 1415
	Model formula is y ~ log_umi
	Get Negative Binomial regression parameters per gene
	Using 2000 genes, 1415 cells
	|======================================================================| 100%
	Found 109 outliers - those will be ignored in fitting/regularization step

	Second step: Get residuals using fitted parameters for 14261 genes
	|======================================================================| 100%
	Computing corrected count matrix for 14261 genes
	|======================================================================| 100%
	Calculating gene attributes
	Wall clock passed: Time difference of 21.533244133 secs
	Determine variable features
	Set 2000 variable features
	Place corrected count matrix in counts slot
	Regressing out percent.mt
	|======================================================================| 100%
	Centering data matrix
	|======================================================================| 100%
	Set default assay to SCT
	There were 50 or more warnings (use warnings() to see the first 50)


	> head(LAB3@assays$SCT@meta.features, 10)
	sct.detection_rate sct.gmean sct.variance
	Xkr4 0.04664310954064 0.03404338227947 0.0527136509714
	Sox17 0.00424028268551 0.00466047388371 0.0246680094562
	Mrpl15 0.21696113074205 0.18299618913749 0.3028303537067
	Lypla1 0.09187279151943 0.07007620591838 0.1174974135475
	Tcea1 0.42332155477032 0.49515519095694 1.2785102033676
	Rgs20 0.01201413427562 0.00836233390640 0.0118781893333
	Atp6v1h 0.25017667844523 0.22202650479581 0.3996101578861
	Rb1cc1 0.30530035335689 0.31316853317088 0.8240562572158
	4732440D04Rik 0.06501766784452 0.04901229712166 0.0824945896912
	St18 0.35971731448763 0.40934610049877 1.1449113109191
	sct.residual_mean sct.residual_variance sct.variable
	Xkr4 0.00166480151548 0.895449988581 FALSE
	Sox17 0.03266477230429 3.078487641011 TRUE
	Mrpl15 -0.02188134713981 0.840939673488 FALSE
	Lypla1 -0.00331010685490 0.961113820969 FALSE
	Tcea1 -0.03470735655467 0.763613832523 FALSE
	Rgs20 0.03048786789515 1.417992957444 FALSE
	Atp6v1h -0.01523621373206 0.888376514993 FALSE
	Rb1cc1 0.02399767955136 1.155274022331 FALSE
	4732440D04Rik 0.02839262075989 1.362361524466 FALSE
	St18 0.05118676291798 1.321781313407 FALSE

Both objects are identical before SCTransform, as far as I can tell: I printed 12 digits for this purpose and the percent.mt columns are identical.

I have stepped through the SCTransform function, and every variable is identical in both environments before the vst call.
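
One caveat with comparing printed output: 12 digits can hide a difference, because a double carries about 17 significant digits. A small sketch using the two theta values reported for Gm996 further down in this post (the variable names are only for illustration):

```r
# Theta for Gm996 as printed at home and at the lab (values from this thread).
home_theta <- 0.32357670929216
lab_theta  <- 0.32357670929215

# With 12 significant digits the two values print identically...
print(format(home_theta, digits = 12) == format(lab_theta, digits = 12))

# ...but they are not the same number.
print(identical(home_theta, lab_theta))
print(sprintf("%.17g", c(home_theta, lab_theta)))

# all.equal() with tolerance = 0 reports any numeric difference, however small.
print(all.equal(home_theta, lab_theta, tolerance = 0))
```

Comparing objects with `identical()` or `all.equal(..., tolerance = 0)` avoids relying on what happens to be printed.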

	# Part of the SCTransform function
	if (conserve.memory) {
	  vst.args[["residual_type"]] <- "none"
	  vst.out <- do.call(what = "vst", args = vst.args)
	  feature.variance <- get_residual_var(vst_out = vst.out,
	    umi = umi, residual_type = residual.type, res_clip_range = res.clip.range)
	  vst.out$gene_attr$residual_variance <- NA_real_
	  vst.out$gene_attr[names(x = feature.variance), "residual_variance"] <- feature.variance
	} else {
	  vst.out <- do.call(what = "vst", args = vst.args)
	  feature.variance <- setNames(object = vst.out$gene_attr$residual_variance,
	    nm = rownames(x = vst.out$gene_attr))
	}

So I dived into the vst function, trying to catch the point where a variable first differs between the two computers, and I arrived at the reg_model_pars function.

All inputs are the same in both environments, yet the resulting model_pars_fit table differs.

A small digit inconsistency is already visible in the theta column of the model_pars object, where the 14th decimal place sometimes differs between the two environments (see Gm996 and Miip below).

#From home
Browse[2]> head(model_pars)
                   theta     (Intercept)        log_umi
Gm996   0.32357670929216  -7.27629425513 1.200396174984
Miip    1.36366857498143 -10.81004819372 2.049307632658
Gatm    2.83584888603177  -9.37186401727 3.010610338152
Gm42670 0.04096150125472 -14.57765295644 2.607220585029
Igfbp6  0.00260318017233 -10.27534471280 1.474478130387
Lrrk2   0.03483076688676  -8.36917341648 0.922622854941

Browse[2]> head(model_pars_fit)
                 theta    (Intercept)       log_umi
Xkr4   0.3387578388903 -10.7343273015 2.04951110982
Sox17  0.0448549826806 -11.2091790634 1.65706231561
Mrpl15 2.7704770356570 -10.6195816599 2.45828041253
Lypla1 1.1300709637094 -10.9628137308 2.30163275018
Tcea1  4.6505611537629 -10.5167181216 2.68957605843
Rgs20  0.0639445102836 -10.8614308264 1.71239484860

#From the lab
Browse[2]> head(model_pars)
                   theta     (Intercept)        log_umi
Gm996   0.32357670929215  -7.27629425513 1.200396174984
Miip    1.36366857498148 -10.81004819372 2.049307632658
Gatm    2.83584888603177  -9.37186401727 3.010610338152
Gm42670 0.04096150125472 -14.57765295644 2.607220585029
Igfbp6  0.00260318017233 -10.27534471280 1.474478130387
Lrrk2   0.03483076688676  -8.36917341648 0.922622854941

Browse[2]> head(model_pars_fit)
                 theta    (Intercept)       log_umi
Xkr4   0.3387340069401 -10.7337639999 2.04936611746
Sox17  0.0448925740448 -11.1244329222 1.63384459020
Mrpl15 2.7704551684688 -10.6195337182 2.45826806094
Lypla1 1.1296301798296 -10.9605802435 2.30105731963
Tcea1  4.6505611537629 -10.5167181216 2.68957605843
Rgs20  0.0638077258829 -10.8789609804 1.71695512126

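But after the reg_model_pars function the inconsistency can appear as early as the first decimal place (for example the Sox17 intercept, -11.209 at home versus -11.124 at the lab).

So once again I wanted to dig into the reg_model_pars function. I was able to track the changes down to this call: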
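
To put numbers on this, here is a small sketch that computes the relative difference between the two model_pars_fit tables for three of the genes printed above (values copied from this post; the matrix layout is just a convenient way to hold them):

```r
cols  <- c("theta", "(Intercept)", "log_umi")
genes <- c("Xkr4", "Sox17", "Tcea1")

# model_pars_fit rows as printed at home and at the lab.
home <- matrix(c(0.3387578388903, -10.7343273015, 2.04951110982,
                 0.0448549826806, -11.2091790634, 1.65706231561,
                 4.6505611537629, -10.5167181216, 2.68957605843),
               nrow = 3, byrow = TRUE, dimnames = list(genes, cols))
lab  <- matrix(c(0.3387340069401, -10.7337639999, 2.04936611746,
                 0.0448925740448, -11.1244329222, 1.63384459020,
                 4.6505611537629, -10.5167181216, 2.68957605843),
               nrow = 3, byrow = TRUE, dimnames = list(genes, cols))

# Element-wise relative difference between the two environments.
rel_diff <- abs(home - lab) / abs(lab)
print(signif(rel_diff, 3))

# Genes whose printed parameters agree in every column.
print(rownames(rel_diff)[rowSums(rel_diff > 0) == 0])
```

Note that the discrepancy is gene-dependent: Tcea1 matches to every printed digit, while the Sox17 intercept differs by a bit less than 1%.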

model_pars_fit[o, 'dispersion_par'] <- ksmooth(x = genes_log_gmean_step1, y = model_pars[, 'dispersion_par'], x.points = x_points, bandwidth = bw, kernel='normal')$y

This ends in a call to a C function, which is where I lost track of the computation...

	function (x, y, kernel = c("box", "normal"), bandwidth = 0.5,
	  range.x = range(x), n.points = max(100L, length(x)), x.points)
	{
	  if (missing(y) || is.null(y))
	    stop("numeric y must be supplied.\nFor density estimation use density()")
	  kernel <- match.arg(kernel)
	  krn <- switch(kernel, box = 1L, normal = 2L)
	  x.points <- if (missing(x.points))
	    seq.int(range.x[1L], range.x[2L], length.out = n.points)
	  else {
	    n.points <- length(x.points)
	    sort(x.points)
	  }
	  ord <- order(x)
	  .Call(C_ksmooth, x[ord], y[ord], x.points, krn, bandwidth)
	}

I am really sad not to fully understand what is going on here, but I am more and more convinced that it is a digit-handling issue.
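
For readers who want to see inside the compiled step, here is a plain-R sketch of a Gaussian kernel regression smoother. It is an assumption on my part that this mirrors what `ksmooth(kernel = "normal")` computes: the scaling follows the `?ksmooth` statement that kernel quartiles sit at +/- 0.25 * bandwidth, and the sketch ignores any truncation of the kernel tails the C code may apply. The function name `nw_smooth` is hypothetical.

```r
# Gaussian kernel regression: each output is a weighted mean of y,
# with weights that decay with the distance between x and the target point.
nw_smooth <- function(x, y, x.points, bandwidth) {
  # Quartiles at +/- 0.25 * bandwidth imply this standard deviation.
  sdev <- 0.25 * bandwidth / qnorm(0.75)
  x.points <- sort(x.points)
  vapply(x.points, function(x0) {
    w <- exp(-0.5 * ((x - x0) / sdev)^2)  # kernel weights
    sum(w * y) / sum(w)                   # weighted mean of y
  }, numeric(1))
}

set.seed(1)
x  <- runif(50)
y  <- sin(2 * pi * x) + rnorm(50, sd = 0.1)
xp <- seq(0.1, 0.9, by = 0.1)

mine <- nw_smooth(x, y, xp, bandwidth = 2)
ref  <- ksmooth(x, y, kernel = "normal", bandwidth = 2, x.points = xp)$y

# A wide bandwidth is used so that tail truncation should not matter here.
print(all.equal(mine, ref, tolerance = 1e-5))
```

The computation is only `exp()` calls and sums. As far as I know, neither the last bit of `exp()` nor the order of a long summation is guaranteed to be identical across C math libraries and compilers, so a step like this can differ slightly between Linux and Windows without any bug being involved.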

ADD COMMENT • link 4.9 years ago by Bastien Hervé 6.4k

4.9 years ago

ATpoint 89k

What are the warnings?

I would start by checking SCTransform itself:

- Are the selected highly variable genes (HVGs) exactly the same between the platforms?
- Are the Pearson residuals obtained from SCTransform the same? I guess it would be enough to check a few samples.

If the answer to both is yes (and I would expect that), then the issue is downstream, meaning PCA and the clustering algorithms. I think Seurat by default uses Louvain clustering on a Jaccard-similarity graph, which is (I think) deterministic, but I am not really sure. UMAP itself is not, and will change depending on the seed, but afaik (I do not use Seurat though) there is a constant internal seed for all Seurat functions. The same should go (not sure) for PCA. At least the workflow I follow (Bioconductor, OSCA) always explicitly sets a seed prior to running PCA and clustering.

I mean, the clustering looks basically the same in the overall picture, just a bit rotated, doesn't it? You have 7 clusters, four of them forming that big Africa-shaped cloud and the other three a bit scattered around. That indicates that it is not a basic parameter (such as HVGs and residuals) that changes but rather something more subtle, e.g. a seed. Try to set explicit seeds for all downstream functions (if possible).
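
As a side note on "just a bit rotated": a rotated or mirrored 2-D embedding has exactly the same pairwise distances as the original, so it describes the same neighbourhood structure. A minimal base-R sketch on toy coordinates (not real UMAP output; `same_geometry` is a hypothetical helper):

```r
set.seed(1)
emb <- matrix(rnorm(40), ncol = 2)   # toy 2-D embedding, 20 points

# Mirror image: flip the first axis.
mirrored <- emb
mirrored[, 1] <- -mirrored[, 1]

# Rotation by 60 degrees.
theta    <- pi / 3
rotation <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
rotated  <- emb %*% rotation

# Two embeddings share their geometry if all pairwise distances match.
same_geometry <- function(a, b) {
  isTRUE(all.equal(as.vector(dist(a)), as.vector(dist(b))))
}

print(same_geometry(emb, mirrored))
print(same_geometry(emb, rotated))
```

A check like this only tells rotations and reflections apart from real changes; two UMAPs built from slightly different inputs will not match exactly.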

ADD COMMENT • link 4.9 years ago by ATpoint 89k

Thank you, really interesting thinking! I dug a bit more into SCTransform and it seems the HVGs are not always the same, see below.

Warnings are the same in both cases: the same line is repeated, except for warning 31:

#Warnings
29: In theta.ml(y = y, mu = fit$fitted) : iteration limit reached
30: In theta.ml(y = y, mu = fit$fitted) : iteration limit reached
31: In sqrt(1/i) : NaNs produced
32: In theta.ml(y = y, mu = fit$fitted) : iteration limit reached
33: In theta.ml(y = y, mu = fit$fitted) : iteration limit reached
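
With 50 or more warnings, `warnings()` only shows the first 50. A small sketch for capturing and tabulating every warning raised by an expression, so that the counts can be compared between the two machines (the helper name `collect_warnings` is hypothetical, and the toy expression merely stands in for the SCTransform call):

```r
# Evaluate an expression, record each warning message, and keep going.
collect_warnings <- function(expr) {
  msgs <- character()
  value <- withCallingHandlers(
    expr,
    warning = function(w) {
      msgs <<- c(msgs, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(value = value, counts = table(msgs))
}

# Toy expression raising three warnings of two different kinds.
res <- collect_warnings({
  sqrt(-1)
  as.integer("x")
  sqrt(-4)
  "done"
})
print(res$counts)
```

Warning texts depend on the session language (as with the French message on the Windows machine), so compare the counts per kind rather than the raw strings.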

Checking SCTransform

I have 7 different samples in SCRNARR.list

SCRNARR.list.LAB <- readRDS(file = "/mnt/raid1/Data/SCRNARR/SCRNARR_AfterSCT_LAB.rds")
object.size(SCRNARR.list.LAB)
2131213616 bytes

SCRNARR.list.HOME <- readRDS(file = "/mnt/raid1/Data/SCRNARR/SCRNARR_AfterSCT_HOME.rds")
object.size(SCRNARR.list.HOME)
2131212976 bytes

#For LAB
HVGs.LAB = list()
for (sample in 1:length(SCRNARR.list.LAB)) {
    HVGs.LAB[[sample]] <- rownames(SCRNARR.list.LAB[[sample]]@assays$SCT@meta.features[SCRNARR.list.LAB[[sample]]@assays$SCT@meta.features$sct.variable,])
}

mean.list.LAB=list()
for (sample in 1:length(SCRNARR.list.LAB )) {
    mean.list.LAB[[sample]] <- SCRNARR.list.LAB[[sample]]@assays$SCT@meta.features$sct.residual_mean
}

variance.list.LAB=list()
for (sample in 1:length(SCRNARR.list.LAB)) {
    variance.list.LAB[[sample]] <- SCRNARR.list.LAB[[sample]]@assays$SCT@meta.features$sct.residual_variance
}

#For HOME
HVGs.HOME = list()
for (sample in 1:length(SCRNARR.list.HOME)) {
    HVGs.HOME[[sample]] <- rownames(SCRNARR.list.HOME[[sample]]@assays$SCT@meta.features[SCRNARR.list.HOME[[sample]]@assays$SCT@meta.features$sct.variable,])
}

mean.list.HOME=list()
for (sample in 1:length(SCRNARR.list.HOME)) {
    mean.list.HOME[[sample]] <- SCRNARR.list.HOME[[sample]]@assays$SCT@meta.features$sct.residual_mean
}

variance.list.HOME=list()
for (sample in 1:length(SCRNARR.list.HOME)) {
    variance.list.HOME[[sample]] <- SCRNARR.list.HOME[[sample]]@assays$SCT@meta.features$sct.residual_variance
}

#Present in HOME but not in LAB
for (sample in 1:length(HVGs.HOME)) {
    print(setdiff(HVGs.HOME[[sample]], HVGs.LAB[[sample]]))
}

character(0)
character(0)
[1] "Sntg1"   "Tgfbr3l" "Apoa1"  
character(0)
character(0)
character(0)
[1] "Arhgap29" "Gm36888"  "Helz"     "Nrg1"     "Zfp369"   "Zmym6"

#Present in LAB but not in HOME
for (sample in 1:length(HVGs.HOME)) {
    print(setdiff(HVGs.LAB[[sample]], HVGs.HOME[[sample]]))
}

character(0)
character(0)
[1] "Cass4"   "Galnt17" "Grid2"  
character(0)
character(0)
character(0)
[1] "Aldh5a1" "Gm27162" "Gm36969" "Gm37452" "Ung"     "Wfdc18"
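
The two loops above can be folded into one comparison per sample. A sketch on toy gene lists (the gene names and the helper `compare_sets` are made up for illustration):

```r
# For each pair of sets: what is unique to each side, and how much they overlap.
compare_sets <- function(a, b) {
  Map(function(x, y) {
    list(only_a  = setdiff(x, y),
         only_b  = setdiff(y, x),
         jaccard = length(intersect(x, y)) / length(union(x, y)))
  }, a, b)
}

# Toy data: sample s1 agrees fully, sample s3 differs in three genes per side.
home <- list(s1 = c("geneA", "geneB"),
             s3 = c("geneA", "geneB", "geneC", "geneD", "geneE"))
lab  <- list(s1 = c("geneA", "geneB"),
             s3 = c("geneA", "geneB", "geneX", "geneY", "geneZ"))

res <- compare_sets(home, lab)
str(res)
```

Assuming each real list holds exactly 2000 genes, three swapped genes in sample 3 would correspond to 1997 shared genes out of 2003 in the union, so the overlap is still very high.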


for (sample in 1:length(mean.list.LAB)) {
    print(sample)
    print(sum(mean.list.HOME[[sample]]))
    print(sum(mean.list.LAB[[sample]]))
}

[1] 1
[1] 302.9477
[1] 302.9477
[1] 2
[1] 373.7752
[1] 373.7752
[1] 3
[1] 203.9835
[1] 204.2463
[1] 4
[1] 270.4948
[1] 270.4948
[1] 5
[1] 154.8764
[1] 154.8764
[1] 6
[1] 247.2674
[1] 247.2674
[1] 7
[1] 196.836
[1] 198.1857

for (sample in 1:length(variance.list.LAB)) {
    print(sample)
    print(sum(variance.list.HOME[[sample]]))
    print(sum(variance.list.LAB[[sample]]))
}

[1] 1
[1] 25587.02
[1] 25587.02
[1] 2
[1] 30130.43
[1] 30130.43
[1] 3
[1] 20651.4
[1] 20649.48
[1] 4
[1] 25024.74
[1] 25024.74
[1] 5
[1] 19100.73
[1] 19100.73
[1] 6
[1] 18098.05
[1] 18098.05
[1] 7
[1] 16157.47
[1] 16170.22

I only plotted UMAPs for samples 1 to 4, and the screenshot in the main thread comes from sample 3...

I finished the analysis for both the LAB and HOME objects after SCTransform on my office workstation.

For sample 3, the LAB object produced the Africa-shaped cluster the right way round, and the HOME object produced it mirrored.

How can the SCTransform function choose different HVGs from exactly the same input objects and parameters?
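
One possible mechanism, offered as a sketch rather than a confirmed explanation: my understanding is that the variable features are the top n genes ranked by residual variance, and a ranking with a hard cutoff is sensitive to small changes near the cutoff. In the tables above, for example, the residual variance of Rgs20 is 1.415116 at home and 1.417993 at the lab. The toy scores and the helper `top_n` below are made up:

```r
# Keep the names of the n highest-scoring items.
top_n <- function(score, n) {
  names(sort(score, decreasing = TRUE))[seq_len(n)]
}

# Toy residual variances: g2 and g3 are nearly tied around the cutoff.
score_home <- c(g1 = 3.078, g2 = 1.4151, g3 = 1.4150, g4 = 0.90)

# The same scores with a tiny numerical shift on one gene.
score_lab <- score_home
score_lab["g3"] <- score_lab["g3"] + 3e-4

print(top_n(score_home, 2))
print(top_n(score_lab, 2))
```

A shift in the fourth decimal place is enough to swap which gene makes the cut, which would be consistent with only a handful of HVGs differing per sample.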

ADD REPLY • link 4.9 years ago by Bastien Hervé 6.4k

How can the SCTransform function choose different HVGs from exactly the same input objects and parameters?

Hmm, good question. Are the results the same if you run it several times on the same workstation? Maybe test on a single sample: just mclapply it 10 times and see whether the function itself is fully deterministic.
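
A sketch of such a determinism check in base R, with a toy seeded function standing in for SCTransform (the names `all_identical` and `run_once` are hypothetical). It uses `lapply` because `mclapply` relies on forking, which is not available on Windows:

```r
# TRUE if every element of a list is identical to the first one.
all_identical <- function(results) {
  all(vapply(results[-1], identical, logical(1), results[[1]]))
}

# Toy stand-in for a stochastic step that accepts a seed.
run_once <- function(seed) {
  set.seed(seed)
  rnorm(5)
}

seeded   <- lapply(1:10, function(i) run_once(1447854))
unseeded <- lapply(1:10, function(i) rnorm(5))

print(all_identical(seeded))
print(all_identical(unseeded))
```

Comparing the returned objects with `identical()` tests their content directly, which is stricter than comparing sizes.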

ADD REPLY • link 4.9 years ago by ATpoint 89k

Yes, the objects are the same size every time at the lab and the same size every time at home, but the sizes from the lab differ from the ones I get at home.

mclapply(1:10, function(i) {
    object.size(SCTransform(SCRNARR.LAB, seed.use=1447854, variable.features.n = 2000, vars.to.regress = "percent.mt", verbose = FALSE))
})

[[1]]
159113712 bytes

[[2]]
159113712 bytes

[[3]]
159113712 bytes

[[4]]
159113712 bytes

mclapply(1:10, function(i) {
    object.size(SCTransform(SCRNARR.HOME, seed.use=1447854, variable.features.n = 2000, vars.to.regress = "percent.mt", verbose = FALSE))
})

[[1]]
159113040 bytes

[[2]]
159113040 bytes

[[3]]
159113040 bytes

[[4]]
159113040 bytes

ADD REPLY • link 4.9 years ago by Bastien Hervé 6.4k

I would check content, not size. Size is almost never a good indicator of anything:

> object.size(x = c(1,2))
64 bytes
> object.size(x = c(7,3))
64 bytes
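
Extending the same idea, a short sketch of content checks that do detect a difference which `object.size()` misses (toy matrices, not the real SCT assay):

```r
a <- matrix(c(1, 2, 3, 4), nrow = 2)
b <- a
b[2, 2] <- 4 + 1e-12   # tiny change in a single cell

print(object.size(a) == object.size(b))  # same size despite different content
print(identical(a, b))                   # exact comparison
print(which(a != b, arr.ind = TRUE))     # where the two matrices differ
print(max(abs(a - b)))                   # how large the difference is
```

For the real objects, the same calls could be applied to the residual matrices or to the meta.features columns of each sample.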

ADD REPLY • link 4.9 years ago by ATpoint 89k

apply10SCTransform.HOME = list()
apply10SCTransform.HOME <- mclapply(1:10, function(i) {
    SCTransform(SCRNARR.HOME, seed.use=1447854, variable.features.n = 2000, vars.to.regress = "percent.mt", verbose = FALSE)
})
for (n in 1:9){
    print(setdiff(rownames(apply10SCTransform.HOME[[n]]@assays$SCT@meta.features[apply10SCTransform.HOME[[n]]@assays$SCT@meta.features$sct.variable,]), rownames(apply10SCTransform.HOME[[n+1]]@assays$SCT@meta.features[apply10SCTransform.HOME[[n+1]]@assays$SCT@meta.features$sct.variable,])))
}
character(0)
character(0)
character(0)
character(0)
character(0)
character(0)
character(0)
character(0)
character(0)

ADD REPLY • link 4.9 years ago by Bastien Hervé 6.4k