Question

Weird results for scRNA-seq

0

Entering edit mode

23 months ago

tujuchuanli ▴ 130

Hi, I wanted to analyze the scRNA-seq data on GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE190772). It is the data for breast cancer bone metastases. The data is log transformed counts. I did it as following. There is no err occurred during processing. The Umap looks OK and I can see the separated cell populations on the Umap plot. However, the cluster annotation is weird. The clusters was mixed. Every separated cell population on the Umap plot was comprised cells from almost all clusters. Do you know how to figure it out? Thanks

Belowing is my code:

data = read.table(file = "GSE190772_BoM_logCounts.txt", sep = "\t", header = T)
BoM_SO = CreateSeuratObject(counts = data, min.cells = 3, min.features = 300, project = "BoM")

BoM_SO.filter = subset(BoM_SO, subset = nFeature_RNA > 400 & nFeature_RNA < 8000 & percent.ribo < 0.2 & percent.mito < 0.2)
BoM_SO.filter@assays$RNA@data = as.matrix(data)
BoM_SO.filter@misc[["seurat_data"]] = as.matrix(BoM_SO.filter@assays$RNA@data) 
BoM_SO.filter = FindVariableFeatures(object = BoM_SO.filter, mean.function = ExpMean, dispersion.function = LogVMR, dispersion.cutoff = c(0.05, 5), nfeatures = 3000)
BoM_SO.filter = ScaleData(object = BoM_SO.filter, do.scale = T,do.center = T)

BoM_SO.filter = RunPCA(BoM_SO.filter)
BoM_SO.filter = FindNeighbors(BoM_SO.filter, dims = 1:50)
BoM_SO.filter = FindClusters(BoM_SO.filter, resolution = 1)
BoM_SO.filter = RunUMAP(BoM_SO.filter, dims = 1:50, do.fast = TRUE)
DimPlot(BoM_SO.filter, label = T, sizes.highlight = 5, pt.size = 1)

scRNA-seq GEO • 3.2k views

ADD COMMENT • link updated 23 months ago by Giuseppe ▴ 80 • written 23 months ago by tujuchuanli ▴ 130

1

Entering edit mode

can you show us the plot?

ADD REPLY • link 23 months ago by jv ★ 1.8k

0

Entering edit mode

Hi jv,

Bellowing is my Umap plot. You can see the well separated cell population and mixed cluster annotations.

enter image description here

ADD REPLY • link 23 months ago by tujuchuanli ▴ 130

0

Entering edit mode

it is clear that DimPlot is not coloring the points be cluster identity. This is all the more clear by NA in the legend. I don't know what default metadata vector DimPlotuses, but you can specify which metadata contains the clusters by setting the group.by parameter. As a first pass, try group.by = 'seurat_clusters'. Else, examine your meta data before running DimPlot to see which vector contains the cluster ids.

ADD REPLY • link 23 months ago by jv ★ 1.8k

0

Entering edit mode

Hi jv,

Thank you very much to provide precise suggestion. I checked the metadata table (BoM_SO.filter@meta.data) and found that many “NA”s were assigned to the cells at the end of the Seurat_clusters (BoM_SO.filter@meta.data$Seurat_clusters). I don`t know what happen to the Seurat_clusters. However the cluster ids in BoM_SO.filter@meta.data$RNA_snn_res.1 seemed OK and group by RNA_snn_res.1 in Umap was also normal.

I compared the cluster id in Seurat_clusters and RNA_snn_res.1 and found that it seemed that some cluster ids in Seurat_clusters were omitted then the resting cluster ids were directly appended to the omitted position. For example the cluster id in RNA_snn_res.1 is “1, 2, 3, 4, 5, 6, 7” for cell A, B, C, D, E, F and G. However, in Seurat_clusters the order is “1, 2, 4, 5, 6, 7, NA” for cell A, B, C, D, E, F and G.

I am not sure that I used RNA_snn_res.1 instead of Seurat_clusters in Umap is right and what happen to Seurat_clusters. Do you know how can I make sure of it?

ADD REPLY • link 23 months ago by tujuchuanli ▴ 130

0

Entering edit mode

Hi, there are a few things that I don't understand from your code.

You are working with log-transformed counts, which I assume are also normalized (this is a big assumption). So, again in theory and based on assumptions, they should have been filtered by the authors.

You input it in a Seurat object as counts and then you filter the object. Then you reassign the original log count table - that is, before filtering - to the object's data slot. You also assigned to another slot in misc called "seurat_data". Why? It doesn't sound like Seurat should even allow you to do that (in theory you are supplying a count table with a smaller number of cells than the rest of the object references).

Moreover by passing it first to as.matrix() you are causing two issues: 1) you are replacing an otherwise sparse matrix with a dense one, making the object balloon in size for apparently no reason. 2) issue 1 is not a big problem in itself, but in this specific case this table has the gene symbol (a character vector) at the end of the table. So by transforming the data into a matrix, you are effectively coercing all your numerics to become characters ("0.00000" instead of 0). A character matrix is not going to get you anywhere.

I downloaded the file from GEO and applied the same workflow and it threw an error. When omitting the problematic lines (the ones where you replace the data and the misc slots) it runs well with no errors (see UMAP). However, you still miss the gene symbols by doing this.

You should do first:

library(Matrix)

rownames(data) = data$gene
data$gene = NULL
# here we convert the data frame to a sparse matrix
data = Matrix(as.matrix(data), sparse = TRUE)

Only then you build your Seurat object:

BoM_SO = CreateSeuratObject(counts = data, min.cells = 3, min.features = 300, project = "BoM")

and you can go on with filtering, variable features, scaling and PCA, etc and can even plot genes now that we have them in the row names.

UMAP labelled by clusters

Some gene expression on UMAP

ADD REPLY • link 23 months ago by Giuseppe ▴ 80

0

Entering edit mode

Hi Giuseppe,

Thank you very much to provide precise suggestions. It teached me a lot. I tried your method, and it came out to be fine and got the Umap which was very similar with your plot.

Since some part of my codes referred to the online information. I didn`t sure whether my understanding or what they said was right. Bellowing is my reference:

Reference link:

Belowing is some my own understanding to your questions and a few further questions:

You are working with log-transformed counts, which I assume are also normalized (this is a big assumption). So, again in theory and based on assumptions, they should have been filtered by the authors.

--Yes, I agreed with you that the cells were filtered by the author. It is not necessary to filter it again. I just applied a very loose criterion to filter it ~~.

You input it in a Seurat object as counts and then you filter the object. Then you reassign the original log count table - that is, before filtering - to the object's data slot.

--since the input matrix is normalized and we didn`t need to normalize it again. We need to overwrite the normalized and log-transformed matrix into BoM_SO.filter@assays$RNA@data which stored the normalized and log-transformed matrix after running “NormalizeData” in seurat object. You can see reference link 1# and 4# for the further information.

You also assigned to another slot in misc called "seurat_data". Why?

--To be honest, I didn`t know the exactly reason. My understanding is that it is optional. You can see reference link 4# for the further information.

and you can go on with filtering, variable features, scaling and PCA, etc and can even plot genes now that we have them in the row names.

--I am not sure that my understanding is correct. Do you mean that I didn`t need to normalize the data by “NormalizeData” and also not necessary to overwrite the normalized and log-transformed counts into BoM_SO.filter@assays$RNA@data by "BoM_SO.filter@assays$RNA@data = as.matrix(data)"? If so, how can the Seurat know where was the normalized and log-transformed counts?

ADD REPLY • link 23 months ago by tujuchuanli ▴ 130

0

Entering edit mode

Hi, here some answers.

--since the input matrix is normalized and we didn`t need to normalize it again. We need to overwrite the normalized and log-transformed matrix into BoM_SO.filter@assays$RNA@data which stored the normalized and log-transformed matrix after running “NormalizeData” in seurat object. You can see reference link 1# and 4# for the further information.

This is alright, but in your code you filter the Seurat object first, and then you assign the unfiltered (or pre-filtered) normalized count matrix that you downloaded from GEO to the data slot. This means that there may be some inconsistencies between the number of cells and/or genes in the object (weirdly enough though Seurat does not throw an error), and what is in the normalized counts you supply. For this reason I don't think that operation makes sense. Since you know that the filtering was already done, you can create an object without any filtering and assign the matrix you downloaded to the data slot.

--To be honest, I didn`t know the exactly reason. My understanding is that it is optional. You can see reference link 4# for the further information.

In the context of that GitHub issue reply, the user is suggesting that you store the Seurat-normalized data in the misc slot before you add your own normalization to the data slot. This means that if you want to e.g. compare the result of NormalizeData() with another normalization method, you can keep both of them in the same object. This does not apply to your case for two reasons: 1) you do not have the result of NormalizeData() because you did not supply raw counts to the object; you only have log-transformed, probably normalized counts, nothing to compare it to. 2) in your code you are first assigning the log counts to the data slot, and then copying them to the misc slot. So, even if you had the "original" result of NormalizeData() in the data slot, you would first be overwriting it with the external data and then copying it. This is why it does not make sense.

--I am not sure that my understanding is correct. Do you mean that I didn`t need to normalize the data by “NormalizeData” and also not necessary to overwrite the normalized and log-transformed counts into BoM_SO.filter@assays$RNA@data by "BoM_SO.filter@assays$RNA@data = as.matrix(data)"? If so, how can the Seurat know where was the normalized and log-transformed counts?

exactly; if you supply log-counts to the data slot of the Seurat object, you don't need to normalize them. Seurat uses the data in the data slot for HVG selection and scaling, so it already "knows" where to find it.

An answer from your #4 link mentions an important thing to take into account: the base of the logarithm used for log transformation matters. If the log counts you download from GEO are in natural log, then it's ok; otherwise you would need to re-transform them.

ADD REPLY • link 23 months ago by Giuseppe ▴ 80

0

Entering edit mode

Hi Giuseppe,

Thank you for your replying. Just one more question to make sure

Exactly; if you supply log-counts to the data slot of the Seurat object, you don't need to normalize them. Seurat uses the data in the data slot for HVG selection and scaling, so it already "knows" where to find it.

--Do you mean that I still need to overwrite the BoM_SO.filter@assays$RNA@data by BoM_SO.filter@assays$RNA@data = as.matrix(data)? I understood what you said above was that I need to supply the log-counts to the data slot of the Seurat object. If so, the result of my Umap plot was still abnormal just like before.

If not, how do the Seurat know what I supply to it is log-counts when constructing Seurat object? And if skipping the normalization step, what is the data in data slot? Do the Seurat pass the data in counts slot to the data slot if skipping the normalization step?

ADD REPLY • link 23 months ago by tujuchuanli ▴ 130

0

Entering edit mode

I tested it and found that the input counts matrix will be writed into both counts and data slot regardless of the format of the input counts matrix (raw counts or log-counts) when constructing seurat object. The normalization step will normalized the matrix in data slot and overwrite it by normalized and log transformed matrix when executing “NormalizeData”. It seemed ok when supply log-counts to construct Seurat object and do not execute “NormalizeData”.

However, the final Umap plot was still abnormal when executing “BoM_SO.filter@assays$RNA@data = data” where data was a sparse matrix. As far as I know, it should be the same as without it. Do you know the reason?

ADD REPLY • link 23 months ago by tujuchuanli ▴ 130

0

Entering edit mode

Hi, like I mentioned before, you should not include the NormalizeData() step because you are already supplying normalized counts.

Regarding the final UMAP plot, did you include the step where you remove the character column (I assume you did because you are transforming it to a sparse matrix)? Anyway, if you follow the code in my initial answer you should be alright; messing too much with the inputs and slots of an object is anyway never a good idea, since you would be changing the inputs that functions expect and will generate unpredictable or implausible results.

ADD REPLY • link 23 months ago by Giuseppe ▴ 80