Question

How do I integrate public annotated data onto my data set?

1 day ago

Assa Yeroslaviz ★ 1.9k

Hi,

I'm trying to analyze my data set using the annotated data from the Tabula Muris repository.

I have downloaded the FACS-sorted data and read it into R, along with my own data set.

After QC and filtering of both sets, I have tried to integrate them, but now I'm not sure how to proceed.

First, I pre-process my own data set:

standard <- NormalizeData(standard, normalization.method = "LogNormalize", scale.factor = 10000)
standard <- FindVariableFeatures(standard, selection.method = "mean.var.plot",
                                mean.cutoff = c(0.0125, 3),
                                dispersion.cutoff = c(0.5, Inf))
standard <- ScaleData(standard, vars.to.regress = c("nCount_RNA", "percent.mt"), verbose = TRUE)

This data set is not annotated, and I would like to annotate it using the annotations from the Tabula Muris data.

Next, I prepare the data from Tabula Muris. After creating a list of Seurat objects (object.list) from several tables, I pre-process each element of the list.

for (i in seq_along(object.list)) {
  object.list[[i]] <- NormalizeData(object.list[[i]], verbose = FALSE)
  object.list[[i]] <- FindVariableFeatures(
    object.list[[i]], selection.method = "vst",
    nfeatures = 2000, verbose = FALSE)
  # Scale the data using the selected variable features.
  object.list[[i]] <- ScaleData(object.list[[i]], verbose = FALSE)
  object.list[[i]]$batch <- paste0("Batch", i)
}

This is followed by the integration:

integration.features <- SelectIntegrationFeatures(object.list = object.list, nfeatures = 2000)
anchors <- FindIntegrationAnchors(
    object.list = object.list,
    anchor.features = integration.features
)
integrated <- IntegrateData(anchorset = anchors, normalization.method = "LogNormalize")

This results in another Seurat Object.

Now I need to combine them, but I'm not sure how to do so.

To annotate my standard object, should I use the integrated Seurat object (integrated), or is it better to use the anchors object?

Can I just use merge?

combined <- merge(integrated, y = standard)
combined <- ScaleData(combined)
combined <- RunPCA(combined)
combined <- RunUMAP(combined, dims = 1:30)
#DimPlot(combined, group.by = "batch", reduction = "umap") + ggtitle("Integration Quality by Batch")

SelectIntegrationFeatures IntegrateData Seurat tabula-muris • 287 views

ADD COMMENT • link updated 22 hours ago by Bastien Hervé 5.8k • written 1 day ago by Assa Yeroslaviz ★ 1.9k

What you are integrating at the moment, from your code, are the different batches of the tabula muris.

Which version of Seurat do you have, 4 or 5?

Also, why are you regressing out nCount_RNA? Sequencing depth is already normalized in the NormalizeData step.
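
To illustrate the point about sequencing depth, here is a minimal base-R sketch of the arithmetic behind log-normalization: each count is divided by the cell's total count, multiplied by a scale factor, and passed through log1p. The toy matrix and the helper log_normalize are made up for illustration; this is not Seurat's own implementation, only the same idea on a small dense matrix.

```r
# Toy count matrix: genes in rows, cells in columns.
# cellB has 10x the sequencing depth of cellA but the same expression profile.
counts <- matrix(
  c(1, 3, 6,
    10, 30, 60),
  nrow = 3,
  dimnames = list(c("g1", "g2", "g3"), c("cellA", "cellB"))
)

# Hypothetical helper mirroring the idea of LogNormalize:
# log1p(count / total count of the cell * scale.factor)
log_normalize <- function(m, scale.factor = 10000) {
  per_cell_fraction <- sweep(m, 2, colSums(m), "/")
  log1p(per_cell_fraction * scale.factor)
}

norm <- log_normalize(counts)

# After normalization the two cells are identical, so the depth
# difference between them is gone.
same_profile <- isTRUE(all.equal(norm[, "cellA"], norm[, "cellB"]))
print(same_profile)
```

In this toy case the total count no longer explains any difference between the two cells once the data are normalized.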

ADD REPLY • link 1 day ago by Bastien Hervé 5.8k

The object.list holds the files from the different tissues. batch* is just a way for me to differentiate between their origins.

I'm using v5.

I didn't see my mistake there; thanks for noticing.

ADD REPLY • link 1 day ago by Assa Yeroslaviz ★ 1.9k

score 1 · Answer 1 · 2024-11-20

1 day ago

Bastien Hervé 5.8k

I have not tried it.

Edit: I think the integration can handle multiple levels of correction (batches from Tabula Muris and your personal data).

standard$origin <- "personal"

for (i in seq_along(object.list)) {
  object.list[[i]]$origin <- paste0("Batch", i)
}

Seurat version 5 uses layers rather than a list of objects for integration.

integrated <- merge(standard, y = object.list)
integrated[["RNA"]] <- split(integrated[["RNA"]], f = integrated$origin)
integrated <- NormalizeData(integrated)
integrated <- FindVariableFeatures(integrated)
integrated <- ScaleData(integrated)
integrated <- RunPCA(integrated)
# Harmony is used here, but another integration method can be substituted
integrated <- IntegrateLayers(integrated, method = HarmonyIntegration,
                              orig.reduction = "pca", new.reduction = "harmony",
                              verbose = FALSE)
integrated <- FindNeighbors(integrated, reduction = "harmony", dims = 1:30)
integrated <- FindClusters(integrated, resolution = 2, cluster.name = "harmony_clusters")
integrated <- RunUMAP(integrated, reduction = "harmony", dims = 1:30, reduction.name = "umap.harmony")

ADD COMMENT • link 1 day ago by Bastien Hervé 5.8k

Thanks for that. The code runs, but the integration doesn't work as expected. Below is the UMAP representation.

Some cells are integrated, but for the most part the data sets remain separate.

Is it possible that this is because the rownames are different? I know both data sets are from mouse, but when I look at the differences in gene names, I see this:

> setdiff(rownames(standard), rownames(object.list[[1]])) |> length()
[1] 9853
> intersect(rownames(standard), rownames(object.list[[1]])) |> length()
[1] 13847

Does this mean I need to look for a different annotated data set?

[Figure: DimPlot]

P.S. The standard data set contains two data sets with the original IDs C7 and G7, while the original IDs in the object list are all smartSeq2 (from the FACS + sequencing).

ADD REPLY • link 1 day ago by Assa Yeroslaviz ★ 1.9k

The anchors are selected among the common features. Check why you have so many differences in your features; that will be problematic.

You can try to integrate with CCA instead of Harmony.
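
A minimal base-R sketch of such a feature check, assuming the mismatch comes from something like letter case or Ensembl IDs versus gene symbols (both are only guesses at possible causes). The two toy vectors stand in for rownames(standard) and rownames(object.list[[1]]), and compare_features is a hypothetical helper, not a Seurat function.

```r
# Toy gene-name vectors; the values are made up for illustration.
query_genes <- c("Krt14", "Krt5", "ENSMUSG00000061527", "ACTB", "Gapdh")
ref_genes   <- c("Krt14", "Krt5", "Krt10", "Actb", "Gapdh")

# Hypothetical helper: summarise how two feature sets differ.
compare_features <- function(a, b) {
  shared <- intersect(a, b)
  list(
    n_shared  = length(shared),
    only_in_a = setdiff(a, b),
    only_in_b = setdiff(b, a),
    # Names that match only when case is ignored point to a
    # naming-convention mismatch (e.g. ACTB vs Actb).
    case_only = setdiff(intersect(toupper(a), toupper(b)), toupper(shared)),
    # Ensembl-style mouse IDs among the unmatched names point to an
    # ID-versus-symbol mismatch.
    ensembl_like = grep("^ENSMUSG[0-9]+", setdiff(a, b), value = TRUE)
  )
}

res <- compare_features(query_genes, ref_genes)
str(res)
```

If most of the unmatched names fall into one of these groups, harmonising the gene identifiers before integration may recover many of the features that currently look different.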

ADD REPLY • link 1 day ago by Bastien Hervé 5.8k

Using CCA looks a lot better, but maybe I misunderstand the integration process. The two data sets have different sample names (sample = cell). When I look at the metadata of the integrated data set, I can see the names of the two data sets. But how do I now combine the annotations of the data sets? How do I transfer the known annotation from the Tabula Muris data onto the unannotated standard data?

Do I need to use FindTransferAnchors followed by TransferData here to get the annotations onto the new data?

> table(integrated$cell_ontology_class)
basal cell of epidermis          epidermal cell  keratinocyte stem cell               leukocyte 
                    539                     224                    1362                      10 
 stem cell of epidermis 
                     23 
> table(is.na(integrated$cell_ontology_class))    
FALSE  TRUE 
 2158 20424
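
To see where the missing annotations come from, the NA counts can be cross-tabulated against the data set of origin. A minimal base-R sketch on a toy metadata table; the column names origin and cell_ontology_class follow the ones used in this thread, and the rows are made up for illustration.

```r
# Toy stand-in for the metadata of the integrated object.
meta <- data.frame(
  origin = c("personal", "personal", "personal", "Batch1", "Batch1", "Batch2"),
  cell_ontology_class = c(NA, NA, NA,
                          "keratinocyte stem cell", "epidermal cell", NA),
  stringsAsFactors = FALSE
)

# Rows: data set of origin; columns: whether the annotation is missing.
na_by_origin <- table(
  origin = meta$origin,
  annotation_missing = is.na(meta$cell_ontology_class)
)
print(na_by_origin)
```

A table like this shows whether the NA values come only from the un-annotated data set or also from reference cells that lack a label.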

[Figure: CCA]

ADD REPLY • link 1 day ago by Assa Yeroslaviz ★ 1.9k

You were looking for label transfer then, not integration.

https://satijalab.org/seurat/articles/integration_mapping

ADD REPLY • link 1 day ago by Bastien Hervé 5.8k

Sorry for the confusion, and thanks for the help all the same.

No, I wanted both, to be honest. My plan was to first take one data set from Tabula Muris and then transfer the labels to my data set, but after reading more about it, it made more sense to use multiple data sets. This is why I needed to first integrate the different data sets from Tabula Muris into one object and then transfer the labels identified there onto my data.

My workflow now looks like this:

1. Pre-process my own data set with no annotations (creating Seurat objects, filtering, normalization, scaling and variable feature identification), i.e. the query.
2. Download and pre-process the Tabula Muris data (creating Seurat objects, filtering, normalization, scaling and variable feature identification), i.e. the reference.
3. Find anchors between the reference and the query data sets.
4. Transfer the data gained from the reference data.
5. Add the annotation as a metadata column onto the query data set.

The code for that is below:

### Pre-process the un-annotated data (query)
standard <- NormalizeData(standard)
standard <- FindVariableFeatures(standard)
standard <- ScaleData(standard)
standard <- RunPCA(standard)
standard <- FindNeighbors(standard, dims = 1:50)
standard <- FindClusters(standard)
### Pre-process the annotated data from Tabula Muris (reference)
# The objects could also be merged directly, without creating a list first.
object.list <- list(seu_smart1, seu_smart2, seu_smart3)
# Merge into a single object, because normalization doesn't work on a list.
integrated.list <- merge(object.list[[1]], list(object.list[[2]], object.list[[3]]))
integrated.list <- NormalizeData(integrated.list)
integrated.list <- FindVariableFeatures(integrated.list)
integrated.list <- ScaleData(integrated.list)
integrated.list <- RunPCA(integrated.list)
integrated.list <- IntegrateLayers(integrated.list, method = CCAIntegration,
                                   orig.reduction = "pca", new.reduction = "cca",
                                   verbose = FALSE)
integrated.list <- FindNeighbors(integrated.list, reduction = "cca", dims = 1:30)
integrated.list <- FindClusters(integrated.list, resolution = seq(0, 1, 0.1))
integrated.list <- RunUMAP(integrated.list, reduction = "cca", dims = 1:30)
### Identify transfer anchors and transfer the labels onto the new data
anchors <- FindTransferAnchors(reference = integrated.list, query = standard,
                               dims = 1:50, reference.reduction = "pca")
predictions <- TransferData(anchorset = anchors,
                            refdata = integrated.list$cell_ontology_class, dims = 1:50)
standard <- AddMetaData(standard, metadata = predictions)

Is the integration step necessary? Can one use the split object to identify anchors? Technically it does work, but I am not sure whether this is the better way to do it.

Would this now be the correct approach?

ADD REPLY • link 1 day ago by Assa Yeroslaviz ★ 1.9k

What you do here is correct.
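
As a follow-up check on the transferred labels, here is a minimal base-R sketch for flagging low-confidence predictions. It assumes the predictions table has the predicted.id and prediction.score.max columns that TransferData returns; the toy rows, the final_label column and the 0.5 cutoff are made up for illustration and are not a recommended threshold.

```r
# Toy stand-in for the data frame returned by TransferData.
predictions <- data.frame(
  predicted.id = c("keratinocyte stem cell", "basal cell of epidermis",
                   "leukocyte", "epidermal cell"),
  prediction.score.max = c(0.92, 0.81, 0.35, 0.48),
  row.names = c("cell1", "cell2", "cell3", "cell4"),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: cells below it are left unassigned.
score_cutoff <- 0.5

predictions$final_label <- ifelse(
  predictions$prediction.score.max >= score_cutoff,
  predictions$predicted.id,
  "unassigned"
)

print(table(predictions$final_label))
```

Looking at the distribution of prediction.score.max per predicted label before fixing a cutoff gives a better sense of how well the reference covers the query cells.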

ADD REPLY • link 22 hours ago by Bastien Hervé 5.8k