Question

TCGA Biolinks code issue

0

Entering edit mode

12 weeks ago

nuorain ▴ 30

Hello,

I try to normalize my data using the code:

dataNorm <- TCGAanalyze_Normalization(tabDF = puried_data,
                                      geneInfo = geneInfo,
                                      method = "gcContent")

But I keep getting the error:

I Need about  33 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]  
Step 1 of 4: newSeqExpressionSet ...
Step 2 of 4: withinLaneNormalization ...
Error in names(y) <- 1:length(y) : 
  'names' attribute [2] must be the same length as the vector [0]

Could you please help me out?

Thanks

normalization TCGAbiolinks • 1.9k views

ADD COMMENT • link updated 12 weeks ago by mazegriff ▴ 100 • written 12 weeks ago by nuorain ▴ 30

1

Entering edit mode

Hi Nuorain,

It looks like either the RNAseq matrix or the geneInfo matrix is in the wrong format or has mismatching identifiers. The error points to an expected vector of length 2 but it is 0 or empty.

It may be that the two objects you pass, the purified_data RNAseq matrix and the geneInfo matrix, have mismatching gene identifiers. Below are possible values for the second argument of TCGAanalyze_Normalization(), provided in the documentation for the library TCGAbiolinks (which offers the function you're calling):

geneInfo Information matrix of 20531 genes about geneLength and gcContent. Two objects are provided: TCGAbiolinks::geneInfoHT,TCGAbiolinks::geneInfo

Perhaps try passing geneInfoHT as the second argument to see if it's compatible with your RNAseq matrix.

Hope this is helpful.

Maze

ADD REPLY • link 12 weeks ago by mazegriff ▴ 100

0

Entering edit mode

Hi Maze, When I passing geneInfoHT as the second arguments, it gave me the error:

Error in TCGAanalyze_Normalization(tabDF = puried_data, method = "gcContent") : 
  argument "geneInfo" is missing, with no default

ADD REPLY • link updated 12 weeks ago by GenoMax 148k • written 12 weeks ago by nuorain ▴ 30

1

Entering edit mode

Okay. Would you run the following please and share, if possible?

> head(geneInfoHT)

> head(puried_data)

ADD REPLY • link updated 12 weeks ago by GenoMax 148k • written 12 weeks ago by mazegriff ▴ 100

0

Entering edit mode

Hi Maze, Thanks for your reply. Here is what I got.

> head(geneInfoHT)
                geneLength gcContent
ENSG00000000003       4536 0.3992504
ENSG00000000005       1476 0.4241192
ENSG00000000419       9276 0.4252911
ENSG00000000457       6883 0.4117391
ENSG00000000460       5970 0.4298157
ENSG00000000938       3382 0.5644589
> 
A-21R-A083-07 TCGA-A6-2681-01A-01R-1410-07

ADD REPLY • link updated 12 weeks ago by GenoMax 148k • written 12 weeks ago by nuorain ▴ 30

0

Entering edit mode

Hi Maze, Thanks for your reply. Here is what I got.

> head(geneInfoHT)
                geneLength gcContent
ENSG00000000003       4536 0.3992504
ENSG00000000005       1476 0.4241192
ENSG00000000419       9276 0.4252911
ENSG00000000457       6883 0.4117391
ENSG00000000460       5970 0.4298157
ENSG00000000938       3382 0.5644589
> 
> head(puried_data)
     TCGA-AA-3692-01A-01R-0905-07 TCGA-AA-A00L-01A-01R-A002-07
[1,]                         8872                         4239
[2,]                            7                            1
     TCGA-G4-6625-01A-21R-1774-07 TCGA-AA-3522-01A-01R-0821-07
[1,]                         7542                         4069
[2,]                            7                           22
     TCGA-CM-4750-01A-01R-1410-07 TCGA-AA-3561-01A-01R-0821-07
[1,]                         5328                         5919
[2,]                          214                           24

ADD REPLY • link updated 12 weeks ago by GenoMax 148k • written 12 weeks ago by nuorain ▴ 30

0

Entering edit mode

Thank you. Given you're running the following:

> dataNorm <- TCGAanalyze_Normalization(tabDF = puried_data,
  geneInfo = geneInfoHT,
  method = "gcContent"
)

The issue could be that the tabDF matrix does not have gene identifiers as row names (in this case, Ensembl IDs since geneInfoHT has Ensembl IDs as shown by the prefix ENSG). This would allow both matrices to match and the function to effectively normalize the data. Here is the requirement in the TCGAbiolinks doc for the tabDF argument of the function:

tabDF Rnaseq numeric matrix, each row represents a gene, each column represents a sample

This could be the reason for the initial zero-length vector error since the function may internally be matching the genes in tabDF with those in geneInfoHT to then proceed with normalization by GC content.

Perhaps try adding the appropriate Ensembl gene IDs as row names to the RNAseq matrix.

Please let me know if this works.

Maze

ADD REPLY • link updated 12 weeks ago by GenoMax 148k • written 12 weeks ago by mazegriff ▴ 100

0

Entering edit mode

Hi Maze, Could you please tell me how to add the appropriate Ensemble gene IDs as row names to the RNAseq matrix? I am very new to the field. Thanks.

ADD REPLY • link 12 weeks ago by nuorain ▴ 30

0

Entering edit mode

Hi Nuorain,

Since I don't see any gene identifier in your RNAseq matrix via head(), you will likely need to obtain them. Once you have your gene IDs (properly ordered to describe the correct counts) in a vector, you can proceed. If they're gene symbols, you will have to map the gene symbols to Ensembl gene IDs to use geneInfoHT. With this vector, gene_ids for instance, you can set it as row names:

rownames(puried_data) <- gene_ids

This should allow TCGAanalyze_Normalization() to match genes between your data and geneInfoHT and effectively perform.

Please confirm if you can find the gene IDs that describe the values in your RNAseq matrix for each TCGA sample.

Maze

ADD REPLY • link 12 weeks ago by mazegriff ▴ 100

1

Entering edit mode

enter image description here

ADD REPLY • link 12 weeks ago by nuorain ▴ 30

0

Entering edit mode

Hi Maze, I found the following mismatch:

> dim(geneInfoHT)
[1] 68016     2
> dim(puried_data)
[1] 60660   495

Is it my script problem or the system has an issue? Thanks.

ADD REPLY • link updated 12 weeks ago by GenoMax 148k • written 12 weeks ago by nuorain ▴ 30

1

Entering edit mode

This seems to reflect the incompatibility due to a framework update. Nice catch.

I'd recommend in-house normalization methods via differential analysis packages like edgeR or DESeq2.

Maze

ADD REPLY • link 12 weeks ago by mazegriff ▴ 100

0

Entering edit mode

Thanks Maze. The above is the results I got . The issue persist.

ADD REPLY • link 12 weeks ago by nuorain ▴ 30

1

Entering edit mode

Please use 101010 button to format code with monospaced font.

> (quote) should be used for quoting parts of text.

ADD REPLY • link 12 weeks ago by GenoMax 148k

0

Entering edit mode

Hi GenoMax, thanks for your reply. Could you please tell me how to do it?

ADD REPLY • link 12 weeks ago by nuorain ▴ 30

0

Entering edit mode

Select the part of text you want to represent as code and then click on the 101010 button when you are in the edit window.

ADD REPLY • link 12 weeks ago by GenoMax 148k

0

Entering edit mode

Hi GenoMax, Thanks for your input. Are you saying I selected the code in the script part, then click 101010 in the console? I did it. The issue persist. Thanks again.

> 101010
[1] 101010
> dataNorm <- TCGAanalyze_Normalization(tabDF = puried_data,
+                                       geneInfo = geneInfo,
+                                       method = "gcContent")
I Need about  375 seconds for this Complete Normalization Upper Quantile  [Processing 80k elements /s]  
Step 1 of 4: newSeqExpressionSet ...
Step 2 of 4: withinLaneNormalization ...
Error in names(y) <- 1:length(y) : 
  'names' attribute [2] must be the same length as the vector [0]
>

ADD REPLY • link updated 12 weeks ago by GenoMax 148k • written 12 weeks ago by nuorain ▴ 30

0

Entering edit mode

Use your mouse to select the text in edit window. Then click on 101010 button. That should do it.

ADD REPLY • link 12 weeks ago by GenoMax 148k

0

Entering edit mode

Also I used ChatGPT to help me format the code. I copied and pasted the new code from ChatGPT and I got the same results.

ADD REPLY • link 12 weeks ago by nuorain ▴ 30

1

Entering edit mode

Does this issue have anything to do with the workflow.type? Since "HTSeq - Counts" is not longer available in TCGA data. We use STAR-counts now. Thanks for your input.

ADD REPLY • link 12 weeks ago by nuorain ▴ 30

0

Entering edit mode

Yes I believe you're right. It appears the GDC data is STAR counts and no longer HTSeq counts for its greater advantages (e.g., efficiency). This explains how TCGAbiolinks which uses the data doesn't work as it once did. I created a script to circumvent this, but it doesn't seem ideal.

Maze

ADD REPLY • link 12 weeks ago by mazegriff ▴ 100

score 1 · Answer 1 · 2024-10-03

Hi Nuorain,

Thank you for sharing the screenshot of your environment.

Using the in-house dataset dataBRCA for tabDF, I noticed the row names are Entrez Gene IDs (used by NCBI) and gene symbols separated by a vertical bar, and are not Ensembl IDs:

>  head(dataBRCA)
TCGA-A7-A13D-01A-13R-A12P-07 TCGA-BH-A0DK-11A-13R-A089-07 TCGA-BH-A1FC-11A-32R-A13Q-07 TCGA-E9-A1RH-11A-34R-A169-07
        ?|100133144                        20.47                        54.59                         7.84                         6.05
        ?|100134869                         9.53                         8.41                         9.16                         8.95
        ?|10357                           258.79                       241.38                       108.00                       229.01

It's possible but not ideal nor free of potential issues to map these Entrez Gene IDs to Ensembl IDs. Here is a first-order approach with print statements and comments, if helpful.

### Load dependencies. ###
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")
BiocManager::install("EDASeq")


#Load. 
library(TCGAbiolinks)
library(EDASeq)
library(biomaRt)
library(dplyr)

### Prepare dataBRCA. ###
#Retain integer Entrez Gene ID part of your query data.
entrez_gene_ids <- rownames(dataBRCA)
entrez_gene_ids_clean <- as.character(sub(".*\\|", "", entrez_gene_ids))
#Retrieve Ensembl gene IDs.
ensembl_human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
gene_mapping <- getBM(
  attributes = c("entrezgene_id", "ensembl_gene_id", "hgnc_symbol"),
  filters = "entrezgene_id",
  values = entrez_gene_ids_clean,
  mart = mart
)
gene_mapping$entrezgene_id <- as.character(gene_mapping$entrezgene_id)
gene_mapping$hgnc_symbol <- as.character(gene_mapping$hgnc_symbol)

#Subset.
entrez_gene_ids_clean <- strsplit(entrez_gene_ids, "\\|")
before_pipe <- sapply(entrez_gene_ids_clean, `[`, 1)  # Part before "|"
after_pipe <- sapply(entrez_gene_ids_clean, `[`, 2)   # Part after "|"
#Method 1: excludes entries with ? for gene symbol.
# subset_ME <- gene_mapping[gene_mapping$entrezgene_id %in% after_pipe
#                 & gene_mapping$hgnc_symbol %in% before_pipe, ]
#Method 2: includes entries with ? for gene symbol, and ensemble ID for 
#genes with identical entrezgeneid and gene symbol.
overlap_gene_symbol <- intersect(gene_mapping$hgnc_symbol, before_pipe)
length(overlap_gene_symbol) #16865 total overlapping genes by symbol between dataBRCA and Ensembl.
overlap_geneid <- intersect(gene_mapping$entrezgene_id, after_pipe)
length(overlap_geneid) #19644 total overlapping genes by geneID between dataBRCA and Ensembl.
matches <- rep(FALSE, nrow(gene_mapping))
for (i in seq_along(before_pipe)) {
  matches <- matches | ((before_pipe[i] == "?" | gene_mapping$hgnc_symbol == before_pipe[i]) &
                          gene_mapping$entrezgene_id == after_pipe[i])
}
#> table(matches)
# matches
# FALSE  TRUE 
# 3268 19515
#There are 19515 entries in gene list that matches with our query data.
#Subset matching indices.
#There are 28 duplicated gene symbols, which we cannot
#accurately map to our gene symbol.
subset <- gene_mapping[matches, ]
#Original gene list has 22783 total genes. Query data (dataBRCA) 
#has 20530 total genes. Their overlap differs by type of identifier for genes.
#Overall, there are 19515 matching genes between gene list and query
#based on above logic.

#Order subset of gene list by query data. 
#Find matches. 
subset$match_index <- match(subset$entrezgene_id, after_pipe)
#Order df.
sorted_subset <- subset %>%
  arrange(match_index) %>%
  select(-match_index, -hgnc_symbol)

#For checking.
shared <- match(sorted_subset$entrezgene_id, after_pipe) #19515 original
subset_query <- dataBRCA[shared,]
rownames(subset_query) <- sorted_subset$ensembl_gene_id
#dataBRCA genes now with Ensembl IDs.

#Find sample to test.
#Use index to confirm order in ref gene list.
#matches <- grep("\\b100\\b", after_pipe)

#Normalize. 
norm_data <- TCGAanalyze_Normalization(
  tabDF=subset_query,
  geneInfo=geneInfoHT,
  method="gcContent"
)

> norm_data
                TCGA-A7-A13D-01A-13R-A12P-07 TCGA-BH-A0DK-11A-13R-A089-07 TCGA-BH-A1FC-11A-32R-A13Q-07 TCGA-E9-A1RH-11A-34R-A169-07
ENSG00000000003                         3144                         2951                         4004                         2953
ENSG00000000005                            0                           26                         1082                          571
ENSG00000000419                         2098                         1062                         1381                         1126

Using Ensembl IDs (or the appropriate gene name IDs) as row names for your tabDF object resolves the initial error related to an empty vector.

Maze