Hi Nuorain,
Thank you for sharing the screenshot of your environment.
Using the in-house dataset dataBRCA for tabDF, I noticed the row names are Entrez Gene IDs (used by NCBI) and gene symbols separated by a vertical bar, and are not Ensembl IDs:
> head(dataBRCA)
TCGA-A7-A13D-01A-13R-A12P-07 TCGA-BH-A0DK-11A-13R-A089-07 TCGA-BH-A1FC-11A-32R-A13Q-07 TCGA-E9-A1RH-11A-34R-A169-07
?|100133144 20.47 54.59 7.84 6.05
?|100134869 9.53 8.41 9.16 8.95
?|10357 258.79 241.38 108.00 229.01
It's possible but not ideal nor free of potential issues to map these Entrez Gene IDs to Ensembl IDs. Here is a first-order approach with print statements and comments, if helpful.
### Load dependencies. ###
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("TCGAbiolinks")
BiocManager::install("EDASeq")
#Load.
library(TCGAbiolinks)
library(EDASeq)
library(biomaRt)
library(dplyr)
### Prepare dataBRCA. ###
#Retain integer Entrez Gene ID part of your query data.
entrez_gene_ids <- rownames(dataBRCA)
entrez_gene_ids_clean <- as.character(sub(".*\\|", "", entrez_gene_ids))
#Retrieve Ensembl gene IDs.
ensembl_human <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
gene_mapping <- getBM(
attributes = c("entrezgene_id", "ensembl_gene_id", "hgnc_symbol"),
filters = "entrezgene_id",
values = entrez_gene_ids_clean,
mart = mart
)
gene_mapping$entrezgene_id <- as.character(gene_mapping$entrezgene_id)
gene_mapping$hgnc_symbol <- as.character(gene_mapping$hgnc_symbol)
#Subset.
entrez_gene_ids_clean <- strsplit(entrez_gene_ids, "\\|")
before_pipe <- sapply(entrez_gene_ids_clean, `[`, 1) # Part before "|"
after_pipe <- sapply(entrez_gene_ids_clean, `[`, 2) # Part after "|"
#Method 1: excludes entries with ? for gene symbol.
# subset_ME <- gene_mapping[gene_mapping$entrezgene_id %in% after_pipe
# & gene_mapping$hgnc_symbol %in% before_pipe, ]
#Method 2: includes entries with ? for gene symbol, and ensemble ID for
#genes with identical entrezgeneid and gene symbol.
overlap_gene_symbol <- intersect(gene_mapping$hgnc_symbol, before_pipe)
length(overlap_gene_symbol) #16865 total overlapping genes by symbol between dataBRCA and Ensembl.
overlap_geneid <- intersect(gene_mapping$entrezgene_id, after_pipe)
length(overlap_geneid) #19644 total overlapping genes by geneID between dataBRCA and Ensembl.
matches <- rep(FALSE, nrow(gene_mapping))
for (i in seq_along(before_pipe)) {
matches <- matches | ((before_pipe[i] == "?" | gene_mapping$hgnc_symbol == before_pipe[i]) &
gene_mapping$entrezgene_id == after_pipe[i])
}
#> table(matches)
# matches
# FALSE TRUE
# 3268 19515
#There are 19515 entries in gene list that matches with our query data.
#Subset matching indices.
#There are 28 duplicated gene symbols, which we cannot
#accurately map to our gene symbol.
subset <- gene_mapping[matches, ]
#Original gene list has 22783 total genes. Query data (dataBRCA)
#has 20530 total genes. Their overlap differs by type of identifier for genes.
#Overall, there are 19515 matching genes between gene list and query
#based on above logic.
#Order subset of gene list by query data.
#Find matches.
subset$match_index <- match(subset$entrezgene_id, after_pipe)
#Order df.
sorted_subset <- subset %>%
arrange(match_index) %>%
select(-match_index, -hgnc_symbol)
#For checking.
shared <- match(sorted_subset$entrezgene_id, after_pipe) #19515 original
subset_query <- dataBRCA[shared,]
rownames(subset_query) <- sorted_subset$ensembl_gene_id
#dataBRCA genes now with Ensembl IDs.
#Find sample to test.
#Use index to confirm order in ref gene list.
#matches <- grep("\\b100\\b", after_pipe)
#Normalize.
norm_data <- TCGAanalyze_Normalization(
tabDF=subset_query,
geneInfo=geneInfoHT,
method="gcContent"
)
> norm_data
TCGA-A7-A13D-01A-13R-A12P-07 TCGA-BH-A0DK-11A-13R-A089-07 TCGA-BH-A1FC-11A-32R-A13Q-07 TCGA-E9-A1RH-11A-34R-A169-07
ENSG00000000003 3144 2951 4004 2953
ENSG00000000005 0 26 1082 571
ENSG00000000419 2098 1062 1381 1126
Using Ensembl IDs (or the appropriate gene name IDs) as row names for your tabDF object resolves the initial error related to an empty vector.
Maze
Hi Nuorain,
It looks like either the RNAseq matrix or the geneInfo matrix is in the wrong format or has mismatching identifiers. The error points to an expected vector of length 2 but it is 0 or empty.
It may be that the two objects you pass, the purified_data RNAseq matrix and the geneInfo matrix, have mismatching gene identifiers. Below are possible values for the second argument of TCGAanalyze_Normalization(), provided in the documentation for the library TCGAbiolinks (which offers the function you're calling):
Perhaps try passing geneInfoHT as the second argument to see if it's compatible with your RNAseq matrix.
Hope this is helpful.
Maze
Hi Maze, When I passing geneInfoHT as the second arguments, it gave me the error:
Okay. Would you run the following please and share, if possible?
Hi Maze, Thanks for your reply. Here is what I got.
Hi Maze, Thanks for your reply. Here is what I got.
Thank you. Given you're running the following:
The issue could be that the tabDF matrix does not have gene identifiers as row names (in this case, Ensembl IDs since geneInfoHT has Ensembl IDs as shown by the prefix ENSG). This would allow both matrices to match and the function to effectively normalize the data. Here is the requirement in the TCGAbiolinks doc for the tabDF argument of the function:
This could be the reason for the initial zero-length vector error since the function may internally be matching the genes in tabDF with those in geneInfoHT to then proceed with normalization by GC content.
Perhaps try adding the appropriate Ensembl gene IDs as row names to the RNAseq matrix.
Please let me know if this works.
Maze
Hi Maze, Could you please tell me how to add the appropriate Ensemble gene IDs as row names to the RNAseq matrix? I am very new to the field. Thanks.
Hi Nuorain,
Since I don't see any gene identifier in your RNAseq matrix via head(), you will likely need to obtain them. Once you have your gene IDs (properly ordered to describe the correct counts) in a vector, you can proceed. If they're gene symbols, you will have to map the gene symbols to Ensembl gene IDs to use geneInfoHT. With this vector, gene_ids for instance, you can set it as row names:
This should allow TCGAanalyze_Normalization() to match genes between your data and geneInfoHT and effectively perform.
Please confirm if you can find the gene IDs that describe the values in your RNAseq matrix for each TCGA sample.
Maze
Hi Maze, I found the following mismatch:
Is it my script problem or the system has an issue? Thanks.
This seems to reflect the incompatibility due to a framework update. Nice catch.
I'd recommend in-house normalization methods via differential analysis packages like edgeR or DESeq2.
Maze
Thanks Maze. The above is the results I got . The issue persist.
Please use
101010
button to formatcode
with monospaced font.>
(quote) should be used for quoting parts of text.Hi GenoMax, thanks for your reply. Could you please tell me how to do it?
Select the part of text you want to represent as
code
and then click on the101010
button when you are in the edit window.Hi GenoMax, Thanks for your input. Are you saying I selected the code in the script part, then click 101010 in the console? I did it. The issue persist. Thanks again.
Use your mouse to select the text in edit window. Then click on
101010
button. That should do it.Also I used ChatGPT to help me format the code. I copied and pasted the new code from ChatGPT and I got the same results.
Does this issue have anything to do with the workflow.type? Since "HTSeq - Counts" is not longer available in TCGA data. We use STAR-counts now. Thanks for your input.
Yes I believe you're right. It appears the GDC data is STAR counts and no longer HTSeq counts for its greater advantages (e.g., efficiency). This explains how TCGAbiolinks which uses the data doesn't work as it once did. I created a script to circumvent this, but it doesn't seem ideal.
Maze