Question

Network analysis of TCGA and GTEx gene expression datasets


5 months ago

Nusrat • 0

Can anyone please help?

I was doing DEG analysis in RStudio, following the STAR Protocols network analysis protocol for testicular cancer (the protocol link is given below). I got stuck at step 7. [image]

This is the output it gives: [image]

This is the protocol link: link

Here's the full R code:

setwd("/pathname")

###Step3
library(UCSCXenaTools)
library(data.table)
library(R.utils)
library(dplyr)


###Step4
data(XenaData)
write.csv(XenaData, "00_tblXenaHubInfo.csv")


###Step5
GeneExpectedCnt_toil = XenaGenerate(subset = XenaHostNames == "toilHub") %>%
  XenaFilter(filterCohorts = "TCGA TARGET GTEx") %>%
  XenaFilter(filterDatasets = "TcgaTargetGtex_gene_expected_count")
XenaQuery(GeneExpectedCnt_toil) %>%
  XenaDownload(destdir = "./")

paraCohort = "TCGA testicular Cancer"; #Selecting the Testicular Cancer cohort.
paraDatasets = "TCGA.TGCT.sampleMap/TGCT_clinicalMatrix"; #Selecting the Testicular Cancer clinical matrix.
Clin_TCGA = XenaGenerate(subset = XenaHostNames == "tcgaHub") %>%
  XenaFilter(filterCohorts = paraCohort) %>%
  XenaFilter(filterDatasets = paraDatasets);
XenaQuery(Clin_TCGA) %>%
  XenaDownload(destdir = "./")

Surv_TCGA = XenaGenerate(subset = XenaHostNames == "toilHub") %>%
  XenaFilter(filterCohorts = "TCGA TARGET GTEx") %>%
  XenaFilter(filterDatasets = "TCGA_survival_data");
XenaQuery(Surv_TCGA) %>%
  XenaDownload(destdir = "./")

Pheno_GTEx = XenaGenerate(subset = XenaHostNames == "toilHub") %>%
  XenaFilter(filterCohorts = "TCGA TARGET GTEx") %>%
  XenaFilter(filterDatasets = "TcgaTargetGTEX_phenotype");
XenaQuery(Pheno_GTEx) %>%
  XenaDownload(destdir = "./")



###Step6
filterGTEx01 = fread("TcgaTargetGTEX_phenotype.txt.gz");
names(filterGTEx01) = gsub("\\_", "", names(filterGTEx01));

paraStudy = "GTEX"; #Setting "GTEx" as the study of interest.
paraPrimarySiteGTEx = "Testes"; 
paraPrimaryTissueGTEx = "^Testes"; 

filterGTEx02 = subset(filterGTEx01,
                      study == paraStudy &
                        primarysite == paraPrimarySiteGTEx &
                        grepl(paraPrimaryTissueGTEx, filterGTEx01$`primary disease or tissue`))

filterTCGA01 = fread(paraDatasets);
names(filterTCGA01) = gsub("\\_", "", names(filterTCGA01));
paraSampleType = "Primary Tumor"; #Setting "Primary Tumor" as the sample type of interest.

paraPrimarySiteTCGA = "Testes"; 
paraHistologicalType = "Testicular Germ Cell Tumor"; 

filterTCGA02 = subset(filterTCGA01,
                      sampletype == paraSampleType &
                        primarysite == paraPrimarySiteTCGA &
                        grepl(paraHistologicalType, filterTCGA01$histologicaltype))

filterExpr = c(filterGTEx02$sample, filterTCGA02$sampleID, "sample");
ExprSubsetBySamp = fread("TcgaTargetGtex_gene_expected_count.gz",
                         select = filterExpr)


###STEP7 start from here
probemap = fread("zz_gencode.v23.annotation.csv", select = c(1, 2));
exprALL = merge(probemap, ExprSubsetBySamp, by.x = "sampleID", by.y = "sample");
genesPC = fread("zz_gene.protein.coding.csv");
exprPC = subset(exprALL, gene %in% genesPC$Gene_Symbol);

R GTEx studio TCGA • 701 views

Updated 5 months ago by marco.barr ▴ 150 • written 5 months ago by Nusrat • 0


The column 'gene' does not exist in the exprALL data frame. Inspect the data frame: exprALL may use a different column name for genes, such as Gene_Symbol or something similar. Can you show head(exprALL)?
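A minimal sketch of that inspection, using a toy data frame in place of the real exprALL. The column names and values below are hypothetical and only illustrate the check; the real column name has to be read off the output of names(exprALL).

```r
# Toy stand-in for exprALL; the column names and values are hypothetical.
exprALL <- data.frame(
  Gene_Symbol = c("TP53", "KIT", "XIST"),
  sample_A    = c(10.1, 8.4, 0.0),
  sample_B    = c(9.7, 12.3, 0.2),
  stringsAsFactors = FALSE
)
genesPC <- data.frame(Gene_Symbol = c("TP53", "KIT"), stringsAsFactors = FALSE)

# List the columns that actually exist before referring to one by name.
print(names(exprALL))
print(head(exprALL, 2))

# Stop early with a readable message if the expected column is missing,
# otherwise filter on it.
gene_col <- "Gene_Symbol"  # hypothetical; use the name printed above
if (!gene_col %in% names(exprALL)) {
  stop("Column '", gene_col, "' not found. Available: ",
       paste(names(exprALL), collapse = ", "))
}
exprPC <- exprALL[exprALL[[gene_col]] %in% genesPC$Gene_Symbol, ]
print(exprPC)
```

Checking names() first turns a confusing "object not found" error inside subset() into an explicit message that lists the columns that are available.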

Reply by marco.barr ▴ 150 • 5 months ago


Thanks for responding. Let me show you all the column names for these files. I am also adding links to the files for reference.

"zz_gencode.v23 annotation.csv" was downloaded from the link below and renamed. Since the code uses "select = c(1, 2)", is it possible that this is causing the problem? There are so many columns in this file.

https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/TcgaTargetGtex_gene_expected_count.gz

"zz _gene-protein.coding.csv" was downloaded as genes.xlsx from this link:

https://osf.io/edjzv/

[image]

Reply by Nusrat • 0 • 5 months ago


I believe you, the file is 1.20 GB. Looking at the data, it seems that something went wrong in the merge. Could you show me ExprSubsetBySamp, please? The goal is to merge these data frames only for the selected columns.
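As a rough illustration of what the merge needs, here is a sketch with two toy tables. It assumes the annotation table has an identifier column named id and a symbol column named gene, and that the expression table keeps its identifiers in a column named sample. These names are assumptions for the example, not something confirmed from the files in this thread.

```r
# Toy annotation table; "id" and "gene" are assumed column names.
probemap <- data.frame(
  id   = c("ENSG01", "ENSG02", "ENSG03"),
  gene = c("TP53", "KIT", "XIST"),
  stringsAsFactors = FALSE
)

# Toy expression table; "sample" holds the row identifiers (assumed).
ExprSubsetBySamp <- data.frame(
  sample   = c("ENSG01", "ENSG02", "ENSG04"),
  sample_A = c(10.1, 8.4, 3.3),
  sample_B = c(9.7, 12.3, 2.9),
  stringsAsFactors = FALSE
)

# by.x and by.y must name columns that exist in x and y respectively.
stopifnot("id" %in% names(probemap), "sample" %in% names(ExprSubsetBySamp))

# Inner join: only identifiers present in both tables are kept.
exprALL <- merge(probemap, ExprSubsetBySamp, by.x = "id", by.y = "sample")
print(names(exprALL))
print(nrow(exprALL))
```

If by.x names a column that is not in the first table, merge() fails or the result lacks the gene column, so it helps to compare names(probemap) with the by.x value before merging.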

Reply by marco.barr ▴ 150 • 5 months ago


ExprSubsetBySamp is the big 1.20 GB file that you just mentioned.
Here are its columns: [image]

Reply by Nusrat • 0 • 5 months ago


Please do not post screenshots of text content. You can copy and paste the text into the edit window, then use the 101010 button to format it as code.

Reply by GenoMax 147k • 5 months ago


You are filtering ExprSubsetBySamp using filterExpr, which contains only the columns sample_id and sample. It's expected that it only has one column that is not gene. You need to include the gene column from the TcgaTargetGtex_gene_expected_count.gz file in this selection; otherwise, you will never get a match between genes and sample names. Therefore, the issue lies upstream in the data source. I can't download your file since I'm not in the lab, but you need to inspect the count.gz file to see if it has a gene column.
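One way to inspect a large compressed file without loading it is to read only its header line. The sketch below writes a tiny gzip-compressed, tab-separated stand-in to a temporary directory and checks its column names; the file layout (a first column called sample followed by one column per sample ID) and all IDs are assumptions made for illustration.

```r
# Write a tiny tab-separated, gzip-compressed stand-in for the count matrix.
# The layout and the IDs are hypothetical.
toy_path <- file.path(tempdir(), "toy_expected_count.gz")
con <- gzfile(toy_path, "w")
writeLines(c("sample\tGTEX-0001\tTCGA-0001",
             "ENSG01\t10.1\t9.7",
             "ENSG02\t8.4\t12.3"), con)
close(con)

# Read only the first line to get the column names without loading the data.
con <- gzfile(toy_path, "r")
header <- strsplit(readLines(con, n = 1), "\t", fixed = TRUE)[[1]]
close(con)

print(header)
print("gene" %in% header)

# Check which requested columns are really present before passing them
# to a column selection.
filterExpr <- c("GTEX-0001", "TCGA-0001", "sample", "NOT-IN-FILE")
missing_cols <- setdiff(filterExpr, header)
print(missing_cols)
```

On the real file, the same check would use the path to TcgaTargetGtex_gene_expected_count.gz instead of toy_path. With data.table loaded, fread(path, nrows = 0) is another way to get just the column names.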

Reply by marco.barr ▴ 150 • 5 months ago