Question

TCGA gene expression quantitation batch information

20 months ago

wrab425 ▴ 50

How can one establish the batch information for a TCGA gene expression quantitation TSV file?

TCGA • 805 views

Updated 19 months ago by Zhenyu Zhang ★ 1.2k • written 20 months ago by wrab425 ▴ 50

score 0 · Answer 1 · 2023-08-21

I added the samples to the cart at the GDC Data Portal, then downloaded them. I merged the per-sample files into one matrix with the R code below. Hope this helps.

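library(data.table)

## gdc_sample_sheet.tsv can be downloaded from the cart page of the GDC Data Portal
index <- read.table("gdc_sample_sheet.tsv", sep = "\t", header = TRUE)
index <- index[order(index$Sample.ID), ]

## Read the files (assumes every expression file sits directly in this directory)
setwd("where_you_download_your_data")
expr_file <- as.character(index$File.Name)
## Take column 4 of each file (the unstranded counts in GDC STAR count files)
mat <- do.call(cbind, lapply(expr_file, function(x) fread(x, header = TRUE, sep = "\t")[, 4]))
## Gene annotation comes from the first file; all files are assumed to list genes in the same order
exp_mat <- read.table(expr_file[1], sep = "\t", header = TRUE)
mat <- data.frame(exp_mat$gene_id, exp_mat$gene_name, exp_mat$gene_type, mat)
## Drop the first four rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous)
mat <- mat[5:nrow(mat), ]
## Sample columns are labelled with Sample.ID; substitute your own ID column if you created one
colnames(mat) <- c("ensembl_gene_id", "hgnc_symbol", "gene_biotype", as.character(index$Sample.ID))
## Strip the version suffix from the Ensembl IDs (ENSG00000000003.15 becomes ENSG00000000003)
mat$ensembl_gene_id <- sub("[.].*$", "", as.character(mat$ensembl_gene_id))
write.table(mat, "TCGA.tsv", row.names = FALSE, col.names = TRUE, sep = "\t", quote = FALSE)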
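
The merge above does not by itself give batch information. One common starting point is the TCGA barcode, which encodes the tissue source site, plate and center; the plate ID is often treated as a batch proxy. Below is a minimal sketch, assuming you have the full seven-field aliquot barcodes (the Sample.ID column of the sample sheet only carries the first four fields, so the full barcodes have to come from elsewhere). The function name parse_tcga_barcode is hypothetical, not part of any package.

```r
# Hypothetical helper: split full TCGA aliquot barcodes into their fields.
# Assumes the standard 7-field layout, e.g. "TCGA-02-0001-01C-01D-0182-01".
parse_tcga_barcode <- function(barcodes) {
  barcodes <- as.character(barcodes)
  parts <- strsplit(barcodes, "-", fixed = TRUE)
  if (any(lengths(parts) != 7)) {
    stop("expected 7 dash-separated fields in every barcode")
  }
  fields <- do.call(rbind, parts)  # one row per barcode, one column per field
  data.frame(
    barcode     = barcodes,
    tss         = fields[, 2],                # tissue source site
    participant = fields[, 3],
    sample_type = substr(fields[, 4], 1, 2),  # e.g. "01" = primary solid tumor
    vial        = substr(fields[, 4], 3, 3),
    portion     = substr(fields[, 5], 1, 2),
    analyte     = substr(fields[, 5], 3, 3),
    plate       = fields[, 6],                # plate ID, a candidate batch variable
    center      = fields[, 7],
    stringsAsFactors = FALSE
  )
}

example <- parse_tcga_barcode("TCGA-02-0001-01C-01D-0182-01")
print(example)
```

The resulting plate and tss columns can then be used as covariates, or to check whether samples cluster by batch, before any correction is applied.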