15 months ago
wrab425
How can one establish the batch information for a TCGA gene expression quantitation TSV file?
I added the samples to the cart on the GDC Data Portal, downloaded them, and merged them with the R code below. I hope this helps.
library(data.table)
library(tidyverse)
## gdc_sample_sheet.tsv can be downloaded from the GDC Data Portal
index=read.table("gdc_sample_sheet.tsv",sep="\t",header=TRUE)
index=index[order(index$Sample.ID),]
##read files
setwd("where_you_download_your_data")
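expr_file=index$File.Name
## keep column 4 of each file (the "unstranded" counts, assuming GDC STAR-Counts files)
mat=do.call(cbind,lapply(as.character(expr_file),function(x){fread(x,header=T,sep="\t")[,c(4)]}))
## take the gene annotation columns from the first file
exp_mat=read.table(as.character(index$File.Name[1]),sep="\t",header=T)
mat=data.frame(exp_mat$gene_id,exp_mat$gene_name,exp_mat$gene_type,mat)
## drop the first four rows (the N_* summary rows, assuming STAR-Counts files)
mat=mat[5:nrow(mat),]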
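## NOTE: index$new_id is not defined anywhere above; Sample.ID from the sample sheet is assumed here
colnames(mat)=c("ensembl_gene_id","hgnc_symbol","gene_biotype",as.character(index$Sample.ID))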
## strip the version suffix from the Ensembl IDs (everything from the first "." onwards)
mat$ensembl_gene_id=sub("[.].*$","",as.character(mat$ensembl_gene_id))
write.table(mat,"TCGA.tsv",row.names = FALSE,col.names = TRUE,sep="\t",quote=FALSE)
One coding suggestion:
Using do.call(cbind, ...) may cause performance problems with a large amount of data, because every per-file table is first held in a list and then copied again into a new combined object.
Working out the size of the final data frame or matrix and creating that object in advance is likely to be much faster for large data sets.
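To make the suggestion concrete, here is a minimal sketch of the preallocation approach. It is only an illustration, not a tested replacement for the script above: `read_counts_prealloc` is a hypothetical helper name, base `read.delim` is used instead of `fread` so the sketch has no package dependencies, and it assumes every file has the same genes in the same order with the counts in column 4.

```r
# Hypothetical helper: fill a preallocated genes x samples matrix,
# one column per file, instead of cbind-ing a list of tables.
read_counts_prealloc = function(files, sample_ids, count_col = 4) {
  read_one = function(f) {
    # comment.char = "#" skips a leading "# ..." comment line if present
    read.delim(f, header = TRUE, sep = "\t", comment.char = "#",
               stringsAsFactors = FALSE)
  }
  # Read the first file once to learn the number of rows and the gene IDs
  first = read_one(files[1])
  gene_ids = as.character(first[[1]])
  # Allocate the full matrix up front
  mat = matrix(NA_real_, nrow = nrow(first), ncol = length(files),
               dimnames = list(gene_ids, sample_ids))
  mat[, 1] = first[[count_col]]
  # Fill the remaining columns in place
  for (i in seq_along(files)[-1]) {
    x = read_one(files[i])
    # Assumption: identical gene order in every file; stop if that fails
    stopifnot(identical(as.character(x[[1]]), gene_ids))
    mat[, i] = x[[count_col]]
  }
  mat
}
```

With the objects from the answer it could be called as `read_counts_prealloc(as.character(index$File.Name), as.character(index$Sample.ID))`. The first four summary rows would still need to be dropped afterwards, as in the original script.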