Tutorial:Generating count matrix for STAR counts in GDC v32.0 for RNA-Seq
1
3
Entering edit mode
2.6 years ago
DareDevil ★ 4.3k
## Load the required library
library('TCGAbiolinks')
project_name <- "TCGA-ACC"

## Defines the query to the GDC
query <- GDCquery(project = project_name,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "STAR - Counts")

## Get metadata matrix
metadata <- query[[1]][[1]]

## Download data using api
GDCdownload(query, method = "api")

## Get main directory where data is stored
main_dir <- file.path("GDCdata", project_name)
## Get file list of downloaded files
file_list <- file.path("GDCdata", project_name,list.files(main_dir,recursive = TRUE)) 

## Read first downloaded to get gene names
test_tab <- read.table(file = file_list[1], sep = '\t', header = TRUE)
## Delete header lines that don't contain usefull information
test_tab <- test_tab[-c(1:4),]
## STAR counts and tpm datasets
tpm_data_frame <- data.frame(test_tab[,1])
count_data_frame <- data.frame(test_tab[,1])

## Append cycle to get the complete matrix
for (i in c(1:length(file_list))) {
  ## Read table
  test_tab <- read.table(file = file_list[i], sep = '\t', header = TRUE)
  ## Delete unwanted lines
  test_tab <- test_tab[-c(1:4),]
  ## Column bind of tpm and counts data
  tpm_data_frame <- cbind(tpm_data_frame, test_tab[,7])
  count_data_frame <- cbind(count_data_frame, test_tab[,4])
  ## Print progres from 0 to 1
  print(i/length(file_list))
}
GDC TCGABiolinks STAR • 3.3k views
ADD COMMENT
0
Entering edit mode

Thank you for the tutorial. I've looked all over the net and couldn't find how to load TPM data (most forums post how to retrieve count). I have one problem though, after getting tpm_data_frame, the header becomes test_tab[,7] for all columns. How do I retrieve the unique sample ID for each column? Thank you.

ADD REPLY
0
Entering edit mode
21 months ago
Pearl • 0

Please have a look at this: https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/493. I think we can get TPM counts directly from TCGAbiolink.

############### download OV data from TCGA ################
# defines the query to the GDC
query <- GDCquery(project = "TCGA-OV",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "STAR - Counts")


# download query data using api
GDCdownload(query = query, method = "api")

# prepare data
data <- GDCprepare(query)
data

Starting to add information to samples
 => Add clinical information to samples
 => Adding TCGA molecular information from marker papers
 => Information will have prefix 'paper_' 
Available assays in SummarizedExperiment : 
  => unstranded
  => stranded_first
  => stranded_second
  => tpm_unstrand
  => fpkm_unstrand
  => fpkm_uq_unstrand

## generate TPM counts matrix
TPM <- assay(data,4) %>% data.frame()
TPM[1:3,1:3]
ADD COMMENT
0
Entering edit mode

TPM != counts.

ADD REPLY

Login before adding your answer.

Traffic: 1634 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6