Question

How to extract count matrix from TCGAbiolinks using STAR-Counts

1

Entering edit mode

3.3 years ago

Xiaofan ▴ 10

Hi there,

I always used TCGAbiolinks to get raw count for TCGA projects like below:

expquery <- GDCquery(project = "TCGA-KIRC",
                 data.category = "Transcriptome Profiling",
                 data.type = "Gene Expression Quantification",
                 workflow.type = "HTSeq - Counts")
GDCdownload(expquery,directory = "GDCdata")
expquery2 <- GDCprepare(expquery,directory = "GDCdata",summarizedExperiment = T)
expMatrix <- TCGAanalyze_Preprocessing(expquery2)

However, it does not work today and it seems there is no HTSeq - Counts

Error in GDCquery(project = "TCGA-KIRC", data.category = "Transcriptome Profiling",  : 
  Please set a valid workflow.type argument from the list below:
  => STAR - Counts

Therefore, I used STAR - Counts to download the data, which has completely different format of what I downloaded before using HTSeq - Count. The expression matrix for each sample has more columns including fpkm_unstranded and tpm_unstranded.

Actually the data downloaded using STAR - Counts is much more useful but I do not know how to extract the files to a readable expression matrix (ENSEMBL ID as rownames and TCGA tumor barcode as column names) because the GDCprepare() function fails to work on it:

expquery2 <- GDCprepare(expquery,directory = "GDCdata",summarizedExperiment = T)
|                                                          |  0%                      
Error in readr::read_tsv(file = f, col_names = TRUE, progress = FALSE,  : 
  unused argument (show_col_types = FALSE)
Error in if (value == n) { : argument is of length zero

Anyone has any solutions? Many thanks in advance!

TCGA TCGAbiolinks R • 9.6k views

ADD COMMENT • link updated 3.2 years ago by DareDevil ★ 4.4k • written 3.3 years ago by Xiaofan ▴ 10

0

Entering edit mode

I'm stuck too

ADD REPLY • link 3.3 years ago by Kira • 0

0

Entering edit mode

lol, let's wait and see if any hero can save us!

ADD REPLY • link 3.3 years ago by Xiaofan ▴ 10

0

Entering edit mode

3.3 years ago

Timothee ▴ 80

This is due to a new update of the TCGA.

Here is what the latest release says: Files that have been processed after data release 25 will be associated with genetic fusion files. As of data release 32, the reference annotation will be updated to GENCODE v36 and HT-Seq will no longer be used. For more information: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#fusion-pipelines

However, I don't have the answer to the problem, I am, like you, stuck.

ADD COMMENT • link 3.3 years ago by Timothee ▴ 80

0

Entering edit mode

3.3 years ago

Timothee ▴ 80

Uninstall the old version and then reinstall the Github version:

BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")

Here is my InfoSession, I hope it will help you.

R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.3

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] filesstrings_3.2.2          readr_2.1.2                 dplyr_1.0.8                
 [4] ggplot2_3.3.5               stringr_1.4.0               pracma_2.3.8               
 [7] TCGAbiolinks_2.23.6         SummarizedExperiment_1.24.0 Biobase_2.54.0             
[10] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1         IRanges_2.28.0             
[13] S4Vectors_0.32.4            BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
[16] matrixStats_0.61.0         

loaded via a namespace (and not attached):
 [1] bitops_1.0-7                  bit64_4.0.5                   filelock_1.0.2               
 [4] progress_1.2.2                httr_1.4.2                    tools_4.1.3                  
 [7] utf8_1.2.2                    R6_2.5.1                      DBI_1.1.2                    
[10] colorspace_2.0-3              withr_2.5.0                   tidyselect_1.1.2             
[13] prettyunits_1.1.1             bit_4.0.4                     curl_4.3.2                   
[16] compiler_4.1.3                rvest_1.0.2                   cli_3.2.0                    
[19] xml2_1.3.3                    DelayedArray_0.20.0           scales_1.1.1                 
[22] rappdirs_0.3.3                digest_0.6.29                 R.utils_2.11.0               
[25] XVector_0.34.0                pkgconfig_2.0.3               htmltools_0.5.2              
[28] dbplyr_2.1.1                  fastmap_1.1.0                 strex_1.4.2                  
[31] rlang_1.0.2                   RSQLite_2.2.11                shiny_1.7.1                  
[34] generics_0.1.2                jsonlite_1.8.0                R.oo_1.24.0                  
[37] RCurl_1.98-1.6                magrittr_2.0.3                GenomeInfoDbData_1.2.7       
[40] Matrix_1.4-1                  Rcpp_1.0.8.3                  munsell_0.5.0                
[43] fansi_1.0.3                   lifecycle_1.0.1               R.methodsS3_1.8.1            
[46] stringi_1.7.6                 yaml_2.3.5                    zlibbioc_1.40.0              
[49] plyr_1.8.7                    BiocFileCache_2.2.1           AnnotationHub_3.2.2          
[52] grid_4.1.3                    blob_1.2.2                    promises_1.2.0.1             
[55] ExperimentHub_2.2.1           crayon_1.5.1                  lattice_0.20-45              
[58] Biostrings_2.62.0             hms_1.1.1                     KEGGREST_1.34.0              
[61] knitr_1.38                    pillar_1.7.0                  TCGAbiolinksGUI.data_1.14.0  
[64] biomaRt_2.50.3                XML_3.99-0.9                  glue_1.6.2                   
[67] BiocVersion_3.14.0            downloader_0.4                data.table_1.14.2            
[70] BiocManager_1.30.16           png_0.1-7                     vctrs_0.4.0                  
[73] tzdb_0.3.0                    httpuv_1.6.5                  tidyr_1.2.0                  
[76] gtable_0.3.0                  purrr_0.3.4                   assertthat_0.2.1             
[79] cachem_1.0.6                  xfun_0.30                     mime_0.12                    
[82] xtable_1.8-4                  later_1.3.0                   tibble_3.1.6                 
[85] AnnotationDbi_1.56.2          memoise_2.0.1                 ellipsis_0.3.2               
[88] interactiveDisplayBase_1.32.0

Sincerely

ADD COMMENT • link 3.3 years ago by Timothee ▴ 80

0

Entering edit mode

Thank you so much. May I know your code for this updated version to get count matrix? I would like to see if I can make it....

ADD REPLY • link 3.3 years ago by Xiaofan ▴ 10

2

Entering edit mode

# Defines the query to the GDC
query <- GDCquery(project = TCGA-OV,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "STAR - Counts")

# Download data using api
GDCdownload(query, method = "api")

#Prepare data
data <- GDCprepare(query)

#generate count matrix
mrna = assay(data)
write.table(mrna, "ov_count_matrix_mrna.txt", sep='\t')

ADD REPLY • link 3.2 years ago by DareDevil ★ 4.4k

score 4 · Accepted Answer · 2022-03-30

4

Entering edit mode

3.3 years ago

Timothee ▴ 80

I have found the solution! Just update the TCGAbiolink package from the Github project, here is the link: https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/495

I had some problems during the installation, I had to reinstall some packages to make it work, here they are:

library(GenomicDataCommons) library(GenomeInfoDbData) library(SummarizedExperiment) library(ExperimentHub) library(AnnotationHub)

ADD COMMENT • link 3.3 years ago by Timothee ▴ 80

0

Entering edit mode

Thank you for this update!

However, I have seemed updated the package to 2.23.6 which should be the updated version from GitHub. I still get the same error that there is no HTSeq - Counts.

May I ask which version you are using for TCGAbiolinks?

ADD REPLY • link 3.3 years ago by Xiaofan ▴ 10

0

Entering edit mode

I used the code that you mentioned in the issue495 (download STAR - Count data):

GDCdownload(query, method = "api")
rnaseq <- GDCprepare(query)

But the GDCprepare did not work. I do not know why, the error is the same.

ADD REPLY • link 3.3 years ago by Xiaofan ▴ 10

score 2 · Accepted Answer · 2022-03-31

projects <- c("TCGA-LGG","TCGA-GBM")

ids <- c("TCGA-76-6661", "TCGA-26-5132", "TCGA-19-1389")


# RNA Transcriptome
query <- GDCquery(
  project = projects,
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts",
  barcode = ids
)

GDCdownload(query,
            method = "api", 
            directory = "./data",
            files.per.chunk  =  10)

rna <- GDCprepare(query, save = FALSE, summarizedExperiment = T, directory = "./data", remove.files.prepared = F)
rna <- as.data.frame(rna@assays@data@listData)



# DNA Meth Beta Value
query <-GDCquery(project = projects,
                     data.category= "DNA Methylation",
                     platform = "Illumina Human Methylation 450",
                     data.type = "Methylation Beta Value",
                     legacy = FALSE,
                     barcode = ids)

GDCdownload(query,
            method = "api", 
            directory = "./data",
            files.per.chunk  =  10)


met <- GDCprepare(query, save = FALSE, summarizedExperiment = T, directory = "./data", remove.files.prepared = FALSE)


# IDAT Files
query <-GDCquery(project = projects,
                   data.category = "Raw microarray data",
                   data.type = "Raw intensities", 
                   experimental.strategy = "Methylation array", 
                   legacy = TRUE,
                   file.type = ".idat",
                   platform = "Illumina human methylation 450",
                   barcode = ids)

GDCdownload(query,
            method = "api", 
            directory = "./data",
            files.per.chunk  =  10)