How to extract count matrix from TCGAbiolinks using STAR-Counts
4
1
Entering edit mode
2.7 years ago
Xiaofan ▴ 10

Hi there,

I always used TCGAbiolinks to get raw count for TCGA projects like below:

expquery <- GDCquery(project = "TCGA-KIRC",
                 data.category = "Transcriptome Profiling",
                 data.type = "Gene Expression Quantification",
                 workflow.type = "HTSeq - Counts")
GDCdownload(expquery,directory = "GDCdata")
expquery2 <- GDCprepare(expquery,directory = "GDCdata",summarizedExperiment = T)
expMatrix <- TCGAanalyze_Preprocessing(expquery2)

However, it does not work today and it seems there is no HTSeq - Counts

Error in GDCquery(project = "TCGA-KIRC", data.category = "Transcriptome Profiling",  : 
  Please set a valid workflow.type argument from the list below:
  => STAR - Counts

Therefore, I used STAR - Counts to download the data, which has completely different format of what I downloaded before using HTSeq - Count. The expression matrix for each sample has more columns including fpkm_unstranded and tpm_unstranded.

Actually the data downloaded using STAR - Counts is much more useful but I do not know how to extract the files to a readable expression matrix (ENSEMBL ID as rownames and TCGA tumor barcode as column names) because the GDCprepare() function fails to work on it:

expquery2 <- GDCprepare(expquery,directory = "GDCdata",summarizedExperiment = T)
|                                                          |  0%                      
Error in readr::read_tsv(file = f, col_names = TRUE, progress = FALSE,  : 
  unused argument (show_col_types = FALSE)
Error in if (value == n) { : argument is of length zero

Anyone has any solutions? Many thanks in advance!

TCGA TCGAbiolinks R • 7.4k views
ADD COMMENT
0
Entering edit mode

I'm stuck too

ADD REPLY
0
Entering edit mode

lol, let's wait and see if any hero can save us!

ADD REPLY
4
Entering edit mode
2.7 years ago
Timothee ▴ 80

I have found the solution! Just update the TCGAbiolink package from the Github project, here is the link: https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/495

I had some problems during the installation, I had to reinstall some packages to make it work, here they are:

library(GenomicDataCommons) library(GenomeInfoDbData) library(SummarizedExperiment) library(ExperimentHub) library(AnnotationHub)

ADD COMMENT
0
Entering edit mode

Thank you for this update!

However, I have seemed updated the package to 2.23.6 which should be the updated version from GitHub. I still get the same error that there is no HTSeq - Counts.

May I ask which version you are using for TCGAbiolinks?

ADD REPLY
0
Entering edit mode

I used the code that you mentioned in the issue495 (download STAR - Count data):

GDCdownload(query, method = "api")
rnaseq <- GDCprepare(query)

But the GDCprepare did not work. I do not know why, the error is the same.

ADD REPLY
2
Entering edit mode
2.7 years ago
Timothee ▴ 80
projects <- c("TCGA-LGG","TCGA-GBM")

ids <- c("TCGA-76-6661", "TCGA-26-5132", "TCGA-19-1389")


# RNA Transcriptome
query <- GDCquery(
  project = projects,
  data.category = "Transcriptome Profiling",
  data.type = "Gene Expression Quantification",
  workflow.type = "STAR - Counts",
  barcode = ids
)

GDCdownload(query,
            method = "api", 
            directory = "./data",
            files.per.chunk  =  10)

rna <- GDCprepare(query, save = FALSE, summarizedExperiment = T, directory = "./data", remove.files.prepared = F)
rna <- as.data.frame(rna@assays@data@listData)



# DNA Meth Beta Value
query <-GDCquery(project = projects,
                     data.category= "DNA Methylation",
                     platform = "Illumina Human Methylation 450",
                     data.type = "Methylation Beta Value",
                     legacy = FALSE,
                     barcode = ids)

GDCdownload(query,
            method = "api", 
            directory = "./data",
            files.per.chunk  =  10)


met <- GDCprepare(query, save = FALSE, summarizedExperiment = T, directory = "./data", remove.files.prepared = FALSE)


# IDAT Files
query <-GDCquery(project = projects,
                   data.category = "Raw microarray data",
                   data.type = "Raw intensities", 
                   experimental.strategy = "Methylation array", 
                   legacy = TRUE,
                   file.type = ".idat",
                   platform = "Illumina human methylation 450",
                   barcode = ids)

GDCdownload(query,
            method = "api", 
            directory = "./data",
            files.per.chunk  =  10)
ADD COMMENT
0
Entering edit mode
2.7 years ago
Timothee ▴ 80

This is due to a new update of the TCGA.

Here is what the latest release says: Files that have been processed after data release 25 will be associated with genetic fusion files. As of data release 32, the reference annotation will be updated to GENCODE v36 and HT-Seq will no longer be used. For more information: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#fusion-pipelines

However, I don't have the answer to the problem, I am, like you, stuck.

ADD COMMENT
0
Entering edit mode
2.7 years ago
Timothee ▴ 80

Uninstall the old version and then reinstall the Github version:

BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")

Here is my InfoSession, I hope it will help you.

R version 4.1.3 (2022-03-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.3

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] filesstrings_3.2.2          readr_2.1.2                 dplyr_1.0.8                
 [4] ggplot2_3.3.5               stringr_1.4.0               pracma_2.3.8               
 [7] TCGAbiolinks_2.23.6         SummarizedExperiment_1.24.0 Biobase_2.54.0             
[10] GenomicRanges_1.46.1        GenomeInfoDb_1.30.1         IRanges_2.28.0             
[13] S4Vectors_0.32.4            BiocGenerics_0.40.0         MatrixGenerics_1.6.0       
[16] matrixStats_0.61.0         

loaded via a namespace (and not attached):
 [1] bitops_1.0-7                  bit64_4.0.5                   filelock_1.0.2               
 [4] progress_1.2.2                httr_1.4.2                    tools_4.1.3                  
 [7] utf8_1.2.2                    R6_2.5.1                      DBI_1.1.2                    
[10] colorspace_2.0-3              withr_2.5.0                   tidyselect_1.1.2             
[13] prettyunits_1.1.1             bit_4.0.4                     curl_4.3.2                   
[16] compiler_4.1.3                rvest_1.0.2                   cli_3.2.0                    
[19] xml2_1.3.3                    DelayedArray_0.20.0           scales_1.1.1                 
[22] rappdirs_0.3.3                digest_0.6.29                 R.utils_2.11.0               
[25] XVector_0.34.0                pkgconfig_2.0.3               htmltools_0.5.2              
[28] dbplyr_2.1.1                  fastmap_1.1.0                 strex_1.4.2                  
[31] rlang_1.0.2                   RSQLite_2.2.11                shiny_1.7.1                  
[34] generics_0.1.2                jsonlite_1.8.0                R.oo_1.24.0                  
[37] RCurl_1.98-1.6                magrittr_2.0.3                GenomeInfoDbData_1.2.7       
[40] Matrix_1.4-1                  Rcpp_1.0.8.3                  munsell_0.5.0                
[43] fansi_1.0.3                   lifecycle_1.0.1               R.methodsS3_1.8.1            
[46] stringi_1.7.6                 yaml_2.3.5                    zlibbioc_1.40.0              
[49] plyr_1.8.7                    BiocFileCache_2.2.1           AnnotationHub_3.2.2          
[52] grid_4.1.3                    blob_1.2.2                    promises_1.2.0.1             
[55] ExperimentHub_2.2.1           crayon_1.5.1                  lattice_0.20-45              
[58] Biostrings_2.62.0             hms_1.1.1                     KEGGREST_1.34.0              
[61] knitr_1.38                    pillar_1.7.0                  TCGAbiolinksGUI.data_1.14.0  
[64] biomaRt_2.50.3                XML_3.99-0.9                  glue_1.6.2                   
[67] BiocVersion_3.14.0            downloader_0.4                data.table_1.14.2            
[70] BiocManager_1.30.16           png_0.1-7                     vctrs_0.4.0                  
[73] tzdb_0.3.0                    httpuv_1.6.5                  tidyr_1.2.0                  
[76] gtable_0.3.0                  purrr_0.3.4                   assertthat_0.2.1             
[79] cachem_1.0.6                  xfun_0.30                     mime_0.12                    
[82] xtable_1.8-4                  later_1.3.0                   tibble_3.1.6                 
[85] AnnotationDbi_1.56.2          memoise_2.0.1                 ellipsis_0.3.2               
[88] interactiveDisplayBase_1.32.0

Sincerely

ADD COMMENT
0
Entering edit mode

Thank you so much. May I know your code for this updated version to get count matrix? I would like to see if I can make it....

ADD REPLY
2
Entering edit mode
# Defines the query to the GDC
query <- GDCquery(project = TCGA-OV,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "STAR - Counts")

# Download data using api
GDCdownload(query, method = "api")

#Prepare data
data <- GDCprepare(query)

#generate count matrix
mrna = assay(data)
write.table(mrna, "ov_count_matrix_mrna.txt", sep='\t')
ADD REPLY

Login before adding your answer.

Traffic: 1638 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6