I am looking for an alternative way to extract data from TCGA. I know this one, http://www.linkedomics.org/admin.php, but there was another website that was very easy to use and I don't remember its name. Any thoughts?
Hi,
there are various repositories and R packages for accessing and downloading TCGA data. For example, take a look at the TCGAbiolinks R package, which offers various options, including both raw and processed (harmonized) data: http://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/query.html
There are also other projects, such as recount2:
https://jhubiostatistics.shinyapps.io/recount/
But the most important question is: what is your biological question of interest, and what would you like to search for?
Best,
Efstathios-Iason
LinkedOmics to extract TCGA data? LinkedOmics is an analysis tool, and is quite removed from TCGA's raw data. (I'm part of the lab that developed LinkedOmics)
Here you go: TCGA Assembler is another one, apart from TCGAbiolinks and recount. Just remember that all of these will only provide you with a matrix of count data or FPKM/TPM values. This means they are quantified with RSEM. You cannot obtain raw data unless you specifically apply for GDC and ICGC approval and are granted access. Hope this helps!
Dear vchris,
just a small comment: actually, not all of these resources/tools/projects are quantified with RSEM. This is mostly true for the "legacy" TCGA data in the old repository (level 3 or 4). For example, the harmonized versions from GDC:
https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#mrna-expression-workflow
which can be accessed from TCGAbiolinks, provide raw HTSeq gene counts, etc.
Also you can query raw sequencing data:
example
Right, my apologies, I should have been clearer. It is mostly the legacy data that is quantified with RSEM; one can also access STAR-aligned data and retrieve HTSeq counts as well as normalized data. But to access BAMs (which qualify as raw data, since you can regenerate the FASTQ files from them) or the raw FASTQ files, you need the higher access level. Just to clarify: the raw sequencing query in GDCquery, if I am not wrong, will not be able to download the raw legacy data unless one has the token file and access to the controlled data. Check this link
P.S.: I have spent days and nights trying to find a way to download raw data without access control, but later got access with tokens, and it's still not very straightforward after getting access. ;)
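To make the open vs. controlled distinction discussed above concrete, here is a minimal Python sketch that only builds query URLs for the public GDC files endpoint. It is not taken from any of the posts: the field names (`cases.project.project_id`, `data_category`, `access`) and the category values are my understanding of the GDC API and should be checked against the current GDC documentation, `TCGA-BRCA` is just an illustrative project, and the helper names are hypothetical. The sketch sends no request and handles no token.

```python
import json
from urllib.parse import urlencode

# Public GDC files endpoint (assumed unchanged; verify against the GDC API docs).
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_gdc_filters(project_id, data_category, access):
    """Build a GDC-style filter: project AND data category AND access level."""
    def is_in(field, value):
        return {"op": "in", "content": {"field": field, "value": [value]}}

    return {
        "op": "and",
        "content": [
            is_in("cases.project.project_id", project_id),
            is_in("data_category", data_category),
            is_in("access", access),  # "open" or "controlled"
        ],
    }


def build_gdc_files_url(project_id, data_category, access, size=10):
    """Return a URL listing matching files; no request is sent here."""
    params = {
        "filters": json.dumps(build_gdc_filters(project_id, data_category, access)),
        "fields": "file_id,file_name,access,data_type",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_ENDPOINT + "?" + urlencode(params)


# Open-access quantification files (count / FPKM style matrices).
open_counts_url = build_gdc_files_url("TCGA-BRCA", "Transcriptome Profiling", "open")

# Controlled-access aligned reads (BAMs): listing the metadata is possible,
# but downloading the files themselves needs an approved token.
controlled_bam_url = build_gdc_files_url("TCGA-BRCA", "Sequencing Reads", "controlled")

print(open_counts_url)
print(controlled_bam_url)
```

Opening either URL in a browser should list file metadata; only the download step for controlled files requires the token mentioned above.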