I would like to download matched tumor/normal expression data from TCGA. I would also like sample data such as age, sex, race, etc.
I looked at these files:
Mutations: mc3.v0.2.8.PUBLIC.maf.gz
RNA: EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv
Both are found here: https://gdc.cancer.gov/about-data/publications/pancanatlas
The RNA file is a matrix of expression data with genes as rows and barcodes as columns. The MAF file has the mutation data (kinda like a VCF). It has one column for normal barcodes and one for tumor barcodes.
I was hoping I could match those to the columns in the expression data. However, although each of the two MAF barcode columns contains roughly as many barcodes as the expression matrix has columns, neither one intersects with the expression barcodes at all.
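One thing that might explain the empty intersection: the comparison above uses full aliquot barcodes. The sketch below assumes the usual TCGA barcode layout (dash-separated fields, with the first three fields identifying the participant and the first two characters of the fourth field giving the sample type code). Under that layout, DNA and RNA aliquots from the same sample carry different trailing fields, so they would only match after truncation. The barcodes and helper names here are made up for illustration; they are not taken from the actual files.

```python
def participant_id(barcode: str) -> str:
    """First three dash-separated fields, e.g. TCGA-AA-0001 (assumed layout)."""
    return "-".join(barcode.split("-")[:3])


def sample_id(barcode: str) -> str:
    """Participant plus the two-digit sample type code, e.g. TCGA-AA-0001-01."""
    parts = barcode.split("-")
    return "-".join(parts[:3] + [parts[3][:2]])


def shared_ids(maf_barcodes, expr_barcodes, key=sample_id):
    """Intersect two barcode collections after truncating each with `key`."""
    return {key(b) for b in maf_barcodes} & {key(b) for b in expr_barcodes}


# Hypothetical barcodes: DNA aliquots (as in a MAF tumor barcode column)
# versus RNA aliquots (as in the expression matrix header).
maf_tumor = ["TCGA-AA-0001-01A-11D-A111-08", "TCGA-AA-0002-01A-11D-A111-08"]
expr_cols = ["TCGA-AA-0001-01A-11R-A222-07", "TCGA-AA-0001-11A-11R-A222-07"]

print(set(maf_tumor) & set(expr_cols))                       # full barcodes
print(shared_ids(maf_tumor, expr_cols))                      # sample level
print(shared_ids(maf_tumor, expr_cols, key=participant_id))  # participant level
```

If the real files behave like this toy case, matching at the sample or participant level should give a non-empty overlap where the full-barcode comparison gives none.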
Moreover, when I jointly count case barcode and sample type, there is only one sample type per person. So no tumor/normal pairs.
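As a cross-check on that count, here is a minimal sketch that looks for pairs directly in a list of barcodes such as the expression matrix header. It assumes the standard TCGA convention that sample type codes 01 to 09 are tumor and 10 to 19 are normal, and that the code is the first two characters of the fourth barcode field. The function names and example barcodes are hypothetical.

```python
from collections import defaultdict


def sample_type_code(barcode: str) -> int:
    """Two-digit sample type code from the fourth dash-separated field."""
    return int(barcode.split("-")[3][:2])


def matched_pairs(barcodes):
    """Return {participant: {"tumor": [...], "normal": [...]}} for
    participants that have at least one tumor and one normal barcode."""
    groups = defaultdict(lambda: {"tumor": [], "normal": []})
    for barcode in barcodes:
        participant = "-".join(barcode.split("-")[:3])
        code = sample_type_code(barcode)
        if 1 <= code <= 9:        # assumed tumor range
            groups[participant]["tumor"].append(barcode)
        elif 10 <= code <= 19:    # assumed normal range
            groups[participant]["normal"].append(barcode)
        # other codes (e.g. controls) are ignored
    return {p: g for p, g in groups.items() if g["tumor"] and g["normal"]}


# Made-up header columns: one participant with a pair, one tumor-only.
columns = [
    "TCGA-AA-0001-01A-11R-A222-07",
    "TCGA-AA-0001-11A-11R-A222-07",
    "TCGA-AA-0002-01A-11R-A222-07",
]
pairs = matched_pairs(columns)
print(len(pairs), "participants with a tumor/normal pair")
```

Grouping on the participant portion rather than on a pre-computed case table avoids depending on how the sample type column in that table was derived.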
In the GDC data access portal, I tried creating a manifest, but it doesn't have a project selector, and I end up with >20k files, which is more than it will put in a manifest.
I've also looked at the ISB-CGC BigQuery tables.
https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/BigQuery.html
I found the expression data and case data. These are pretty much the same as the pancan download files. Again, there is only one "SampleType" for each "ParticipantBarcode" or "CaseBarcode", so are there tumor/normal pairs?
I'm stumped.
I am having a similar issue. I am searching for a master file that links the aliquot barcodes (the column names in "EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv") to sample-level data such as "type" (primary tumor vs. adjacent normal).
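Absent such a master file, a possible stopgap is to derive the sample type from the barcode itself. The sketch below assumes the column names are standard TCGA aliquot barcodes and uses a partial, hand-copied subset of the TCGA sample type codes; the full table should be checked against the GDC documentation. `read_header` and `annotate` are hypothetical helper names, and the first column name in the demo ("gene_id") is an assumption about the file, not something confirmed here.

```python
import gzip

# Partial subset of TCGA sample type codes (assumed; verify against GDC).
SAMPLE_TYPES = {
    "01": "Primary Solid Tumor",
    "02": "Recurrent Solid Tumor",
    "06": "Metastatic",
    "10": "Blood Derived Normal",
    "11": "Solid Tissue Normal",
}


def read_header(path):
    """Read only the first line of a (possibly gzipped) TSV."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return handle.readline().rstrip("\r\n").split("\t")


def annotate(column_names):
    """Build one annotation row per column that looks like a TCGA barcode."""
    rows = []
    for name in column_names:
        parts = name.split("-")
        if len(parts) < 4 or parts[0] != "TCGA":
            continue  # skip the gene identifier column and anything unexpected
        code = parts[3][:2]
        rows.append({
            "aliquot": name,
            "participant": "-".join(parts[:3]),
            "code": code,
            "sample_type": SAMPLE_TYPES.get(code, "other/unknown"),
        })
    return rows


# Demo on a made-up header; for the real file one would call
# annotate(read_header("EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv"))
demo = annotate(["gene_id", "TCGA-AA-0001-11A-11R-A222-07"])
for row in demo:
    print(row)
```

The resulting rows can be joined to clinical data (age, sex, race) on the participant column, assuming the clinical table is keyed by the same 12-character participant barcode.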