Question

Matched tumor/normal RNASeq expression counts in TCGA


3.5 years ago

ariel ▴ 250

I would like to download matched tumor/normal expression data from TCGA. I would also like sample-level data such as age, sex, and race.

I looked at the files

Mutations: mc3.v0.2.8.PUBLIC.maf.gz

RNA: EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv

found here: https://gdc.cancer.gov/about-data/publications/pancanatlas

The RNA file is a matrix of expression data with rows->genes, and cols->barcodes. The maf file has the mutation data (kinda like vcf). It has columns for normal barcodes and tumor barcodes.

I was hoping I could match those to the columns in the expression data. However, although each of the two MAF barcode columns has roughly as many barcodes as the expression matrix has columns, there is no intersection between the MAF barcodes and the expression column names.
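One likely reason for the empty intersection is that the MAF barcodes name DNA aliquots while the expression columns name RNA aliquots, so the full strings differ even for the same patient. Under the standard TCGA barcode layout, the first three dash-separated fields identify the participant, so a comparison at that level may work better. A minimal sketch, assuming that layout; the barcodes below are made up for illustration:

```python
def participant_id(barcode: str) -> str:
    """Return the participant part (first three fields) of a TCGA barcode."""
    return "-".join(barcode.split("-")[:3])


# Hypothetical barcodes, for illustration only.
# In the analyte field, "D" marks a DNA aliquot and "R" an RNA aliquot.
maf_tumor_barcodes = [
    "TCGA-AA-0001-01A-01D-0001-08",
    "TCGA-AA-0002-01A-01D-0001-08",
]
expression_columns = [
    "TCGA-AA-0001-01A-01R-0002-07",
    "TCGA-AA-0003-01A-01R-0002-07",
]

# Comparing full aliquot barcodes finds nothing in common.
full_overlap = set(maf_tumor_barcodes) & set(expression_columns)

# Comparing at the participant level finds the shared patient.
participant_overlap = {participant_id(b) for b in maf_tumor_barcodes} & {
    participant_id(b) for b in expression_columns
}

print(len(full_overlap))
print(sorted(participant_overlap))
```

With real data, the two lists would come from the MAF barcode columns and the header row of the expression matrix.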

Moreover, when I jointly count case barcode and sample type, there is only one sample type per person. So no tumor/normal pairs.

In the GDC data access portal, I tried creating a manifest like this

https://portal.gdc.cancer.gov/repository?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.access%22%2C%22value%22%3A%5B%22open%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22HTSeq%20-%20Counts%22%2C%22STAR%20-%20Counts%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%5B%22Gene%20Expression%20Quantification%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.experimental_strategy%22%2C%22value%22%3A%5B%22RNA-Seq%22%5D%7D%7D%5D%7D

But it doesn't have a project selector, and I end up with >20k files, which is more than it will put in a manifest.
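The GDC API accepts the same kind of JSON filter as the portal URL, so the project restriction can be added by hand. A rough sketch that only builds the query URL for the `files` endpoint and does not send it; the project `TCGA-BRCA`, the sample type labels, the requested fields, and the page size are assumptions chosen for illustration:

```python
import json
from urllib.parse import urlencode

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def in_filter(field, values):
    """Build one GDC 'in' clause for a field and its allowed values."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def build_filters(project_id, sample_types):
    """Combine the portal filters with a project and sample type restriction."""
    return {
        "op": "and",
        "content": [
            in_filter("access", ["open"]),
            in_filter("data_type", ["Gene Expression Quantification"]),
            in_filter("experimental_strategy", ["RNA-Seq"]),
            in_filter("analysis.workflow_type", ["STAR - Counts"]),
            in_filter("cases.project.project_id", [project_id]),
            in_filter("cases.samples.sample_type", sample_types),
        ],
    }


# Assumed example: one project, tumor and adjacent normal samples only.
filters = build_filters("TCGA-BRCA", ["Primary Tumor", "Solid Tissue Normal"])
params = {
    "filters": json.dumps(filters),
    "fields": "file_id,file_name,cases.submitter_id,cases.samples.sample_type",
    "format": "TSV",
    "size": "2000",
}
query_url = GDC_FILES_ENDPOINT + "?" + urlencode(params)
print(query_url)
```

Querying one project at a time should keep each result well below the size that a single manifest can hold.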

This gives >15k cases https://portal.gdc.cancer.gov/repository?filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22HTSeq%20-%20Counts%22%2C%22STAR%20-%20Counts%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_type%22%2C%22value%22%3A%5B%22Gene%20Expression%20Quantification%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.experimental_strategy%22%2C%22value%22%3A%5B%22RNA-Seq%22%5D%7D%7D%5D%7D&searchTableTab=cases

I've also looked at the ISB-CGC BigQuery tables.

https://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/BigQuery.html

I found the expression data and case data. These are pretty much the same as the pancan download files. Again, there is only one "SampleType" for each "ParticipantBarcode" or "CaseBarcode", so are there tumor/normal pairs?

I'm stumped.

tumor TCGA RNASeq expression normal • 1.2k views


I am having a similar issue. I am searching for a master file that links the aliquot barcodes (the column names in "EBPlusPlusAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.tsv") to sample-level data such as "type" (primary tumor vs. adjacent normal).

ADD REPLY • link 2.6 years ago by rmelias2 • 0
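A separate master file may not be needed for the sample type, since the TCGA barcode itself carries a two-digit sample type code at the start of the fourth field (01 to 09 for tumor, 10 to 19 for normal). A minimal sketch, assuming the expression column names are standard TCGA aliquot barcodes; the example columns are made up and the code table is only a subset:

```python
from collections import defaultdict

# Subset of the TCGA sample type codes.
SAMPLE_TYPES = {
    "01": "Primary Solid Tumor",
    "02": "Recurrent Solid Tumor",
    "06": "Metastatic",
    "10": "Blood Derived Normal",
    "11": "Solid Tissue Normal",
}


def annotate(barcode):
    """Return (participant, sample type code, label) for a TCGA barcode."""
    fields = barcode.split("-")
    participant = "-".join(fields[:3])
    code = fields[3][:2]
    return participant, code, SAMPLE_TYPES.get(code, "Other")


def tumor_normal_pairs(barcodes):
    """Group barcodes by participant and keep those with tumor and normal."""
    groups = defaultdict(lambda: {"tumor": [], "normal": []})
    for barcode in barcodes:
        participant, code, _ = annotate(barcode)
        number = int(code)
        if 1 <= number <= 9:
            groups[participant]["tumor"].append(barcode)
        elif 10 <= number <= 19:
            groups[participant]["normal"].append(barcode)
    return {p: g for p, g in groups.items() if g["tumor"] and g["normal"]}


# Hypothetical column names, for illustration only.
columns = [
    "TCGA-AA-0001-01A-01R-0002-07",
    "TCGA-AA-0001-11A-01R-0002-07",
    "TCGA-AA-0003-01A-01R-0002-07",
]
pairs = tumor_normal_pairs(columns)
print(annotate(columns[1]))
print(sorted(pairs))
```

Running `tumor_normal_pairs` on the real header row would show directly how many participants have both a tumor and a normal expression sample.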

score 1 · Answer 1 · 2022-03-01


3.3 years ago

Zhenyu Zhang ★ 1.3k

With very few exceptions (less than 1%), TCGA cases do not have normal RNA-Seq.
