Question

TCGA Broad GDAC Firehose Parse and Match

6.8 years ago

hood821 • 0

Hello, I have downloaded the data for all cancer types from the Broad's GDAC Firehose and unzipped it. There are a ton of files: mage-tab, aux, and Level archives for each data type (clinical, RNA-seq, protein). I was hoping to find some already established code (R or Python) that pulls only the "Level" files, then reads the txt files for the clinical data and the RNA-seq data into an R object for that cancer type. It would also map the sample data identifiers to the clinical data identifiers; there are so many TCGA IDs that it's hard to parse.

I thought this would be something that is done all the time. I can write the code myself, but I am slow and don't want to reinvent the wheel. I want, for every cancer type, the clinical variables and the RNA-seq RSEM data in one R object per type. Oh, and I want a way to toggle on whether or not a sample is "normal". I think I can pull this from the clinical file.

Any help or pointers would be great!

Thanks!

rna-seq R • 2.3k views

score 0 · Answer 1 · 2018-02-14

6.8 years ago

vinvan ▴ 50

There are quite a few R packages out there that do exactly this. You can check TCGAbiolinks or TCGA2STAT.

Sure, I have seen these. My concern is what happens during processing. Is there normalization? Are samples dropped and, if so, why? I want the data with as little manipulation as possible.

Reply • 6.8 years ago by hood821 • 0
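
For anyone who wants to do the matching by hand with no normalization at all, here is a minimal Python sketch of the two steps described in the question: keeping only the "Level" archives, and mapping sample barcodes to clinical (patient) IDs with a tumor/normal toggle. It relies on the standard TCGA barcode layout, where the first three fields identify the patient and the two-digit sample-type code is 01-09 for tumor and 10-19 for normal, so the normal flag can be read from the sample barcode itself. Everything else is an assumption: the function names are made up for illustration, the directory filter assumes the unzipped folder names contain ".Level_", ".aux." or ".mage-tab.", the clinical IDs are upper-cased in case they are stored in lower case, and the barcodes in the usage lines are hypothetical.

```python
import re

# TCGA barcode layout: project-TSS-participant-sample(2 digits)+vial-...
# The first three fields form the patient ID; the two digits are the sample type.
BARCODE_RE = re.compile(
    r"^(TCGA-[A-Z0-9]{2}-[A-Z0-9]{4})-(\d{2})[A-Z]?", re.IGNORECASE
)


def keep_level_dirs(names):
    """Keep only 'Level' archives (assumed naming: '.Level_', '.aux.', '.mage-tab.')."""
    return [
        n for n in names
        if ".Level_" in n and ".aux." not in n and ".mage-tab." not in n
    ]


def parse_barcode(barcode):
    """Return (patient_id, sample_type_code, kind) for a sample-level barcode."""
    m = BARCODE_RE.match(barcode.strip())
    if m is None:
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    patient_id = m.group(1).upper()
    code = int(m.group(2))
    if 1 <= code <= 9:
        kind = "tumor"
    elif 10 <= code <= 19:
        kind = "normal"
    else:
        kind = "other"  # e.g. control samples
    return patient_id, code, kind


def match_samples(sample_barcodes, clinical_ids, include_normal=False):
    """Map each sample barcode to its patient ID if a clinical record exists.

    Nothing is normalized or modified; samples are only skipped when they have
    no clinical record, are not tumor/normal, or are normals and
    include_normal is False.
    """
    known = {cid.strip().upper() for cid in clinical_ids}
    matched = {}
    for bc in sample_barcodes:
        patient_id, _code, kind = parse_barcode(bc)
        if patient_id not in known:
            continue
        if kind == "tumor" or (kind == "normal" and include_normal):
            matched[bc] = patient_id
    return matched


# Hypothetical barcodes, for illustration only.
samples = [
    "TCGA-AB-1234-01A-11R-A000-07",  # tumor
    "TCGA-AB-1234-11A-11R-A000-07",  # matched normal
    "TCGA-CD-5678-01A-11R-A000-07",  # no clinical record below
]
clinical = ["tcga-ab-1234"]
tumor_only = match_samples(samples, clinical)
with_normals = match_samples(samples, clinical, include_normal=True)
```

The column headers of the RSEM table can be passed in as sample_barcodes, and the returned mapping used to subset columns and join the clinical rows, in R or Python alike. Samples without a clinical record are skipped silently here; collecting them in a separate list instead would make every dropped sample traceable.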