I want to pull RNA-Seq data from TCGA, but datasets are available for both the HiSeq and GA platforms. Can someone please advise which dataset would be better to use, and why? Thanks!
Looking at the COAD data set in the data matrix, it appears that there's no overlap in the v2 RNAseq: a sample has either GA data or HiSeq data, never both. So what you use will depend on the questions you're asking. If you need all the samples, use both. If you're worried about sequencer-driven batch effects, the GA produced the vast majority of the data, so you'd probably want to exclude the HiSeq.
FWIW, I'd just use all of it - as long as the read lengths are similar, there's unlikely to be much in the way of difference - the chemistry is largely the same.
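If you want to check the overlap and platform counts yourself before deciding, here is a minimal sketch. It assumes you've already turned the data matrix listing into (sample, platform) pairs; the sample IDs, the platform labels and the helper names are all made up for illustration and are not real TCGA barcodes or any TCGA API.

```python
from collections import Counter, defaultdict

# Hypothetical (sample, platform) records; NOT real TCGA barcodes or labels.
records = [
    ("SAMPLE-01", "IlluminaGA"),
    ("SAMPLE-02", "IlluminaGA"),
    ("SAMPLE-03", "IlluminaGA"),
    ("SAMPLE-04", "IlluminaHiSeq"),
]


def platforms_by_sample(records):
    """Map each sample ID to the set of platforms it was sequenced on."""
    by_sample = defaultdict(set)
    for sample, platform in records:
        by_sample[sample].add(platform)
    return dict(by_sample)


def summarize(records):
    """Return per-platform sample counts and the samples seen on >1 platform."""
    by_sample = platforms_by_sample(records)
    counts = Counter(p for platforms in by_sample.values() for p in platforms)
    overlap = sorted(s for s, p in by_sample.items() if len(p) > 1)
    return counts, overlap


def select_samples(records, platform=None):
    """Return sorted sample IDs, optionally restricted to a single platform."""
    by_sample = platforms_by_sample(records)
    return sorted(
        s for s, p in by_sample.items() if platform is None or platform in p
    )


counts, overlap = summarize(records)
majority = counts.most_common(1)[0][0]               # platform with the most samples
single_platform = select_samples(records, majority)  # avoids mixing sequencers
all_samples = select_samples(records)                # keeps every sample
```

An empty `overlap` means no sample was run on both platforms. `single_platform` is the conservative choice if you're worried about batch effects, and `all_samples` is the "just use all of it" option.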
You are going to have to be way more specific. There are something like 20 different cancer types in TCGA, and the data was produced across 5 years at different centers. The platforms evolved along with the project, so the data spans a wide variety of sequencing platforms. (Even early HiSeq data may be quite different from late HiSeq data in terms of read lengths, etc.)
Thanks, Chris. I am looking for colorectal adenocarcinoma, specifically.
These two options are much the same in terms of sequencing chemistry, but the HiSeq is the newer sequencer, so in theory I would go with that data.
https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2