Hello, everyone.
I want to download paired samples from the TCGA database of RNA-Seq experiments. I've been looking for information on how to download this data, and it looks like it's available from https://gdac.broadinstitute.org/. Specifically, I'm looking for breast cancer samples, so I looked in the mRNASeq section.
Inside this section there are several files to download (I want the raw counts) but I don't know what the differences are between the files:
illuminahiseq_rnaseqv2-RSEM_genes (MD5)
illuminahiseq_rnaseq-gene_expression (MD5)
I would also like to know how to filter these files to keep the paired samples. I found something about the sample codes in a previous post, but I couldn't access the link to the explanation (https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode).
Could you help me with this problem please?
PS: Suggestions about other databases are welcome!
Hey, Kevin, thanks for your answer.
Following your advice, I took the data from the illuminahiseq_rnaseq-gene_expression (MD5) file of the TCGA database (for breast cancer) and found 97 paired samples. To find them, I kept the samples from patients that appear more than once, for example: TCGA.A6.2675.11A.01R.1723.07 and TCGA.A6.2675.01A.02R.1723.07. In this case, the first would be healthy tissue and the second tumor tissue.
Am I doing something wrong? I ask because in your previous answer you said there were about 111 paired samples.
I have also analyzed the available COAD data and found only 26 paired samples with this method.
Hey, yes, the first sample is healthy tissue, whilst the second is tumour.
The number of matched paired Tumour and Normals will vary based on the exact data that you obtain, and also the filtering that's applied on samples. 97 is a number that I've seen, too! It varies.
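For reference, here is a minimal Python sketch of the pairing logic, in case it helps to compare counts. It assumes the usual TCGA barcode layout, where the fourth field starts with a two-digit sample type code (01 to 09 for tumour, 10 to 19 for normal), and it accepts either "." or "-" as the separator. The function names parse_barcode and find_pairs are my own, and the third barcode in the example is made up to show an unpaired patient.

```python
from collections import defaultdict


def parse_barcode(barcode):
    """Return (patient_id, sample_type_code) for a TCGA aliquot barcode."""
    # Column names often use '.' instead of '-', so normalise first.
    fields = barcode.replace(".", "-").split("-")
    patient_id = "-".join(fields[:3])   # e.g. TCGA-A6-2675
    sample_type = int(fields[3][:2])    # e.g. '11A' -> 11
    return patient_id, sample_type


def find_pairs(barcodes, tumour_codes=range(1, 10), normal_codes=range(10, 20)):
    """Group barcodes by patient; keep patients with both tumour and normal."""
    by_patient = defaultdict(lambda: {"tumour": [], "normal": []})
    for bc in barcodes:
        patient, code = parse_barcode(bc)
        if code in tumour_codes:
            by_patient[patient]["tumour"].append(bc)
        elif code in normal_codes:
            by_patient[patient]["normal"].append(bc)
    return {p: s for p, s in by_patient.items() if s["tumour"] and s["normal"]}


example = [
    "TCGA.A6.2675.11A.01R.1723.07",  # normal, from the post above
    "TCGA.A6.2675.01A.02R.1723.07",  # tumour, from the post above
    "TCGA.XX.0000.01A.01R.0000.07",  # hypothetical patient with tumour only
]
pairs = find_pairs(example)
print(sorted(pairs))
```

Passing tumour_codes={1} and normal_codes={11} restricts the match to primary solid tumour versus solid tissue normal, which is one of the filtering choices that can change the number of pairs you end up with.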