Hi,
I needed raw RNA-seq data, so I downloaded the RNA-seq manifest file from the GDC Legacy Archive and used my token to download the raw data.
The manifest looks like this:
id filename md5 size state
d1017f74-3a39-4427-af57-273e34247b49 UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz ed7f23aa9540ef0242cb6ddde30d1aca 5830465428 live
5e2d5c52-596f-49bc-967c-42129abbacbf UNCID_2208720.71b58051-3bf8-4dfb-a431-c8aceab7c799.110608_UNC13-SN749_0073_BD0CV8ABXX_2.tar.gz b1f03852b2ac3c3cd50cb4a87f2a116a 7587398372 live
2ef74f93-5da2-454c-aca2-d86c289eacb8 UNCID_2206802.25be50e7-7705-492d-a44a-0e40180d10c8.110901_UNC12-SN629_0127_BC025UABXX_1_CTTGTA.tar.gz a965da78ada814a35702fd65209b500a 7867889236 live
e01ca3e0-beb0-46b7-bb7c-f5b16f966918 UNCID_2521679.d817dcee-1322-4949-a6e9-138447e6fc56.140417_UNC13-SN749_0343_BC41HBACXX_5_CTTGTA.tar.gz 6e6a26fcce8e84d209b1475249a922de 5187498148 live
992a7083-28ce-4857-898e-9d4b4fbf2fa1 UNCID_2319278.bf92b8cc-9a5c-4e96-917c-c264fe588f8d.131118_UNC12-SN629_0336_AC31D0ACXX_5_ACTTGA.tar.gz bb9e19a5f286ff37bf95cb0c307930ea 6717741168 live
230082b7-39ec-4fe1-b3c6-daf35458f396 UNCID_2206889.526da11e-9125-4fcd-98d7-02994c9783d1.110810_UNC10-SN254_0263_AB09WEABXX_3_CAGATC.tar.gz d93777efebc921e2539aa2b7081da6d4 4766879929 live
9bbada51-d827-4eea-af45-47d7b5ba137e UNCID_2206522.147d6ebb-7359-449a-9e6a-6c8443ebaa2e.110919_UNC13-SN749_0113_AB00WUABXX_3_CGATGT.tar.gz 7bcfe71256ba172fa605bc4ddc04f9c7 7606309309 live
db1b68b0-dc0a-48a5-8acb-4cd45ea186e2 UNCID_2664315.22fe5cac-0623-4d0a-a158-f15fb5477d8f.120710_UNC12-SN629_0215_BC0WRNACXX_3_CTTGTA.tar.gz 281083ac338f67145ede3d8ef3f4300f 7504345358 live
After the download, I have one folder per entry in the "id" column, named after that id. Inside each folder there is a tar.gz file.
For example:
d1017f74-3a39-4427-af57-273e34247b49
|___ UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b.110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz
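The folder layout described above (one folder per "id", containing the file named in "filename") can be sketched in Python as follows. This is only an illustration: it assumes the manifest is tab-delimited, and the helper names `read_manifest` and `expected_paths` and the `downloads` directory are made up for the example.

```python
import csv
import io
from pathlib import PurePosixPath


def read_manifest(text):
    """Parse a tab-delimited manifest into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def expected_paths(rows, download_dir="."):
    """Map each manifest id to <download_dir>/<id>/<filename>."""
    return {
        row["id"]: PurePosixPath(download_dir) / row["id"] / row["filename"]
        for row in rows
    }


# First data row of the manifest shown above (assumed tab-delimited).
manifest_text = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "d1017f74-3a39-4427-af57-273e34247b49\t"
    "UNCID_2207021.7b9569bc-f513-4b64-9a7c-7bb53b9be79b."
    "110801_UNC12-SN629_0115_BD0DVEABXX_3_ACAGTG.tar.gz\t"
    "ed7f23aa9540ef0242cb6ddde30d1aca\t5830465428\tlive\n"
)

rows = read_manifest(manifest_text)
paths = expected_paths(rows, "downloads")  # "downloads" is a hypothetical directory
for file_id, path in paths.items():
    print(file_id, "->", path)
```

The point of keeping the id-to-filename map is that the id (a file UUID) is the key you would later use to look up the TCGA barcode.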
When I extracted the tar.gz files, I got FASTQ files like these:
110801_UNC12-SN629_0115_BD0DVEABXX.3_1.fastq
110801_UNC12-SN629_0115_BD0DVEABXX.3_2.fastq
What is the sample name here? It looks very confusing.
As this is controlled data, you could log in at the main GDC portal ( https://portal.gdc.cancer.gov/ ) and use the search box to search for the file names; they should be there. You would then look for the UUID or TCGA barcode.
To do this programmatically, there are APIs, but they were offline the last time I tried them; a lot of TCGA data has been moved around relatively recently. One way I did it was to download the JSON manifest for my data (from the Legacy Archive) and then use a loop in R to pull out the case ID (the UUID, in this case), which I then used to identify the patients. Here's the loop that I used (it is slow; the sample filenames are in the filenames object):
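As a separate illustration of the API route (this is not the R loop referred to above), here is a rough Python sketch of how a file-UUID-to-barcode query could be put together. Everything in it is an assumption to check against the current GDC API documentation: the endpoint URL, the field names, and the response shape. Whether Legacy Archive files are served from this endpoint is also not guaranteed. The mock response and the `TCGA-XX-XXXX-01A` barcode are placeholders, not real data.

```python
import json

# Assumed endpoint; verify against the current GDC API documentation.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_query(file_ids):
    """Build a request body asking for the cases/samples attached to each file UUID."""
    file_ids = list(file_ids)
    return {
        "filters": {
            "op": "in",
            "content": {"field": "file_id", "value": file_ids},
        },
        # Assumed field names for the case and sample barcodes.
        "fields": "file_id,file_name,cases.submitter_id,cases.samples.submitter_id",
        "format": "JSON",
        "size": len(file_ids),
    }


def barcodes_from_response(response):
    """Map each file_id in a decoded JSON response to its list of sample barcodes."""
    mapping = {}
    for hit in response["data"]["hits"]:
        mapping[hit["file_id"]] = [
            sample["submitter_id"]
            for case in hit.get("cases", [])
            for sample in case.get("samples", [])
        ]
    return mapping


file_ids = [
    "d1017f74-3a39-4427-af57-273e34247b49",
    "5e2d5c52-596f-49bc-967c-42129abbacbf",
]
payload = build_query(file_ids)
print(json.dumps(payload, indent=2))
# The payload would be sent as a JSON POST to GDC_FILES_ENDPOINT,
# e.g. with the third-party `requests` package (not done here).

# Mock response with placeholder values, only to show the parsing step.
mock_response = {
    "data": {
        "hits": [
            {
                "file_id": file_ids[0],
                "cases": [
                    {
                        "submitter_id": "TCGA-XX-XXXX",
                        "samples": [{"submitter_id": "TCGA-XX-XXXX-01A"}],
                    }
                ],
            }
        ]
    }
}
mapping = barcodes_from_response(mock_response)
print(mapping)
```

Sending all UUIDs in one `in` filter avoids one request per file, which is the main reason a per-file loop is slow.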
Sorry, I didn't understand what the filenames object is. What are the sample filenames?
Just a vector of your filenames, such as:
As shown in the manifest above, I already have the UUID. What I need is the TCGA sample name. Do I need to log in to GDC and search for this?
I see. For UUID-to-TCGA barcode mapping, I was able to just use one of the clinical data files in BioTab format (also available from the Legacy Archive).
For example, here is the file for breast cancer: https://portal.gdc.cancer.gov/legacy-archive/files/735bc5...
The first 2 columns of that file are:
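The column listing itself is not reproduced here. As a minimal sketch of turning any such tab-delimited file into a UUID-to-barcode lookup, the Python below uses placeholder column names (`uuid_col`, `barcode_col`) and made-up rows; substitute the real header names from the file you downloaded. The `skip_after_header` parameter is a hypothetical convenience in case the file carries extra descriptive rows beneath the header.

```python
import csv
import io


def build_lookup(text, key_col, value_col, skip_after_header=0):
    """Build {key_col value: value_col value} from tab-delimited text.

    skip_after_header drops that many lines directly below the header row.
    """
    lines = text.splitlines()
    kept = [lines[0]] + lines[1 + skip_after_header:]
    reader = csv.DictReader(io.StringIO("\n".join(kept)), delimiter="\t")
    return {row[key_col]: row[value_col] for row in reader}


# Made-up example; the column names and values are placeholders, not real data.
example = (
    "uuid_col\tbarcode_col\tother\n"
    "description of uuid\tdescription of barcode\tdescription of other\n"
    "aaaa-1111\tTCGA-XX-0001\tx\n"
    "bbbb-2222\tTCGA-XX-0002\ty\n"
)

lookup = build_lookup(example, "uuid_col", "barcode_col", skip_after_header=1)
print(lookup["aaaa-1111"])
```

Note that a clinical file is keyed by case (patient) UUID, so the file UUIDs from a download manifest would first have to be resolved to case UUIDs before this lookup applies.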
Somehow the code below is not working anymore. However, the "UUIDtoBarcode" function from the TCGAutils R package gives the solution.
That is expected. The function that I wrote above is inefficient and served a specific purpose at that time. Did you try Vasu's (Sean Davis's) code below?