Duplicate samples in TCGA Breast cancer data. Which one to pick?
6.6 years ago
Vasu ▴ 790

Hi,

I have downloaded TCGA breast cancer data: a total of 1256 FASTQ files. I have the UUIDs, so I used the "GenomicDataCommons" package to get the TCGA barcodes for those UUIDs. However, several UUIDs map to the same sample name. Which one should I pick for the analysis?

UUID                                    samplenames
5516dd59-3d95-4bc6-84e7-5719b1bbcabf    TCGA-A7-A26F-01B
a907f2d1-92ad-4a1b-b439-20e5a7347d5b    TCGA-A7-A26F-01A
b570a72f-5e6c-4301-923b-9992662409ca    TCGA-A7-A26F-01B
ba22d7e6-3e70-4a43-9dc1-59069b39e8c2    TCGA-A7-A26F-01B
eb068925-2dcc-4e18-838f-903ac8d2b661    TCGA-A7-A26F-01A
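
As a rough illustration, duplicates like these can be listed by grouping the UUIDs on the short barcode. The sketch below is plain Python and uses only the five rows from the table above; the function name `group_by_sample` is made up for the example.

```python
from collections import defaultdict

# UUID -> short TCGA barcode, copied from the table in the question.
uuid_to_sample = {
    "5516dd59-3d95-4bc6-84e7-5719b1bbcabf": "TCGA-A7-A26F-01B",
    "a907f2d1-92ad-4a1b-b439-20e5a7347d5b": "TCGA-A7-A26F-01A",
    "b570a72f-5e6c-4301-923b-9992662409ca": "TCGA-A7-A26F-01B",
    "ba22d7e6-3e70-4a43-9dc1-59069b39e8c2": "TCGA-A7-A26F-01B",
    "eb068925-2dcc-4e18-838f-903ac8d2b661": "TCGA-A7-A26F-01A",
}


def group_by_sample(mapping):
    """Return {short barcode: [UUIDs]} (hypothetical helper)."""
    groups = defaultdict(list)
    for uuid, sample in mapping.items():
        groups[sample].append(uuid)
    return dict(groups)


groups = group_by_sample(uuid_to_sample)

# Report every short barcode that is backed by more than one file.
for sample, uuids in sorted(groups.items()):
    if len(uuids) > 1:
        print(sample, len(uuids), "files")
```
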
RNA-Seq tcga breast gdc • 6.2k views

Yes, but for the GDC legacy data I don't see any aliquots like those shown for the GDC harmonized data.

https://portal.gdc.cancer.gov/legacy-archive/files/a907f2d1-92ad-4a1b-b439-20e5a7347d5b


@Sean Davis Could you please help me with this? With the "GenomicDataCommons" package I got the submitter IDs for the UUIDs, but there are duplicates. Which one should I pick? I don't even have the plate number to select the samples by. Is there a way to get the whole TCGA barcode, like "TCGA-A6-6781-01A-22R-A278-07", from the UUIDs so that I can select based on plate numbers?


Tagging: Sean Davis

6.6 years ago

I've just done some further investigating. It's possible to locate the full barcode of these files.

Essentially, as you can already tell, the following 3 UUIDs belong to the same aliquot:

  • 5516dd59-3d95-4bc6-84e7-5719b1bbcabf
  • b570a72f-5e6c-4301-923b-9992662409ca
  • ba22d7e6-3e70-4a43-9dc1-59069b39e8c2

One can tell this by the matched short TCGA barcode (TCGA-A7-A26F-01B), and also by the matched Entity ID and Case ID on their respective GDC Legacy Archive records. The full TCGA barcode of these is: TCGA-A7-A26F-01B-04R-A22O-07

-------------------------------------------------------------

For the other 2 samples:

  • a907f2d1-92ad-4a1b-b439-20e5a7347d5b
  • eb068925-2dcc-4e18-838f-903ac8d2b661

These have the same Case ID as the other samples but a different matched Entity ID, so they come from a different aliquot. Their full TCGA barcode is: TCGA-A7-A26F-01A-21R-A169-07

-------------------------------------------------

Edit 27th September 2018

In situations where you have a duplicate short TCGA barcode / sample, the Broad Institute recommends taking the sample with the "highest lexicographical sort value" for the plate number - see HERE and HERE. The plate number is the penultimate segment of the full TCGA barcode (e.g., A22O in TCGA-A7-A26F-01B-04R-A22O-07).
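
A minimal sketch of that plate rule, assuming full 7-segment aliquot barcodes and plain string comparison as the "lexicographical sort". The helper names and the two `TCGA-XX-0000` barcodes are made up purely to show a tie being broken; they are not real samples.

```python
def parse_barcode(barcode):
    """Split a full TCGA aliquot barcode into its dash-separated fields."""
    parts = barcode.split("-")
    if len(parts) != 7:
        raise ValueError(f"not a full aliquot barcode: {barcode!r}")
    return {
        "short_barcode": "-".join(parts[:4]),  # e.g. TCGA-A7-A26F-01A
        "portion_analyte": parts[4],           # e.g. 21R
        "plate": parts[5],                     # penultimate segment
        "center": parts[6],
    }


def pick_highest_plate(barcodes):
    """For each short barcode keep the aliquot whose plate sorts highest."""
    best = {}
    for barcode in barcodes:
        fields = parse_barcode(barcode)
        key = fields["short_barcode"]
        current = best.get(key)
        if current is None or fields["plate"] > parse_barcode(current)["plate"]:
            best[key] = barcode
    return best


candidates = [
    "TCGA-XX-0000-01A-11R-A100-07",  # hypothetical duplicate, lower plate
    "TCGA-XX-0000-01A-11R-A200-07",  # hypothetical duplicate, higher plate
    "TCGA-A7-A26F-01A-21R-A169-07",  # from this thread, no competing plate
]
chosen = pick_highest_plate(candidates)
for short_barcode, barcode in sorted(chosen.items()):
    print(short_barcode, "->", barcode)
```

Note that this only separates aliquots that sit on different plates; files that share one full barcode, like the ones in this thread, still need another criterion.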

Kevin


How do I download the Biospecimen XML file? I don't see any download option for it in the GDC legacy archive.


I edited the final part of my comment since you posted yours.

  1. Go here: https://portal.gdc.cancer.gov/cases/3b7b9c1e-a84c-47ed-983c-9e4b00cbf01a?bioId=2a4747b5-1eeb-45b1-9e92-0e0e3d7a9c1b
  2. Search for nationwidechildrens.org_biospecimen.TCGA-A7-A26F.xml on the page
  3. Download the XML file and open it
  4. Search for the Entity IDs for your samples

Ok. So, for all the duplicate samples I have to download the XML. But the link you gave is for the harmonized data, while the data I downloaded is from the GDC legacy archive.


There is only 1 biospecimen XML file for the TCGA patient whose barcode is TCGA-A7-A26F. If you search that biospecimen XML for the 2 Entity IDs that you have for your 5 samples, then you'll see the full TCGA barcodes.
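
If you have many patients, the same lookup could be scripted rather than done by hand. The sketch below assumes that each aliquot in the biospecimen XML is an `aliquot` element with `bcr_aliquot_barcode` and `bcr_aliquot_uuid` children, and that the Entity ID shown on the portal is that aliquot UUID; check both assumptions against your own file. The toy XML and its namespace URL are stand-ins, not the real file.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def aliquot_barcodes(xml_data):
    """Return {aliquot UUID: full barcode} from biospecimen XML (str or bytes)."""
    root = ET.fromstring(xml_data)
    mapping = {}
    for elem in root.iter():
        if local_name(elem.tag) != "aliquot":
            continue
        # Collect the text of the direct children, keyed by local tag name.
        fields = {local_name(c.tag): (c.text or "").strip() for c in elem}
        uuid = fields.get("bcr_aliquot_uuid", "").lower()
        barcode = fields.get("bcr_aliquot_barcode", "")
        if uuid and barcode:
            mapping[uuid] = barcode
    return mapping


# Toy document with the assumed layout; IDs and barcodes are from this thread.
toy_xml = """<root xmlns:bio="http://example.org/bio">
  <bio:aliquot>
    <bio:bcr_aliquot_barcode>TCGA-A7-A26F-01A-21R-A169-07</bio:bcr_aliquot_barcode>
    <bio:bcr_aliquot_uuid>2a4747b5-1eeb-45b1-9e92-0e0e3d7a9c1b</bio:bcr_aliquot_uuid>
  </bio:aliquot>
  <bio:aliquot>
    <bio:bcr_aliquot_barcode>TCGA-A7-A26F-01B-04R-A22O-07</bio:bcr_aliquot_barcode>
    <bio:bcr_aliquot_uuid>1B907925-B33C-4E4A-96E0-65F15B4712B9</bio:bcr_aliquot_uuid>
  </bio:aliquot>
</root>"""

barcode_by_entity = aliquot_barcodes(toy_xml)
for entity_id, barcode in sorted(barcode_by_entity.items()):
    print(entity_id, barcode)
```

For a downloaded file, something like `aliquot_barcodes(pathlib.Path("nationwidechildrens.org_biospecimen.TCGA-A7-A26F.xml").read_bytes())` should work, provided the layout assumption holds.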

Further investigation leads me to advise you not to use the 01B samples. Going by the biospecimen data, these are from an FFPE validation that was originally performed. Use the 01A samples and treat them as replicate RNA-seq samples in your study.


I just checked: it is also available in the GDC legacy archive. Thank you!


Oh, yes, it should be there too. Please read my latest comment too. It appears that the 01B samples are FFPE, so, that's justification enough to not use those.


Sure. Thank you very much!


Sorry, I'm now just confirming, for anyone else coming here, what I am looking at to gauge whether it's FFPE or not.

Here are lines from the biospecimen for your 2 Entity IDs (note the reference to FFPE):

  • TCGA Barcode: TCGA-A7-A26F-01A-21R-A169-07
  • Entity ID: 2a4747b5-1eeb-45b1-9e92-0e0e3d7a9c1b
  • File UUIDs: a907f2d1-92ad-4a1b-b439-20e5a7347d5b; eb068925-2dcc-4e18-838f-903ac8d2b661

[Image: biospecimen XML lines for the 01A aliquot]


  • TCGA barcode: TCGA-A7-A26F-01B-04R-A22O-07
  • Entity ID: 1b907925-b33c-4e4a-96e0-65f15b4712b9
  • File UUIDs: 5516dd59-3d95-4bc6-84e7-5719b1bbcabf; b570a72f-5e6c-4301-923b-9992662409ca; ba22d7e6-3e70-4a43-9dc1-59069b39e8c2

[Image: biospecimen XML lines for the 01B aliquot]


So, from the samples in my question, I can keep only one, which will be TCGA-A7-A26F-01A. But two UUIDs still have the same TCGA barcode, "TCGA-A7-A26F-01A-21R-A169-07". Of these two, UUID "a907f2d1-92ad-4a1b-b439-20e5a7347d5b" has a 10 GB FASTQ and UUID "eb068925-2dcc-4e18-838f-903ac8d2b661" has a 13 GB FASTQ. Which one should I prefer?


Check the quality of both files in the non-FFPE sample (01A). File size is no reflection of quality of the reads.


You mean I need to take both files through alignment and then check the reads? Is that what you are saying, or something else?


You can use something like FastQC from the Babraham Institute to look at the FASTQ qualities. You can then also gauge quality post-alignment, such as the alignment percentage and the number of unique alignments.
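
For a very quick first comparison of the two files, a crude stand-in (this is not FastQC, which reports far more) is to compute the mean base quality over the first reads. The sketch assumes standard 4-line FASTQ records with Phred+33 quality encoding; the function names and the `max_reads` default are arbitrary choices for the example.

```python
import gzip
import io


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def mean_phred(handle, max_reads=100_000):
    """Mean Phred score over the first max_reads reads (Phred+33 assumed)."""
    total = 0
    n_bases = 0
    for i, line in enumerate(handle):
        if i // 4 >= max_reads:
            break
        if i % 4 == 3:  # the 4th line of each record holds the qualities
            qual = line.rstrip("\r\n")
            total += sum(ord(ch) - 33 for ch in qual)
            n_bases += len(qual)
    return total / n_bases if n_bases else float("nan")


# Tiny in-memory example: 'I' encodes Phred 40 and '!' encodes Phred 0.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
demo_mean = mean_phred(demo)
print(demo_mean)
```

On real data you would call `mean_phred(open_fastq(path))` once per file and compare the two numbers, then still confirm with FastQC and the post-alignment statistics.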
