Welcome to the TCGA data. The more you work with this data, the more inconsistencies you will find, so care is required.
Any one or more of the following reasons could explain what you have found:
- the TCGA dataset contains multiple biopsies from the same original
tumour, which were then removed in the FireBrowse data
- as the TCGA data was processed in different centers, in some cases
the same biopsy from the same tumour was sequenced twice (or more)
in different centers
- some tumour samples that were derived from FFPE (formalin-fixed,
paraffin-embedded) material were removed in the FireBrowse dataset
On point 2, you'd think that at least the multiple centers would use the same analysis pipeline, but they didn't. The open access TCGA mutation data is an agglomeration of somatic variant calls from different variant callers, which introduces bias, of course.
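If you want to check whether the first two reasons apply to your own files, one rough approach is to group the sample barcodes by patient. This is only a sketch: it assumes you have already pulled the barcodes out of your mutation files into a list, it relies on the first 12 characters of a TCGA barcode identifying the patient, and the function names and example barcodes are made up for illustration.

```python
from collections import defaultdict


def group_by_patient(barcodes):
    """Map each patient ID (first 12 characters of a TCGA barcode) to its barcodes."""
    groups = defaultdict(set)
    for barcode in barcodes:
        groups[barcode[:12]].add(barcode)
    return groups


def patients_with_multiple_entries(barcodes):
    """Return only the patients that appear under more than one distinct barcode."""
    return {
        patient: sorted(found)
        for patient, found in group_by_patient(barcodes).items()
        if len(found) > 1
    }


# Made-up barcodes, for illustration only (not real TCGA samples).
example = [
    "TCGA-AA-0001-01A-01D-0001-08",
    "TCGA-AA-0001-01B-01D-0002-08",
    "TCGA-AA-0002-01A-01D-0003-09",
]

if __name__ == "__main__":
    for patient, found in patients_with_multiple_entries(example).items():
        print(patient, found)
```

Running this on the barcodes from each source and comparing the results should show which patients have extra samples in one dataset but not the other.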
I quickly checked FireBrowse and, when you choose to download the raw mutation data and the new small window opens, you will see some text at the top of the window that says:
Files may also be downloaded here, or with firehose_get, or exported
to GenomeSpace with the SendTo tab.
Click on the link 'here', and you will then be taken to an FTP server where you can get the data. The files that you downloaded are just MD5 checksums, which are used to check the integrity of the main files after they have been downloaded.
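Once you have the main files, the MD5 files can be used along these lines. This is a minimal sketch, assuming each checksum file holds the hexadecimal hash as its first whitespace-separated field (the usual md5sum layout, which I have not confirmed for every FireBrowse archive). The function names are my own.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's digest with the hash stored in its checksum file.

    Assumes the expected hash is the first whitespace-separated field.
    """
    expected = Path(md5_path).read_text().split()[0].lower()
    return md5_of_file(data_path) == expected
```

A result of False from matches_md5_file means the download is incomplete or corrupted and should be repeated.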
Aside from everything that I've mentioned here, my recommendation is to go with the FireBrowse data because you can then at least cite FireBrowse and avoid having to deal with the many issues related to the data taken directly from the TCGA GDC Data Portal.
Kevin
Thanks for this detailed reply Kevin!
I have one follow-up query. The two files available for download at FireBrowse are also slightly different. Is that because different variant callers were used?
Thanks again for your help!
You mean the 'oncotated' versus the other? For information on that, you should take a look here: Mutation Pipelines