5.8 years ago
shoujun.gu
I found that Firebrowse lists 460 cases for COAD, but the downloaded Mutation Annotation archive contains only 154 MAF files. Why are there significantly fewer MAF files than cases?
The downloaded Raw Mutation Annotation archive contains 367 MAF files, which is still fewer than the number of cases. What is the difference between the Raw Annotation MAFs and the Annotation MAFs? Why were more than half of the MAF files filtered out of the Annotation MAFs?
Thank you.
Please quote the exact sources of the data that you downloaded. Any one or more of many reasons could account for the discrepancies in the numbers. Note that Firebrowse is a 'third party' and is not the NIH. Firebrowse took the data that the NIH produced and then applied their own processing methodologies. Also consider that the numbers may reflect variant calls in both normal and tumour tissues, or somatic variant calls in the tumours with respect to the normals. Replicate tumour and normal samples may also have been combined by some rule.
The Annotation mutation file I downloaded is Mutation_Packager_Calls (MD5), with file name: gdac.broadinstitute.org_COAD.Mutation_Packager_Calls.Level_3.2016012800.0.0.tar.gz
The Raw Annotation mutation file I downloaded is Mutation_Packager_Raw_Calls (MD5), with file name: gdac.broadinstitute.org_COAD.Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0.tar.gz
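To see exactly which cases are present in one archive but missing from the other, the file names inside the two tarballs can be compared directly. The following is only a rough sketch: it assumes that each per-sample MAF inside the archive is named with the TCGA barcode followed by `.maf.txt` (check the actual member names first), and the helper names `maf_case_ids` and `maf_cases_in_archive` are made up for illustration. The first 12 characters of a TCGA barcode (TCGA-XX-XXXX) identify the case.

```python
import tarfile
from pathlib import PurePosixPath


def maf_case_ids(member_names, suffix=".maf.txt"):
    """Collect the 12-character TCGA case barcodes from MAF file names.

    Assumption: per-sample MAF files are named '<TCGA barcode><suffix>'.
    Anything that does not match (manifests, folders) is ignored.
    """
    cases = set()
    for name in member_names:
        base = PurePosixPath(name).name  # drop the folder part
        if base.startswith("TCGA-") and base.endswith(suffix):
            cases.add(base[:12])  # TCGA-XX-XXXX identifies the case
    return cases


def maf_cases_in_archive(path, suffix=".maf.txt"):
    """Open a .tar.gz archive and return the case barcodes of its MAF files."""
    with tarfile.open(path, "r:gz") as tar:
        return maf_case_ids(tar.getnames(), suffix)


# Example usage with the two downloaded archives (paths are local copies):
# calls = maf_cases_in_archive(
#     "gdac.broadinstitute.org_COAD.Mutation_Packager_Calls.Level_3.2016012800.0.0.tar.gz")
# raw = maf_cases_in_archive(
#     "gdac.broadinstitute.org_COAD.Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0.tar.gz")
# print(len(calls), len(raw))
# print(sorted(raw - calls)[:10])  # cases only present in the raw calls
```

The set difference `raw - calls` gives the cases that have a raw MAF but no filtered one, which is a starting point for working out why they were dropped.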
Thanks! What happened with the TCGA data was that, after a certain period of time, they 'froze' the processing of new samples so that they could actually publish the work. Since the publications, many thousands of new samples have been processed by the various TCGA centers, and the data subsequently made available. This is why the TCGA project is still very much ongoing (but I do not know much about the funding picture).
What the Broad Institute (Firebrowse) did was stay up to date with all new samples being produced by the TCGA centers. I cannot comment on the naming convention of 'raw MAF' but, in any case, this explains the discrepancy.
There is more information through these links:
If I go to the GDC Data Portal right now, which is the primary 'source' of the TCGA data, I see 4 MAF files that have >400 cases. Please see this configured search.
Ultimately, do not get too disheartened by the numbers not agreeing. This always happens with TCGA data. If I need TCGA data, I usually take it from the GDC Data Portal and NOT a third party. The third party providers, I find, do not organise their data very well, and confusion arises a lot.
Thank you for your reply!
The reason I looked for the data in Firebrowse is that they provide RNASeq data normalized between cases, which the GDC Data Portal does not have (correct me if I'm wrong).