3.9 years ago
P
I am looking to compare mutation frequencies between two databases, for example TCGA and cBioPortal, for the same set of genes.
I understand that Fisher's exact test may be the most suitable approach, but I am not able to wrap my head around the actual input data needed for it.
Sample information I have:
Gene   cbioportal_n  cbioportal_mutation_Freq_per_model  TCGA_n  TCGA_mutation_Freq
Gene1  91            76.0%                               128     72.7%
Also, is there a way to automate this process in a script, with R or Python?
Thanks!
What is the purpose of this exercise? As far as I understand, cBioPortal is not a data source; it's a place to look at data collected elsewhere. Strictly speaking, TCGA is not a raw data source either, but it is the closest you can get to one. cBioPortal operates on various TCGA datasets, so this comparison is comparing subsets of the same dataset.
I should perhaps edit the title to "Compare frequency of gene mutations in two different databases (internal db vs TCGA)"; I simply took TCGA and cBioPortal as an example. The idea is to see if there are any significant differences in the occurrence of gene mutations between the two databases. Thanks!
Ah, I see. That makes a lot more sense.
I don't see how that can be done in a statistically meaningful manner though. After all, per gene, you'd have just two numbers: freq_in_internal_db and freq_in_external_db. How would you do Fisher's test here? Are you testing if genes on average have a higher mutation rate in one cohort vs the other?
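For what it's worth, the sample row in the question does include cohort sizes (n = 91 and n = 128), so the percentages can be converted back into approximate counts of mutated and non-mutated samples, which gives a 2x2 table per gene. Below is a minimal Python sketch of that idea, not a definitive implementation. The helper names (counts_from_freq, fisher_exact_two_sided) are made up for illustration, rounding the percentage to a whole number of samples is an assumption, and the test itself assumes the two cohorts are independent, which would not hold if one database is a subset of the other.

```python
from math import comb

def counts_from_freq(n, freq_percent):
    """Recover (mutated, not_mutated) counts from a cohort size and a percentage.
    Assumption: the reported percentage is mutated / n, rounded for display."""
    mutated = round(n * freq_percent / 100)
    return mutated, n - mutated

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x mutated samples in the first cohort,
        # with all row and column totals held fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)

# One tuple per gene: (gene, n in db 1, freq % in db 1, n in db 2, freq % in db 2).
# Only Gene1 comes from the question; further genes would be appended here.
rows = [("Gene1", 91, 76.0, 128, 72.7)]

results = {}
for gene, n1, f1, n2, f2 in rows:
    mut1, wt1 = counts_from_freq(n1, f1)
    mut2, wt2 = counts_from_freq(n2, f2)
    results[gene] = fisher_exact_two_sided(mut1, wt1, mut2, wt2)
    print(gene, (mut1, wt1), (mut2, wt2), results[gene])
```

If SciPy is available, scipy.stats.fisher_exact([[mut1, wt1], [mut2, wt2]]) can replace the hand-written function. With many genes tested this way, the resulting p-values would normally need a multiple-testing correction before any of them is called significant.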