I have two groups of genes, and I want to know what percentage of the genes in group A and in group B are expressed in the TCGA breast cancer data set. For example, suppose there are 100 genes in group A and 200 genes in group B. After summarization, 45 genes in group A and 100 genes in group B are expressed (45% vs 50%).
The core question is how to define whether a gene is expressed in the TCGA dataset. I define a gene as expressed if its median RPKM value is >= 0.1 and it has zero expression in fewer than one fourth of patients; otherwise it is not expressed. This cutoff comes from a paper (https://peerj.com/articles/1499/), in which the authors used it to decide which genes to include in the next step, a survival analysis.
My question is whether this approach is suitable. Do you have a better way to define expression? It would be best if you could provide some references.
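To make the rule above concrete, here is a minimal sketch in plain Python. It assumes the RPKM values are already loaded into a dictionary mapping gene name to a list of per-patient values; the function names, the dictionary layout, and the toy numbers are all made up for illustration and are not from the paper.

```python
from statistics import median

def is_expressed(rpkm_values, median_cutoff=0.1, max_zero_fraction=0.25):
    """Apply the two-part rule: median RPKM >= cutoff AND
    zero expression in fewer than max_zero_fraction of patients."""
    zero_fraction = sum(1 for v in rpkm_values if v == 0) / len(rpkm_values)
    return median(rpkm_values) >= median_cutoff and zero_fraction < max_zero_fraction

def percent_expressed(gene_set, rpkm_by_gene):
    """Percentage of genes in gene_set (that are present in the data) called expressed."""
    genes = [g for g in gene_set if g in rpkm_by_gene]
    n_expressed = sum(is_expressed(rpkm_by_gene[g]) for g in genes)
    return 100.0 * n_expressed / len(genes)

# Toy data (hypothetical gene names and values, 4 or 5 "patients" per gene).
rpkm_by_gene = {
    "GENE1": [0.0, 2.5, 3.1, 4.0],        # zeros in exactly 25% of patients: fails "fewer than 25%"
    "GENE2": [0.5, 1.2, 0.9, 2.0],        # passes both filters
    "GENE3": [0.0, 0.0, 0.05, 0.2],       # median below 0.1 and too many zeros
    "GENE4": [5.0, 6.0, 0.0, 7.0, 8.0],   # zeros in 20% of patients, high median: passes
}
group_a = ["GENE2", "GENE4"]
group_b = ["GENE1", "GENE3"]

print("Group A:", percent_expressed(group_a, rpkm_by_gene), "% expressed")
print("Group B:", percent_expressed(group_b, rpkm_by_gene), "% expressed")
```

Note that a gene with zeros in exactly one fourth of patients is treated as not expressed here, because the rule says "fewer than one fourth"; adjust the comparison if you read the paper's cutoff differently.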
There are different ways of viewing what is expressed and what is not. Transcription in the cell is 'pervasive' and is constantly occurring, even in regions that we do not know to have any function. Transcription factors bind wherever there is accessible chromatin and where there is an electromagnetic potential to bind, mediated via different motifs in the DNA sequence of the accessible chromatin. Most transcripts are in fact non-coding, as you probably know. Most transcripts are also expressed at very low levels, but they are nevertheless still expressed.
Do you want to simply gauge anything that is expressed or do you want to gauge things that are more expressed in one group over another?
After you normalise your data and filter for missingness, you can more or less assume that everything that has a value has exhibited some form of expression. If you have FPKM or RPKM, setting your threshold at 10 is a reasonable idea. Nobody can really argue against 10 as a cut-off; neither could one argue with 5, or 15.
If you instead want to determine expression in a particular group, first transform your data to the Z-scale and then choose those genes whose absolute Z-scores are greater than 2 or 3.
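One possible reading of the Z-score suggestion, as a rough sketch only: scale each gene's values across samples to mean 0 and standard deviation 1, then flag the samples where the absolute Z-score exceeds the threshold. The per-gene scaling, the function names, and the toy values are assumptions made for illustration; in practice the values would usually be log-transformed before scaling.

```python
from statistics import mean, stdev

def z_scores(values):
    """Scale one gene's expression values across samples to the Z-scale."""
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation
    if sd == 0:
        # A gene with identical values everywhere has no variation to scale.
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]

def samples_beyond_threshold(values, threshold=2.0):
    """Indices of samples whose absolute Z-score is greater than the threshold."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]

# Toy example (hypothetical values): nine samples at 1.0 and one at 10.0.
gene_values = [1.0] * 9 + [10.0]
print(samples_beyond_threshold(gene_values, threshold=2.0))
```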
cBioPortal most likely can already do what you want.
Under the first situation you mention above (the FPKM situation), I can determine whether gene A is expressed in a particular sample (expressed if FPKM >= 10, not expressed if FPKM < 10). I then make this call for gene A in every sample (there are over 1000 TCGA breast cancer samples, so I have over 1000 FPKM values for gene A) and set a second cutoff at 25%: gene A is prevalently expressed in the TCGA breast cancer samples if it is expressed in more than 25% of them.
I have two gene sets. I do the calculation above for every gene and compare the percentage of prevalently expressed genes between the two sets. I want to conclude that set A has more prevalently expressed genes than set B.
Thanks, Kevin
Is it reasonable to reach that conclusion this way?
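For reference, the two-cutoff calculation described in this comment could be sketched as follows. This assumes the FPKM values sit in a dictionary of gene name to per-sample values; the names, layout, and toy numbers are hypothetical, and the sketch only computes the two percentages, it does not test whether their difference is meaningful.

```python
def is_prevalently_expressed(fpkm_values, fpkm_cutoff=10.0, min_sample_fraction=0.25):
    """First cutoff: a sample expresses the gene if FPKM >= fpkm_cutoff.
    Second cutoff: the gene is prevalently expressed if more than
    min_sample_fraction of samples express it."""
    n_expressing = sum(1 for v in fpkm_values if v >= fpkm_cutoff)
    return n_expressing / len(fpkm_values) > min_sample_fraction

def percent_prevalent(gene_set, fpkm_by_gene):
    """Percentage of genes in gene_set (present in the data) that are prevalently expressed."""
    genes = [g for g in gene_set if g in fpkm_by_gene]
    n_prevalent = sum(is_prevalently_expressed(fpkm_by_gene[g]) for g in genes)
    return 100.0 * n_prevalent / len(genes)

# Toy data (hypothetical): four samples per gene instead of 1000+.
fpkm_by_gene = {
    "GENE_A1": [12.0, 15.0, 3.0, 0.5],   # 2 of 4 samples >= 10: prevalent
    "GENE_A2": [11.0, 2.0, 1.0, 0.0],    # 1 of 4 samples = exactly 25%: not "over 25%"
    "GENE_B1": [0.1, 0.2, 0.0, 4.0],     # no sample >= 10: not prevalent
    "GENE_B2": [30.0, 25.0, 40.0, 18.0], # all samples >= 10: prevalent
}
set_a = ["GENE_A1", "GENE_A2"]
set_b = ["GENE_B1", "GENE_B2"]

print("Set A:", percent_prevalent(set_a, fpkm_by_gene), "% prevalently expressed")
print("Set B:", percent_prevalent(set_b, fpkm_by_gene), "% prevalently expressed")
```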