Hi!
This may be a basic question, but I have performed variant calling of SNPs and INDELs on several patients. Now I want to study the mutated genes across the whole cohort, so I built a table listing each gene mutated in the cohort and the number of samples in which it is mutated:
Gene nbsamples
TP53 3
KRAS 5
. .
. .
. .
Now I want to perform a pathway enrichment analysis, but I would like to take into account not just the mutated genes but also their mutation frequency in the population, in order to get better statistical power.
Do you know how I can do this?
Thanks! But I don't have expression data for these genes. I just extracted the mutated genes from the VCF of my cohort and generated a table with the information shown.
No, you do not need expression data. The only required input is a list of genes and a background of analyzed genes, plus the parameter by which each gene may be biased. The package then performs enrichment tests to see whether any pathways are overrepresented in your list of genes, while taking into account that some of your genes may appear more often than expected by chance because of the bias parameter (the analysis is like a typical gene ontology enrichment).
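To make the "list of genes plus background" idea concrete, here is a minimal sketch of a plain over-representation test for one pathway, using a one-sided hypergeometric p-value. It is written in Python with only the standard library and is not the package discussed above: it does not apply any bias correction, and the gene sets and the pathway membership are made-up toy values.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_mutated, n_overlap):
    """One-sided p-value P(X >= n_overlap).

    X is the number of pathway genes obtained when drawing n_mutated genes
    without replacement from n_background genes, n_pathway of which belong
    to the pathway. No bias correction is applied.
    """
    upper = min(n_pathway, n_mutated)
    total = comb(n_background, n_mutated)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_mutated - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy data (hypothetical): a 10-gene background instead of all human genes.
background = {"TP53", "KRAS", "EGFR", "BRCA1", "BRCA2",
              "MYC", "PTEN", "APC", "RB1", "ATM"}
mutated = {"TP53", "KRAS", "PTEN"}          # genes mutated in the cohort
pathway = {"TP53", "KRAS", "EGFR", "MYC"}   # hypothetical pathway membership

overlap = mutated & pathway
p = hypergeom_enrichment_p(len(background), len(pathway),
                           len(mutated), len(overlap))
print(f"{len(overlap)} of {len(mutated)} mutated genes in pathway, p = {p:.4f}")
```

In a real analysis the test would be repeated for every pathway and the p-values corrected for multiple testing; the choice of background changes the result directly, which is why it matters.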
Okay, but I don't really understand what I should include as background: the variant calling was performed on whole genome sequencing data, so I suppose my background would be all human genes. Moreover, the bias parameter shouldn't be the number of samples in which each gene is mutated; if I understand correctly, I could use gene size as the bias. What I want is for the program to take into account that a gene mutated in more samples should weigh more than a gene that is mutated in just one.
OK, now I understand your goal. I'm not sure this is addressed in general by the different types of pathway enrichment analysis. There are gene enrichment approaches where you can input a ranked list of genes, and the ranking is usually based on fold change, p-value, etc. For example, GOrilla is a common web tool that suggests using expression levels to pre-rank genes. Maybe you could use your "number of samples per gene" as the ranking. However, I think it would help to know more details about how you obtained the list of genes (for example whether they have associated p-values, because parameters like this would be important), so personally I cannot suggest any specific approach.
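As a rough illustration of using "number of samples per gene" as a ranking, the sketch below sorts a gene table by sample count and produces one gene name per line, most frequently mutated first. It is plain Python; the function name is hypothetical, EGFR is a made-up entry added to show a tie, and the exact input format expected by any particular ranked-list tool should be checked in that tool's documentation.

```python
def rank_genes_by_samples(counts):
    """Return gene names sorted by number of mutated samples, highest first.

    Ties are broken alphabetically so that the output is reproducible.
    """
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _n in ordered]


# TP53 and KRAS come from the table in the question; EGFR is hypothetical.
counts = {"TP53": 3, "KRAS": 5, "EGFR": 3}

ranked = rank_genes_by_samples(counts)
ranked_text = "\n".join(ranked)  # one gene per line, ready to paste or save
print(ranked_text)
```

Because the counts are small integers, many genes will share the same value, so the order within a tie is arbitrary and a rank-based method gets less information than it would from a continuous score.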
Thank you!
What I did was convert the VCF from each sample (annotated with VEP) to a data frame with VariantAnnotation. Then I filtered the variants to keep only the deleterious ones (as classified by SIFT and PolyPhen), and concatenated the data frames of all samples. Finally, I extracted the genes affected by those variants and counted the number of samples in which each gene was mutated, which is the table I showed.
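For reference, the counting step described above can be sketched as follows. This is a Python illustration rather than the R/VariantAnnotation code actually used, and the sample IDs and variant rows are invented. The point it shows is that a gene with several deleterious variants in the same sample should be counted once for that sample.

```python
from collections import Counter


def count_samples_per_gene(variants):
    """Count, for each gene, the number of distinct samples in which it is mutated.

    variants: iterable of (sample_id, gene) pairs, one pair per deleterious
    variant, i.e. the concatenated per-sample tables reduced to two columns.
    """
    unique_pairs = set(variants)  # collapse repeated (sample, gene) pairs
    return Counter(gene for _sample, gene in unique_pairs)


# Hypothetical concatenated table: P1 carries two TP53 variants.
variants = [
    ("P1", "TP53"), ("P1", "TP53"), ("P1", "KRAS"),
    ("P2", "KRAS"),
    ("P3", "TP53"),
]

nbsamples = count_samples_per_gene(variants)
for gene, n in sorted(nbsamples.items()):
    print(gene, n)
```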