Question

Pathways analysis taking into account the number of times a gene is present among several patients

5.0 years ago

jeni ▴ 90

Hi!

This may be a basic question, but I have performed variant calling of SNPs and INDELs on several patients. Now I want to study the mutated genes across the whole cohort, so I built a table listing each gene mutated in the cohort and the number of samples in which it is mutated:

Gene     nbsamples
TP53          3
KRAS          5 
 .            .
 .            .
 .            .

Now I want to perform a pathway enrichment analysis, but I would like to take into account not just which genes are mutated but also their mutation frequency in the population, in order to gain statistical power.

Do you know how I can do this?

snp • 1.1k views

ADD COMMENT • link 5.0 years ago by jeni ▴ 90

score 0 · Answer 1 · 2020-05-14

5.0 years ago

Papyrus ★ 3.1k

I'm not entirely sure I understood how you came to have the list, but if you mean correcting for the bias of genes appearing more or less often in your list because of their baseline population frequency, you may want to check out the R/Bioconductor package goseq. It performs pathway enrichment of the type where you have 1) a list of genes of interest and 2) a background of all analyzed genes (the vignette has a clear tutorial). It lets you adjust for one parameter per gene: for example, gene length in RNA-seq, which is known to influence the probability of a gene being called differentially expressed. Any other parameter that can introduce bias (i.e. a cause of differential representation) can be used instead. For example, I have used it to correct for different numbers of probes per gene in microarray analysis.

ADD COMMENT • link 5.0 years ago by Papyrus ★ 3.1k

Thanks! But I don't have expression data for these genes. I just extracted the mutated genes from the VCF of my cohort and generated a table with the information shown.

ADD REPLY • link 5.0 years ago by jeni ▴ 90

No, you do not need expression data. The only required inputs are a list of genes, a background of analyzed genes, and the per-gene parameter that may bias each gene. The package then runs enrichment tests to see whether any pathways are overrepresented in your list of genes, while taking into account that some of your genes may appear more often than expected by chance because of the bias parameter (the analysis is otherwise like a typical gene ontology enrichment).
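To make the "list of genes plus background" idea concrete, here is a minimal sketch of a plain over-representation test in Python. It is not goseq and it applies no bias correction; it only shows the basic hypergeometric test that this kind of enrichment builds on. The function name `hypergeom_enrichment_p`, the gene names, and the pathways are all made up for illustration.

```python
from math import comb


def hypergeom_enrichment_p(gene_list, background, pathway):
    """One-sided over-representation p-value, P(overlap >= observed).

    Plain hypergeometric test: every background gene is assumed equally
    likely to be in the gene list (no per-gene bias correction).
    """
    background = set(background)
    genes = set(gene_list) & background   # genes of interest, restricted to the background
    members = set(pathway) & background   # pathway genes present in the background
    N, K, n = len(background), len(members), len(genes)
    k = len(genes & members)              # observed overlap
    # Sum the hypergeometric tail from the observed overlap upward.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


# Toy, made-up data: 20 analysed genes, 5 of them mutated.
background = [f"G{i}" for i in range(1, 21)]
mutated = ["G1", "G2", "G3", "G4", "G10"]
pathway_a = ["G1", "G2", "G3", "G4"]      # fully covered by the mutated list
pathway_b = ["G15", "G16", "G17", "G18"]  # no mutated genes at all

p_a = hypergeom_enrichment_p(mutated, background, pathway_a)
p_b = hypergeom_enrichment_p(mutated, background, pathway_b)
print(f"pathway_a p = {p_a:.4g}, pathway_b p = {p_b:.4g}")
```

The choice of background matters here: the test only makes sense if the background contains every gene that could have ended up in the list.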

ADD REPLY • link 5.0 years ago by Papyrus ★ 3.1k

Okay, but I don't quite understand what I should use as background. The variant calling was performed on whole-genome sequencing data, so I suppose my background would be all human genes. Moreover, the bias parameter shouldn't be the number of samples in which each gene is mutated; if I understand correctly, I could use gene size as the bias. What I want is for the program to take into account that a gene mutated in more samples should weigh more than a gene mutated in just one.

ADD REPLY • link 5.0 years ago by jeni ▴ 90

OK, now I understand your goal. I'm not sure this is addressed in general by the different types of pathway enrichment analysis. There are gene set enrichment approaches that take a ranked list of genes as input, where the ranking is usually a fold change, p-value, etc. For example, GOrilla is a common web tool that suggests using expression levels to pre-rank genes. Maybe you could use your "number of samples per gene" as the ranking. However, I think it would help to know more about how you obtained the list of genes (whether they have associated p-values, etc., because parameters like these would be important), so personally I cannot suggest a specific approach.
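One way to picture using the "number of samples per gene" as a weight is a simple permutation test on the summed counts of a pathway. The sketch below is only an illustration of that idea in Python; it is not what GOrilla or any other tool mentioned here does, and whether it is statistically appropriate depends on how the gene list was produced. The function name, the `GENE_*` identifiers, and all counts other than TP53 and KRAS are hypothetical. Background genes that are never mutated must be included with a count of 0.

```python
import random


def weighted_set_pvalue(counts, pathway, n_perm=2000, seed=0):
    """Permutation p-value for the summed per-gene sample counts of a pathway.

    counts  : dict mapping every background gene to the number of samples
              in which it is mutated (0 for unmutated genes).
    pathway : iterable of gene names; genes absent from `counts` are ignored.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    genes = list(counts)
    pathway = set(pathway)
    members = [g for g in genes if g in pathway]
    observed = sum(counts[g] for g in members)
    hits = 0
    for _ in range(n_perm):
        # Null model: a random gene set of the same size drawn from the background.
        sample = rng.sample(genes, len(members))
        if sum(counts[g] for g in sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy data: TP53 and KRAS counts come from the table above, the rest is made up.
counts = {"TP53": 3, "KRAS": 5}
counts.update({f"GENE_{i}": 1 for i in range(1, 5)})   # mutated in one sample each
counts.update({f"GENE_{i}": 0 for i in range(5, 19)})  # unmutated background genes

p_hi = weighted_set_pvalue(counts, ["TP53", "KRAS", "GENE_1"])
p_lo = weighted_set_pvalue(counts, ["GENE_10", "GENE_11", "GENE_12"])
print(f"recurrently mutated pathway: p = {p_hi:.4g}")
print(f"unmutated pathway:           p = {p_lo:.4g}")
```

Note that this null model treats all genes as equally likely to be mutated, so long genes would still be favoured; correcting for gene size is a separate question.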

ADD REPLY • link 5.0 years ago by Papyrus ★ 3.1k

Thank you!

What I did was convert the VCF from each sample (annotated with VEP) into a data frame with VariantAnnotation. Then I filtered the variants to keep only the deleterious ones (as classified by SIFT and PolyPhen) and concatenated the data frames of all samples. Finally, I extracted the genes affected by those variants and counted the number of samples in which each gene was mutated, which is the table I showed.
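The final counting step can be sketched as follows. This is a Python illustration of the logic only, not the R/VariantAnnotation code actually used; the sample names, `GENE_X`, and the function name are made up, and the input is assumed to be one list of affected genes per sample after filtering.

```python
from collections import Counter


def count_samples_per_gene(genes_by_sample):
    """Count, for each gene, the number of samples in which it is mutated."""
    counts = Counter()
    for genes in genes_by_sample.values():
        # set() ensures a gene with several variants in one sample counts once.
        counts.update(set(genes))
    return counts


# Hypothetical per-sample gene lists (after the deleterious-variant filter).
genes_by_sample = {
    "patient_1": ["TP53", "KRAS", "KRAS"],  # two KRAS variants in the same sample
    "patient_2": ["KRAS"],
    "patient_3": ["TP53", "GENE_X"],
}

nbsamples = count_samples_per_gene(genes_by_sample)
for gene, n in nbsamples.most_common():
    print(gene, n)
```

Deduplicating within each sample is the design choice that matters: without it the table would count variants per gene rather than samples per gene.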

ADD REPLY • link 5.0 years ago by jeni ▴ 90