Question

Differential gene expression analysis in Python

2

Entering edit mode

3.5 years ago

Leendert ▴ 40

It seems that most differential gene expression packages for RNA-Seq are written in R.

Examples include:

edgeR
limma
DESeq

Are any similar (and easy to use) packages available for Python, or have any of the R packages been ported?

The best I could find was:

But I really don't want to use rpy2 (_1st link_). The second link is probably where I would start, but I first wanted to make sure I'm not reinventing the wheel.

SIDE NOTE:

This question was asked on Stackoverflow (https://stackoverflow.com/questions/36305682/differential-gene-expression-analysis-in-python), but closed because your not allowed to ask for suggestions on tools (please explain all the upvotes then), but in any case, I thought I'd bring it over to this forum for suggestions.

python R • 12k views

ADD COMMENT • link updated 21 months ago by Radu Tanasa ▴ 150 • written 3.5 years ago by Leendert ▴ 40

5

Entering edit mode

The only comment I would like to express is that you should take into account that the mentioned tools working under R are very well known, have been used in a miriad of experiments, have their sources publicly available so they have been analyzed by many potential users, and because of that, they are the recommended for most of the cases.

ADD REPLY • link 3.5 years ago by Antonio R. Franco ★ 5.2k

3

Entering edit mode

scanpy uses diffxpy to run DE analysis, BUT, keep in mind that scanpy is for single-cell RNA-Seq, not bulk.

ADD REPLY • link 3.5 years ago by fracarb8 ★ 1.7k

score 2 · Accepted Answer · 2023-07-20

You can use PyDESeq2 or the Wilcoxon rank sum test (in scanpy with rank_genes_groups). Wilcoxon yields good results when you have more than 8 replicates per condition.

Here is a tentative to a python implementation of the Wilcoxon test:

import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests
import scanpy as sc

## assuming 'adata' is your AnnData object
# 'condition' is the column in adata.obs that specifies the conditions
pvalues = []
log_fold_changes = []
for gene in adata.var_names:
    data = pd.concat([pd.Series(adata[:, gene].X.toarray().flatten()), adata.obs['condition']], axis=1)
    data.columns = ['expression', 'condition']

    condition_1_expression = data.loc[data['condition'] == data['condition'].unique()[0], 'expression']
    condition_2_expression = data.loc[data['condition'] == data['condition'].unique()[1], 'expression']

    p = mannwhitneyu(condition_1_expression, condition_2_expression)[1]
    log_fold_change = np.log2((condition_2_expression.mean() + 1) / (condition_1_expression.mean() + 1))

    pvalues.append(p)
    log_fold_changes.append(log_fold_change)

# adjust p-values using FDR
fdr = multipletests(pvalues, method='fdr_bh')[1]

# combine p-values, adjusted p-values and log fold changes into a DataFrame for easy viewing
results = pd.DataFrame({
    'gene': adata.var_names,
    'p_value': pvalues,
    'fdr': fdr,
    'log_fold_change': log_fold_changes
})

You can use AnnData to store your RNA-seq data as well (one sample per row).