Differential gene expression analysis in Python
1
2
Entering edit mode
3.2 years ago
Leendert ▴ 40

It seems that most differential gene expression packages for RNA-Seq are written in R.

Examples include:

  • edgeR
  • limma
  • DESeq

Are any similar (and easy to use) packages available for Python, or have any of the R packages been ported?

The best I could find was:

But I really don't want to use rpy2 (_1st link_). The second link is probably where I would start, but I first wanted to make sure I'm not reinventing the wheel.

SIDE NOTE:

This question was asked on Stackoverflow (https://stackoverflow.com/questions/36305682/differential-gene-expression-analysis-in-python), but closed because your not allowed to ask for suggestions on tools (please explain all the upvotes then), but in any case, I thought I'd bring it over to this forum for suggestions.

python R • 11k views
ADD COMMENT
5
Entering edit mode

The only comment I would like to express is that you should take into account that the mentioned tools working under R are very well known, have been used in a miriad of experiments, have their sources publicly available so they have been analyzed by many potential users, and because of that, they are the recommended for most of the cases.

ADD REPLY
3
Entering edit mode

scanpy uses diffxpy to run DE analysis, BUT, keep in mind that scanpy is for single-cell RNA-Seq, not bulk.

ADD REPLY
2
Entering edit mode
17 months ago
Radu Tanasa ▴ 140

You can use PyDESeq2 or the Wilcoxon rank sum test (in scanpy with rank_genes_groups). Wilcoxon yields good results when you have more than 8 replicates per condition.

Here is a tentative to a python implementation of the Wilcoxon test:

import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests
import scanpy as sc

## assuming 'adata' is your AnnData object
# 'condition' is the column in adata.obs that specifies the conditions
pvalues = []
log_fold_changes = []
for gene in adata.var_names:
    data = pd.concat([pd.Series(adata[:, gene].X.toarray().flatten()), adata.obs['condition']], axis=1)
    data.columns = ['expression', 'condition']

    condition_1_expression = data.loc[data['condition'] == data['condition'].unique()[0], 'expression']
    condition_2_expression = data.loc[data['condition'] == data['condition'].unique()[1], 'expression']

    p = mannwhitneyu(condition_1_expression, condition_2_expression)[1]
    log_fold_change = np.log2((condition_2_expression.mean() + 1) / (condition_1_expression.mean() + 1))

    pvalues.append(p)
    log_fold_changes.append(log_fold_change)

# adjust p-values using FDR
fdr = multipletests(pvalues, method='fdr_bh')[1]

# combine p-values, adjusted p-values and log fold changes into a DataFrame for easy viewing
results = pd.DataFrame({
    'gene': adata.var_names,
    'p_value': pvalues,
    'fdr': fdr,
    'log_fold_change': log_fold_changes
})

You can use AnnData to store your RNA-seq data as well (one sample per row).

ADD COMMENT

Login before adding your answer.

Traffic: 1773 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6