But I really don't want to use rpy2 (_1st link_). The second link is probably where I would start, but I first wanted to make sure I'm not reinventing the wheel.
The only comment I would like to express is that you should take into account that the mentioned tools working under R are very well known, have been used in a miriad of experiments, have their sources publicly available so they have been analyzed by many potential users, and because of that, they are the recommended for most of the cases.
You can use PyDESeq2 or the Wilcoxon rank sum test (in scanpy with rank_genes_groups). Wilcoxon yields good results when you have more than 8 replicates per condition.
Here is a tentative to a python implementation of the Wilcoxon test:
import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests
import scanpy as sc
## assuming 'adata' is your AnnData object
# 'condition' is the column in adata.obs that specifies the conditions
pvalues = []
log_fold_changes = []
for gene in adata.var_names:
data = pd.concat([pd.Series(adata[:, gene].X.toarray().flatten()), adata.obs['condition']], axis=1)
data.columns = ['expression', 'condition']
condition_1_expression = data.loc[data['condition'] == data['condition'].unique()[0], 'expression']
condition_2_expression = data.loc[data['condition'] == data['condition'].unique()[1], 'expression']
p = mannwhitneyu(condition_1_expression, condition_2_expression)[1]
log_fold_change = np.log2((condition_2_expression.mean() + 1) / (condition_1_expression.mean() + 1))
pvalues.append(p)
log_fold_changes.append(log_fold_change)
# adjust p-values using FDR
fdr = multipletests(pvalues, method='fdr_bh')[1]
# combine p-values, adjusted p-values and log fold changes into a DataFrame for easy viewing
results = pd.DataFrame({
'gene': adata.var_names,
'p_value': pvalues,
'fdr': fdr,
'log_fold_change': log_fold_changes
})
You can use AnnData to store your RNA-seq data as well (one sample per row).
The only comment I would like to express is that you should take into account that the mentioned tools working under R are very well known, have been used in a miriad of experiments, have their sources publicly available so they have been analyzed by many potential users, and because of that, they are the recommended for most of the cases.
scanpy
usesdiffxpy
to run DE analysis, BUT, keep in mind thatscanpy
is for single-cell RNA-Seq, not bulk.