Question

Tutorial: Bulk RNA-seq: Differential Expression Analysis


13 months ago

Julia Ma ▴ 120

Content taken verbatim from: https://omicverse.readthedocs.io/en/latest/Tutorials-bulk/t_deg/

An important task in bulk RNA-seq analysis is differential expression analysis, which we can perform with omicverse. For differential expression analysis, ov first converts the gene_id index of the matrix to gene_name. When the dataset has a batch effect, we can use the size factors of DESeq2 to normalize it, and then use a t-test or a Wilcoxon test to calculate the p-value of each gene. Here we demonstrate this pipeline with a count matrix from featureCounts. The same pipeline can generally be used to analyze any collection of RNA-seq samples.

Colab_Reproducibility: https://colab.research.google.com/drive/1q5lDfJepbtvNtc1TKz-h4wGUifTZ3i0_?usp=sharing

import omicverse as ov
import pandas as pd
import numpy as np
import scanpy as sc
import matplotlib.pyplot as plt
import seaborn as sns

sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80, facecolor='white')

Geneset Download

To convert gene IDs, we first need a mapping pair file. Here we have pre-processed six genome GTF files and generated mapping pairs for T2T-CHM13, GRCh38, GRCh37, GRCm39, danRer7, and danRer11. If you need a mapping for another genome, you can generate your own from its GTF file. Place the files in the genesets directory.

ov.utils.download_geneid_annotation_pair()

......Geneid Annotation Pair download start: pair_GRCm39
......Loading dataset from genesets/pair_GRCm39.tsv
......Geneid Annotation Pair download start: pair_T2TCHM13
......Loading dataset from genesets/pair_T2TCHM13.tsv
......Geneid Annotation Pair download start: pair_GRCh38
......Loading dataset from genesets/pair_GRCh38.tsv
......Geneid Annotation Pair download start: pair_GRCh37
......Loading dataset from genesets/pair_GRCh37.tsv
......Geneid Annotation Pair download start: pair_danRer11
......Loading dataset from genesets/pair_danRer11.tsv
......Geneid Annotation Pair download start: pair_danRer7
......Loading dataset from genesets/pair_danRer7.tsv
......Geneid Annotation Pair download finished!
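For genomes not in the list above, a mapping can be built from a GTF file. The following is a minimal sketch, assuming standard GTF attributes (`gene_id "..."; gene_name "...";`). The helper name `gtf_gene_pairs` and the example records are hypothetical, and the exact column layout that omicverse expects in a pair file is not shown here, so compare against one of the downloaded `pair_*.tsv` files before writing your own.

```python
import re

def gtf_gene_pairs(gtf_lines):
    """Collect gene_id -> gene_name pairs from the 'gene' records of a GTF file.

    Hypothetical helper for illustration; not part of omicverse.
    """
    pairs = {}
    for line in gtf_lines:
        if line.startswith('#'):          # skip header/comment lines
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 9 or fields[2] != 'gene':
            continue                      # keep only gene-level records
        gene_id = re.search(r'gene_id "([^"]+)"', fields[8])
        gene_name = re.search(r'gene_name "([^"]+)"', fields[8])
        if gene_id and gene_name:
            pairs[gene_id.group(1)] = gene_name.group(1)
    return pairs

# Made-up GTF records, only to show the expected input shape
example_gtf = [
    '#!example header\n',
    'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "GENE0001"; gene_name "Abc1";\n',
    'chr1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "GENE0001"; gene_name "Abc1";\n',
    'chr2\tsrc\tgene\t50\t400\t.\t-\t.\tgene_id "GENE0002"; gene_name "Xyz2";\n',
]
pairs = gtf_gene_pairs(example_gtf)
for gene_id, gene_name in pairs.items():
    print(gene_id, gene_name, sep='\t')
```

With a real file, the same function can be called on an open file handle, for example `gtf_gene_pairs(open('my_genome.gtf'))`, where the path is a placeholder.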

Note that this dataset has not been processed in any way; it is exported directly by featureCounts, and sequence alignment was performed against the GRCm39 genome.

data=pd.read_csv('https://raw.githubusercontent.com/Starlitnightly/omicverse/master/sample/counts.txt',index_col=0,sep='\t',header=1)
# keep only the file name of each column and strip the `.bam` suffix
data.columns=[i.split('/')[-1].replace('.bam','') for i in data.columns]
data.head()

Geneid	1--1	1--2	2--1	2--2	3--1	3--2	4--1	4--2	4-3	4-4	Blank-1	Blank-2
ENSMUSG00000102628	0	0	0	0	5	0	0	0	0	0	0	9
ENSMUSG00000100595	0	0	0	0	0	0	0	0	0	0	0	0
ENSMUSG00000097426	5	0	0	0	0	0	0	1	0	0	0	0
ENSMUSG00000104478	0	0	0	0	0	0	0	0	0	0	0	0
ENSMUSG00000104385	0	0	0	0	0	0	0	0	0	0	0	0

ID mapping

We map each gene_id to its gene_name using the GRCm39 mapping pair file downloaded above.

data=ov.bulk.Matrix_ID_mapping(data,'genesets/pair_GRCm39.tsv')
data.head()

	1--1	1--2	2--1	2--2	3--1	3--2	4--1	4--2	4-3	4-4	Blank-1	Blank-2
U1	0	0	0	0	0	0	0	0	0	0	0	0
Gm36814	0	0	0	0	0	0	0	0	0	0	0	0
1700030A11Rik	0	0	0	0	0	0	0	0	0	0	0	0
Gm3667	0	0	0	1	0	1	0	0	3	0	12	1
Gm9045	10	6	5	0	4	11	7	8	13	7	2	0
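To make the idea of ID mapping concrete, here is a small pandas sketch of replacing an ID index with gene names. It is not the implementation of `Matrix_ID_mapping`; the IDs, names, and the choice to drop unmapped rows are assumptions made for illustration.

```python
import pandas as pd

# Toy count matrix indexed by made-up gene IDs
counts = pd.DataFrame(
    {'s1': [5, 0, 3], 's2': [2, 1, 0]},
    index=['ID_A', 'ID_B', 'ID_C'],
)

# Toy mapping; ID_C deliberately has no gene name
id_to_name = {'ID_A': 'GeneA', 'ID_B': 'GeneB'}

# Assumption: rows without a mapping are dropped
mapped = counts.loc[counts.index.isin(list(id_to_name))].copy()
mapped.index = mapped.index.map(id_to_name)
print(mapped)
```

Several IDs can map to the same gene name, which is why duplicate index entries can appear after this step.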

Differential expression analysis with ov

Differential expression analysis with ov only requires an expression matrix. To run a DEG analysis, we need to:

Read the raw counts produced by featureCounts or any other quantification method.
Create an ov pyDEG object.

dds=ov.bulk.pyDEG(data)

Note that the gene_name mapping above produces some duplicate index entries. We process the duplicates so that only the most highly expressed gene is retained for each name.

dds.drop_duplicates_index()
print('... drop_duplicates_index success')

... drop_duplicates_index success
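The deduplication step can be pictured with the pandas sketch below. It assumes that "highest expressed" means the largest total count across samples; how `drop_duplicates_index` ranks duplicates internally is not stated here, and the data are made up.

```python
import numpy as np
import pandas as pd

# Toy matrix in which 'GeneA' appears twice after ID mapping
counts = pd.DataFrame(
    {'s1': [1, 10, 4], 's2': [0, 12, 6]},
    index=['GeneA', 'GeneA', 'GeneB'],
)

# Rank rows by total count (assumed criterion), highest first
totals = counts.sum(axis=1).to_numpy()
ranked = counts.iloc[np.argsort(-totals, kind='stable')]

# Keep the first (most highly expressed) row for each gene name
deduped = ranked[~ranked.index.duplicated(keep='first')]
print(deduped)
```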

We also need to remove the batch effect from the expression matrix. The estimateSizeFactors method of DESeq2 is used to normalize our matrix.

dds.normalize()
print('... estimateSizeFactors and normalize success')

... estimateSizeFactors and normalize success
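As a rough illustration of what size-factor normalization does, the sketch below implements the median-of-ratios idea behind DESeq2's estimateSizeFactors with numpy and pandas. It is a simplified sketch on made-up counts, not the code that `dds.normalize()` runs; the function name `size_factors` is hypothetical.

```python
import numpy as np
import pandas as pd

def size_factors(counts):
    """Median-of-ratios size factors (simplified sketch)."""
    # Use only genes with a non-zero count in every sample
    expressed = counts[(counts > 0).all(axis=1)]
    log_counts = np.log(expressed)
    # Log geometric mean of each gene across samples
    log_geo_mean = log_counts.mean(axis=1)
    # Per-sample median of the log ratios to the geometric mean
    log_ratios = log_counts.sub(log_geo_mean, axis=0)
    return np.exp(log_ratios.median(axis=0))

# Toy counts: sample s2 was sequenced twice as deeply as s1
counts = pd.DataFrame(
    {'s1': [10, 20, 30, 0], 's2': [20, 40, 60, 5]},
    index=['g1', 'g2', 'g3', 'g4'],
)
sf = size_factors(counts)
normalized = counts.div(sf, axis=1)
print(sf)
print(normalized)
```

After dividing by the size factors, genes g1 to g3 have equal values in both samples, which is the intended effect when the only difference between samples is sequencing depth.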

Now we can calculate the differentially expressed genes from the matrix. We need to specify the treatment and control groups.

treatment_groups=['4-3','4-4']
control_groups=['1--1','1--2']
result=dds.deg_analysis(treatment_groups,control_groups,method='ttest')
result.head()

	pvalue	qvalue	FoldChange	-log(pvalue)	-log(qvalue)	BaseMean	log2(BaseMean)	log2FC	abs(log2FC)	size	sig
U1	NaN	0.000000	1.000000	NaN	inf	0.000000	-inf	0.000000	0.000000	0.100000	sig
Gm36814	NaN	0.000000	1.000000	NaN	inf	0.000000	-inf	0.000000	0.000000	0.100000	sig
1700030A11Rik	NaN	0.000000	1.000000	NaN	inf	0.000000	-inf	0.000000	0.000000	0.100000	sig
Gm3667	0.422650	0.516157	7.290929	0.374019	0.287218	1.486841	-0.427749	2.866103	2.866103	0.729093	normal
Gm9045	0.594931	0.674280	1.281838	0.225533	0.171160	17.658883	3.142322	0.358214	0.358214	0.128184	normal
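Two of the columns above can be reproduced by hand: log2FC is the base-2 logarithm of FoldChange, and a q-value is a p-value adjusted for multiple testing. The sketch below shows both on made-up numbers, using the Benjamini-Hochberg procedure as one common adjustment. Which adjustment omicverse applies, and how it treats zero means or NaN p-values, is not stated here, so treat this as an assumption.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumes no NaN values)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(adjusted, 0.0, 1.0)
    return q

# Made-up group means for one gene
treatment_mean, control_mean = 12.0, 3.0
fold_change = treatment_mean / control_mean
log2fc = np.log2(fold_change)

# Made-up p-values for four genes
pvals = [0.01, 0.04, 0.03, 0.5]
qvals = bh_adjust(pvals)
print(log2fc, qvals)
```

Rows with NaN p-values, such as the all-zero genes in the table above, need separate handling before any adjustment; filtering low-expression genes is one way to avoid them.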

One important point is that low-expression genes are not filtered out during DEG processing, so we filter them ourselves here. In future versions I will consider building in the corresponding processing.

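print(dds.result.shape)
# filter dds.result itself so that the later steps, which read dds.result, use the filtered table
dds.result=dds.result.loc[dds.result['log2(BaseMean)']>1]
print(dds.result.shape)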

(54504, 11)
(21271, 11)

We also need to set the fold-change threshold, which is done with the foldchange_set method. This function automatically calculates an appropriate threshold based on the log2FC distribution, but you can also set it manually.

# fc_threshold=-1 means the threshold is calculated automatically
dds.foldchange_set(fc_threshold=-1,
                   pval_threshold=0.05,
                   logp_max=6)

... Fold change threshold: 1.5699872300836342
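Once the thresholds are fixed, each gene can be labelled as up-regulated, down-regulated, or unchanged. The sketch below shows one way to do this with pandas on made-up values. It assumes the fold-change threshold is applied to log2FC and the p-value threshold to the q-value; whether `foldchange_set` does exactly this is not stated here, and the column name `label` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up DEG table
result = pd.DataFrame(
    {'log2FC': [2.5, -3.0, 0.4, 2.0],
     'qvalue': [0.001, 0.01, 0.001, 0.2]},
    index=['GeneA', 'GeneB', 'GeneC', 'GeneD'],
)

fc_threshold = 1.5     # assumed to be on the log2 scale
pval_threshold = 0.05

significant = result['qvalue'] < pval_threshold
result['label'] = np.select(
    [significant & (result['log2FC'] > fc_threshold),
     significant & (result['log2FC'] < -fc_threshold)],
    ['up', 'down'],
    default='normal',
)
print(result)
```

GeneC passes the significance cutoff but not the fold-change cutoff, and GeneD the reverse, so both stay 'normal'.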

Visualize the DEG result and specific genes

To visualize the DEG result, we use plot_volcano. This function can label genes of interest or the most differentially expressed genes. The main parameters are:

title: The title of the volcano plot
figsize: The size of the figure
plot_genes: The genes you are interested in
plot_genes_num: If you have no specific genes of interest, the number of genes to label automatically

dds.plot_volcano(title='DEG Analysis',figsize=(4,4),
                 plot_genes_num=8,plot_genes_fontsize=12,)

<AxesSubplot: title={'center': 'DEG Analysis'}, xlabel='$log_{2}FC$', ylabel='$-log_{10}(qvalue)$'>

[Figure: volcano plot "DEG Analysis", log2FC against -log10(qvalue)]

To visualize specific genes, we only need the dds.plot_boxplot function.

dds.plot_boxplot(genes=['Ckap2','Lef1'],treatment_groups=treatment_groups,
                 control_groups=control_groups,figsize=(2,3),fontsize=12,
                 legend_bbox=(2,0.55))



(<Figure size 160x240 with 1 Axes>,
 <AxesSubplot: title={'center': 'Gene Expression'}>)

[Figure: box plot "Gene Expression" for Ckap2 and Lef1, treatment against control]

dds.plot_boxplot(genes=['Ckap2'],treatment_groups=treatment_groups,
                 control_groups=control_groups,figsize=(2,3),fontsize=12,
                 legend_bbox=(2,0.55))

(<Figure size 160x240 with 1 Axes>,
 <AxesSubplot: title={'center': 'Gene Expression'}>)

[Figure: box plot "Gene Expression" for Ckap2, treatment against control]

Pathway enrichment analysis by ov

Here we use the gseapy package, which includes GSEA and enrichment analysis. We have optimised the output of the package and added some better-looking plotting functions.

Similarly, we need to download the pathway gene sets first. We have prepared six gene sets in advance, which you can download automatically with ov.utils.download_pathway_database(). You can also download any pathway library you are interested in from Enrichr: https://maayanlab.cloud/Enrichr/#libraries

ov.utils.download_pathway_database()

......Pathway Geneset download start: GO_Biological_Process_2021
......Loading dataset from genesets/GO_Biological_Process_2021.txt
......Pathway Geneset download start: GO_Cellular_Component_2021
......Loading dataset from genesets/GO_Cellular_Component_2021.txt
......Pathway Geneset download start: GO_Molecular_Function_2021
......Loading dataset from genesets/GO_Molecular_Function_2021.txt
......Pathway Geneset download start: WikiPathway_2021_Human
......Loading dataset from genesets/WikiPathway_2021_Human.txt
......Pathway Geneset download start: WikiPathways_2019_Mouse
......Loading dataset from genesets/WikiPathways_2019_Mouse.txt
......Pathway Geneset download start: Reactome_2022
......Loading dataset from genesets/Reactome_2022.txt
......Pathway Geneset download finished!

pathway_dict=ov.utils.geneset_prepare('genesets/WikiPathways_2019_Mouse.txt',organism='Mouse')
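For orientation, a gene set library is essentially a dictionary from pathway name to a list of genes. The sketch below parses lines in the tab-separated layout used by Enrichr library downloads, assuming each line is a term, an empty description field, then the genes. The helper `parse_geneset_lines` and the example lines are hypothetical, and `geneset_prepare` may do more than this (for example, organism-specific handling of gene symbols).

```python
def parse_geneset_lines(lines):
    """Parse 'term<TAB><TAB>gene1<TAB>gene2...' lines into a dict.

    Hypothetical helper; assumes the description field after the term is empty.
    """
    genesets = {}
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        term = fields[0]
        genes = [g for g in fields[1:] if g]   # drop empty fields
        if term and genes:
            genesets[term] = genes
    return genesets

# Made-up library lines, only to show the layout
example_lines = [
    'Pathway A\t\tGENE1\tGENE2\tGENE3\n',
    'Pathway B\t\tGENE2\tGENE4\n',
]
genesets = parse_geneset_lines(example_lines)
print(genesets)
```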

Note that we set pvalue_type to auto. This is because when the gene sets being tested are too small, the adjusted p-value may not give a usable result. You can also set it to adjust or raw to select the significant gene sets.

deg_genes=dds.result.loc[dds.result['sig']!='normal'].index.tolist()
enr=ov.bulk.geneset_enrichment(gene_list=deg_genes,
                                pathways_dict=pathway_dict,
                                pvalue_type='auto',
                                organism='mouse')
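Over-representation of a pathway in a gene list is commonly tested with a hypergeometric test. The sketch below computes that p-value with the Python standard library on made-up gene sets. It is shown only to illustrate the idea behind an enrichment p-value; the exact test and background used by `geneset_enrichment` are not described here, and `hypergeom_pvalue` is a hypothetical helper.

```python
from math import comb

def hypergeom_pvalue(n_background, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    from n_background genes, of which n_pathway belong to the pathway."""
    total = comb(n_background, n_selected)
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Made-up example: 20 background genes, a 5-gene pathway, 4 DEGs
background = {f'G{i}' for i in range(20)}
pathway = {'G0', 'G1', 'G2', 'G3', 'G4'}
deg_list = {'G0', 'G1', 'G2', 'G10'}

overlap = len(pathway & deg_list)
pvalue = hypergeom_pvalue(len(background), len(pathway), len(deg_list), overlap)
print(overlap, pvalue)
```

When many pathways are tested, each with such a p-value, the raw p-values are usually adjusted for multiple testing, which is what the choice between raw and adjusted p-values refers to.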

To visualize the enrichment results, we use geneset_plot.

ov.bulk.geneset_plot(enr,figsize=(2,5),fig_title='Wiki Pathway enrichment',
                     cmap='Reds')



<AxesSubplot: title={'center': 'Wiki Pathway enrichment'}, xlabel='Fractions of genes'>

[Figure: "Wiki Pathway enrichment" plot, x-axis: fractions of genes]

differential-expression RNA-seq • 868 views
