Question

How to filter genes to construct a co-expression network?

2

7.6 years ago

niutster ▴ 110

Hi, I am interested in filtering data for constructing a co-expression network. Which parameter can I use to filter genes? As far as I know, the WGCNA tutorial suggests not using differentially expressed genes (DEGs) to filter genes.

WGCNA Co-expression network DEG Filtering • 12k views

ADD COMMENT • link updated 3.7 years ago by WouterDeCoster 48k • written 7.6 years ago by niutster ▴ 110

score 5 · Answer 1 · 2018-01-01

5

7.6 years ago

Kevin Blighe 89k

The data should just be any normalised dataset that has undergone the standard QC filtering and data processing for things like background noise (microarray), low-count transcripts, etc. As WGCNA is fundamentally based on correlation, the data does not necessarily have to be logged or on the Z-scale. Any normalised data is fine, provided that all samples are processed in the same way.

The WGCNA authors advise against filtering by differential expression because WGCNA was designed as an unsupervised clustering procedure.

For other network methods, you'd have to check what respective data inputs are required.

Kevin

ADD COMMENT • link 7.6 years ago by Kevin Blighe 89k

1

Hi Kevin, what do you think about filtering out genes with a low variance of expression, e.g. keeping only the top 50% most variable genes?

ADD REPLY • link 7.6 years ago by WouterDeCoster 48k

0

That's also a great idea that I had not thought of.

ADD REPLY • link 6.0 years ago by Kevin Blighe 89k

0

Thanks. Could you explain more about filtering based on low variance? How can I do it?

ADD REPLY • link 7.6 years ago by niutster ▴ 110

6

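In R, assuming your gene expression data is a numeric data frame called data, with genes in rows (the $ syntax below does not work on a plain matrix):

data$variance <- apply(data, 1, var)  # per-gene variance across samples
data2 <- data[data$variance >= quantile(data$variance, 0.50), ]  # keep the 50% most variable genes
data2$variance <- NULL  # drop the helper column

Essentially, this code creates a "variance" column, keeps the genes whose variance is at or above the median (the top 50%), and then removes that column. I don't know if you are using a (genes * samples) or (samples * genes) layout. The code assumes genes in rows; for (samples * genes) data, the variance has to be computed over columns (apply(data, 2, var)) and the filter applied to columns instead of rows.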
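If the expression data is stored as a plain numeric matrix rather than a data frame, the same idea can be sketched without a helper column. This is only an illustration under stated assumptions: `expr` is a hypothetical genes-by-samples matrix with gene identifiers as row names, filled here with simulated values.

```r
# Hypothetical example data: 6 genes x 4 samples, gene IDs as row names
set.seed(1)
expr <- matrix(rnorm(24), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4)))

# Per-gene variance across samples (rows = genes)
gene_var <- apply(expr, 1, var)

# Keep genes whose variance is at or above the median (top 50%)
keep <- gene_var >= quantile(gene_var, 0.50)
expr_filtered <- expr[keep, , drop = FALSE]

# Row names survive the subsetting, so the retained genes stay identifiable
rownames(expr_filtered)
```

Keeping the variances in a separate vector means the expression matrix itself is never modified, and `drop = FALSE` keeps the result a matrix even if only one gene passes.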

ADD REPLY • link 7.6 years ago by WouterDeCoster 48k

1

The OP can also use the varFilter function from the genefilter package in R.

ADD REPLY • link 7.6 years ago by cpad0112 21k

2

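You can use the following code to filter out the 50% least variable genes (exp is assumed to be an ExpressionSet; by default varFilter uses the IQR with a quantile cutoff of 0.5):

library(genefilter)
genes <- varFilter(exp)

or this code, for example, to keep only the top 20% of genes:

genes <- varFilter(exp, var.func=IQR, var.cutoff=0.8, filterByQuantile=TRUE)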

ADD REPLY • link 7.6 years ago by mannoulag1 ▴ 130

0

Dear WouterDeCoster,

Thanks for your comment. I found another filtering strategy in an article whose authors selected genes that were present in at least 50% of samples. In other words, I would keep only the genes that are present in at least 50% of samples.

Could you please share your opinion on that strategy and help me write R code for that filtering?

Best Regards,

ADD REPLY • link 6.7 years ago by modarzi ▴ 170
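One way to sketch the "present in at least 50% of samples" filter in R is shown below. It rests on assumptions not stated in the thread: `counts` is a hypothetical genes-by-samples matrix, and a gene is treated as "present" in a sample when its value exceeds a threshold (`min_expr`, set to 0 here). The article in question may define presence differently.

```r
# Hypothetical count matrix: 4 genes x 4 samples
counts <- matrix(c(5, 0, 3, 8,
                   0, 0, 0, 2,
                   1, 4, 0, 0,
                   0, 0, 0, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("sample", 1:4)))

min_expr <- 0    # assumed presence threshold
min_frac <- 0.5  # gene must be present in at least 50% of samples

# Fraction of samples in which each gene is above the threshold
frac_present <- rowMeans(counts > min_expr)

counts_filtered <- counts[frac_present >= min_frac, , drop = FALSE]
rownames(counts_filtered)
```

In this toy matrix, gene1 (present in 3 of 4 samples) and gene3 (2 of 4) are kept, while gene2 and gene4 are dropped.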

0

Hi,

Although it's late, I'll be glad to hear from you. Here, for gene filtering, instead of using variance (and keeping the 50% most variable genes), is it better to use the coefficient of variation (CV), as it takes the mean of the data into account?

ADD REPLY • link 4.0 years ago by seta ★ 1.9k
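For comparison, a coefficient-of-variation filter could be sketched as follows. CV is the standard deviation divided by the mean, so it is only meaningful for non-negative, un-logged values with a non-zero mean. The matrix `expr` and its values are hypothetical.

```r
# Hypothetical un-logged expression matrix: 4 genes x 3 samples
expr <- matrix(c(100, 110,  90,
                   1,   5,   9,
                  50,  50,  50,
                  10,  20,  30),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

# Coefficient of variation per gene: standard deviation / mean
gene_cv <- apply(expr, 1, sd) / rowMeans(expr)

# Keep genes at or above the median CV (top 50%)
expr_cv <- expr[gene_cv >= quantile(gene_cv, 0.50), , drop = FALSE]
rownames(expr_cv)
```

In this toy example the CV filter keeps gene2 and gene4, whereas a plain variance filter with the same cutoff would keep gene1 and gene4, because variance tends to favour highly expressed genes. Which behaviour is preferable depends on the dataset.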

0

Dear Wouter,

Thank you for your code. I find it very helpful as I am also working with WGCNA.

From your code below:

data$variance <- apply(data, 1, var)
data2 <- data[data$variance >= quantile(data$variance, c(.50)), ] #50% most variable genes
data2$variance <- NULL

Does your code work with genes in rows and sample IDs in columns?

I have a normalized count matrix with genes in rows and sample IDs in columns. However, I had to remove the gene ID column for your code to work. As a result, I no longer know which genes are left in the count matrix after keeping only the 50% most variable genes.

Do you have any way to tackle this issue? Thanks in advance.

Regards,

synat

ADD REPLY • link 3.7 years ago by synat.keam ▴ 120

0

I think, but it is a long time ago, that my gene identifiers were the row names of the data object.

ADD REPLY • link 3.7 years ago by WouterDeCoster 48k
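A sketch of that row-name approach, assuming the normalized counts sit in a data frame with the gene identifiers in a column hypothetically named `gene_id`: moving the identifiers into the row names leaves only numeric columns for the variance calculation while keeping each gene labelled after filtering.

```r
# Hypothetical normalized counts with gene IDs in a column
df <- data.frame(gene_id = c("A", "B", "C", "D"),
                 s1 = c(10, 5, 7, 1),
                 s2 = c(12, 50, 7, 5),
                 s3 = c(11, 20, 7, 3))

# Move the identifiers into the row names and keep only the numeric columns
expr <- as.matrix(df[, -1])
rownames(expr) <- df$gene_id

# Filter on variance; row names are carried along with the rows
gene_var <- apply(expr, 1, var)
expr_top <- expr[gene_var >= quantile(gene_var, 0.50), , drop = FALSE]
rownames(expr_top)
```

The row names of `expr_top` then list exactly which genes passed the filter.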

0

Data normalization and pre-processing have already been performed; I just want to filter the data to reduce its volume.

ADD REPLY • link 7.6 years ago by niutster ▴ 110

0

Okay. WGCNA can handle large datasets via the blockwiseModules function.

Alternatively, one thing you could do is filter your genes based on a specific pathway (like 'DNA repair' genes, 'Wnt signalling', etc). Obviously then it's no longer entirely unbiased.

ADD REPLY • link 7.6 years ago by Kevin Blighe 89k

0

Is it good to use DESeq2 normalized counts for WGCNA, i.e. the counts obtained by counts(dds, normalized=TRUE)?

ADD REPLY • link 7.0 years ago by Arindam Ghosh ▴ 550

2

As mentioned, and according to the WGCNA authors, un-logged or logged data is fine - the most important thing is that it's all processed in the same way. However, I don't know how they did their validations, because results will differ between logged and un-logged normalised counts.

Why not try the counts from counts(dds, normalized=TRUE) and also those from the regularised log function (rlog) of DESeq2?

ADD REPLY • link 7.0 years ago by Kevin Blighe 89k