How to select genes before log2 ratio on a RNASeq gene expression matrix, based on signal median
1
2
Entering edit mode
9.9 years ago
fbrundu ▴ 350

I want to transform a TCGA mRNA expression matrix (in linear data format) to log2-ratios and then run a feature (gene) selection, selecting the 1000 most variant genes (genes with higher standard deviation across samples). The workflow is the following:

  1. Select "good" genes before log2ratio (genes each with median signal at least t in p% of samples);
  2. On selected genes, run log2ratio, dividing each gene by its median signal and then log2-transforming the result matrix;
  3. Select the 1000 most variant genes along all samples.

How do I select t and p?

log-ratio expression RNA-Seq feature-selection • 4.9k views
ADD COMMENT
0
Entering edit mode

Hello fbrundu!

It appears that your post has been cross-posted to another site: http://biology.stackexchange.com/questions/29878/how-to-select-genes-before-log2-ratio-on-a-rnaseq-gene-expression-matrix-based

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY
0
Entering edit mode

Hi Devon,

Yes, I am sorry that I annoyed you. But since they are different communities (afaik they are also run by different organizations) and they address slightly different topics, I did not know which was the correct place to post this question to. I think that my question is semantically correct for both communities (even if they can address different types of users), which may have an users intersection (I did not post elsewhere).

Sorry about that.

Francesco

ADD REPLY
3
Entering edit mode
9.9 years ago

There is not a general solution to select "t" and "p". Such choices are largely arbitrary. Furthermore, for an array platform, if one assumes that "t" has something to do with "expressed", the value for "t" will differ for each probe on the array.

Since you are ultimately going to filter based on variance, I'd suggest starting with your median-centered, log-transformed data and simply choose the top 1000 most variable genes.

ADD COMMENT
1
Entering edit mode

The data that OP is referring to is RNAseq; so no probes. Sequencing bias correction can be done for them.

ADD REPLY
0
Entering edit mode

Thanks Sean. Regarding your last suggestion, I thought that it could introduce problems, since genes with very low median signal can show a high variance when logratio transformed. What do you think?

Thanks

ADD REPLY
0
Entering edit mode

Well log transformation will bring down the variance. Imagine 4 samples [0.5, 2, 8, 32]. Without log transformation the variance is 213.5625 but when you log2 transform the data then the variance reduces to 6.67

In any case if the expression is consistently low then the variance will be low. You should be careful about log transformations especially when doing differential expression studies. I would suggest that you do the log transformation after selecting for median and variance.

ADD REPLY
0
Entering edit mode

I followed your first advice using custom thresholds.

Hoping this is useful, the code of the pipeline is available at

import pandas as pd
import numpy as np
import sys
import random as rnd
csv = sys.argv[1]
out = sys.argv[2]
df = pd.read_table(csv, sep='\t', index_col=0)
# log2
df = np.log2(df)
# remove genes with more than 5% of samples below the 5th percentile
thres = np.percentile(df, 5)
quantile = df.quantile(0.05, axis=1)
genes = quantile[quantile > thres].index.tolist()
df = df.ix[genes, :]
# ratio
median = df.median(axis='columns')
tcga = df.sub(median, axis='index')
# feature selection, most variant genes
Kgenes = 1000
std = df.std(axis='columns')
rank = std.rank(ascending=False)
genes = rank[rank < Kgenes].index.tolist()
df = df.ix[genes, :]
# random sampling 100 samples
Ksamples = 100
samplesix = rnd.sample(range(len(newcolumns)), Ksamples)
samples = [newcolumns[i] for i in samplesix]
df = df.ix[:, samples]
# output
df.to_csv(out, sep='\t')
view raw l2r_fsel_srs.py hosted with ❤ by GitHub

I filtered out the genes which were below the overall 5th percentile in more than the 5% of the samples. I think it could be a reasonable threshold. In case tell me.

Thanks

ADD REPLY

Login before adding your answer.

Traffic: 3365 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6