How to select genes before log2 ratio on a RNASeq gene expression matrix, based on signal median

2

Entering edit mode

9.9 years ago

fbrundu ▴ 350

I want to transform a TCGA mRNA expression matrix (in linear data format) to log2-ratios and then run a feature (gene) selection, selecting the 1000 most variant genes (genes with higher standard deviation across samples). The workflow is the following:

Select "good" genes before log2ratio (genes each with median signal at least t in p% of samples);
On selected genes, run log2ratio, dividing each gene by its median signal and then log2-transforming the result matrix;
Select the 1000 most variant genes along all samples.

How do I select t and p?

log-ratio expression RNA-Seq feature-selection • 4.9k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by fbrundu ▴ 350

0

Entering edit mode

Hello fbrundu!

It appears that your post has been cross-posted to another site: http://biology.stackexchange.com/questions/29878/how-to-select-genes-before-log2-ratio-on-a-rnaseq-gene-expression-matrix-based

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Devon Ryan 105k

0

Entering edit mode

Hi Devon,

Yes, I am sorry that I annoyed you. But since they are different communities (afaik they are also run by different organizations) and they address slightly different topics, I did not know which was the correct place to post this question to. I think that my question is semantically correct for both communities (even if they can address different types of users), which may have an users intersection (I did not post elsewhere).

Sorry about that.

Francesco

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by fbrundu ▴ 350

3

Entering edit mode

9.9 years ago

Sean Davis 27k

There is not a general solution to select "t" and "p". Such choices are largely arbitrary. Furthermore, for an array platform, if one assumes that "t" has something to do with "expressed", the value for "t" will differ for each probe on the array.

Since you are ultimately going to filter based on variance, I'd suggest starting with your median-centered, log-transformed data and simply choose the top 1000 most variable genes.

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Sean Davis 27k

1

Entering edit mode

The data that OP is referring to is RNAseq; so no probes. Sequencing bias correction can be done for them.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Bharat Iyengar ▴ 330

0

Entering edit mode

Thanks Sean. Regarding your last suggestion, I thought that it could introduce problems, since genes with very low median signal can show a high variance when logratio transformed. What do you think?

Thanks

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by fbrundu ▴ 350

0

Entering edit mode

Well log transformation will bring down the variance. Imagine 4 samples [0.5, 2, 8, 32]. Without log transformation the variance is 213.5625 but when you log2 transform the data then the variance reduces to 6.67

In any case if the expression is consistently low then the variance will be low. You should be careful about log transformations especially when doing differential expression studies. I would suggest that you do the log transformation after selecting for median and variance.

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Bharat Iyengar ▴ 330

0

Entering edit mode

I followed your first advice using custom thresholds.

Hoping this is useful, the code of the pipeline is available at

	import pandas as pd
	import numpy as np
	import sys
	import random as rnd

	csv = sys.argv[1]
	out = sys.argv[2]

	df = pd.read_table(csv, sep='\t', index_col=0)

	# log2
	df = np.log2(df)

	# remove genes with more than 5% of samples below the 5th percentile
	thres = np.percentile(df, 5)
	quantile = df.quantile(0.05, axis=1)
	genes = quantile[quantile > thres].index.tolist()
	df = df.ix[genes, :]

	# ratio
	median = df.median(axis='columns')
	tcga = df.sub(median, axis='index')

	# feature selection, most variant genes
	Kgenes = 1000

	std = df.std(axis='columns')
	rank = std.rank(ascending=False)
	genes = rank[rank < Kgenes].index.tolist()

	df = df.ix[genes, :]

	# random sampling 100 samples
	Ksamples = 100
	samplesix = rnd.sample(range(len(newcolumns)), Ksamples)
	samples = [newcolumns[i] for i in samplesix]

	df = df.ix[:, samples]

	# output
	df.to_csv(out, sep='\t')

view raw l2r_fsel_srs.py hosted with ❤ by GitHub

I filtered out the genes which were below the overall 5th percentile in more than the 5% of the samples. I think it could be a reasonable threshold. In case tell me.

Thanks

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by fbrundu ▴ 350

Login before adding your answer.