Question

Collapse Probes For Same Gene

2

13.5 years ago

Rituriya ▴ 50

Dear All,

When a gene has more than one row of expression values, I want to collapse those rows into a single row that holds the highest value (the maximum) in each column for that gene.

The file contains 15 columns of expression values for 15 tissues. It's a text file of Affymetrix probes with hgu133plus2 annotation. I have tried GATEexplorer, ADAPT, BrainCDF, etc., but none of them was useful to me. Can anyone suggest a solution? I also tried GenePattern, but my file is too large for it to accept.

gene • 13k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 13.5 years ago by Rituriya ▴ 50

score 9 · Answer 1 · 2012-02-02

This is quite easy in R.

First, you'll need an extra column with the gene names. Assuming your file is tab-delimited with column headers, read it into R:

mydat <- read.table("myfile.txt", header = TRUE, sep = "\t", stringsAsFactors = TRUE)

Assuming that gene names are in column 16 with the header "gene", that column should now be of class factor:

class(mydat$gene)
[1] "factor"

You can now calculate the maximum by gene name using aggregate:

mydat.max <- aggregate(. ~ gene, data = mydat, max)

The new variable mydat.max is a data frame with gene names in the first column and, for each gene, one row holding the maximum of every other column.
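Two details are worth checking on a real file, sketched here under assumptions: the column names probe, t1 and t2 and the inline data are hypothetical stand-ins for your own file. A non-numeric column such as a probe ID should be dropped before aggregating, since a maximum is not meaningful for it. Also, the formula interface of aggregate uses na.action = na.omit by default, so any row with a missing value in any column is dropped before the maximum is taken.

```r
# Hypothetical tab-delimited input: probe ID, two tissues, gene symbol
raw <- "probe\tt1\tt2\tgene
p1\t1.0\t5.0\tA
p2\t3.0\tNA\tA
p3\t2.0\t4.0\tB"

probes <- read.table(text = raw, header = TRUE, sep = "\t",
                     stringsAsFactors = TRUE)

# Drop the probe ID column so only expression values and gene remain
expr <- probes[, names(probes) != "probe"]

# Default behaviour: the row for p2 is removed entirely because t2 is NA,
# so its t1 value of 3.0 never contributes to the maximum for gene A
default.max <- aggregate(. ~ gene, data = expr, FUN = max)

# Keep rows with NA and ignore the missing values inside max instead
keep.na <- aggregate(. ~ gene, data = expr,
                     FUN = function(x) max(x, na.rm = TRUE),
                     na.action = na.pass)

print(default.max)
print(keep.na)
```

If a gene has only missing values in some column, max(x, na.rm = TRUE) returns -Inf with a warning, so it may be worth checking for that case on real data.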

As a toy example, suppose the data frame mydat looks like this:

a    b  gene
1   11     A
2   12     A
3   13     A
4   14     A
5   15     A
6   16     B
7   17     B
8   18     B  
9   19     B
10  20     B

After aggregate, it becomes:

gene    a  b
   A    5 15
   B   10 20
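For anyone who wants to run the toy example, here is a minimal sketch that rebuilds the table above in R and applies the same aggregate call. The data.frame construction is a reconstruction of the printed table, not something read from a file.

```r
# Rebuild the toy table: a = 1..10, b = 11..20, five rows per gene
mydat <- data.frame(
  a    = 1:10,
  b    = 11:20,
  gene = rep(c("A", "B"), each = 5)
)

# One row per gene, holding the maximum of each remaining column
mydat.max <- aggregate(. ~ gene, data = mydat, FUN = max)

print(mydat.max)
```

Printing mydat.max should show the two rows from the table above, with R's row names added on the left.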

Ram · Answer 2 · 2012-02-02

2

13.5 years ago

ALchEmiXt ★ 1.9k

Why would you want to do that in the first place?

Duplicates for genes on arrays are useful as controls, but they usually also let you detect differentially expressed variants (including possible splice variants). So just combining them into a single gene value loses analysis resolution and is probably dangerous as well.

ADD COMMENT • link 13.5 years ago by ALchEmiXt ★ 1.9k

1

It's quite common to collapse probes down to the gene level in some applications, and often a very simple metric such as the median is used. You can argue that information is lost, but so is noise.
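As a rough sketch of the median variant, the aggregate approach from the accepted answer works with median swapped in for max. The toy data frame below is the one from that answer, rebuilt here so the snippet runs on its own.

```r
# Toy data from the answer above: five rows per gene
mydat <- data.frame(
  a    = 1:10,
  b    = 11:20,
  gene = rep(c("A", "B"), each = 5)
)

# Collapse to one row per gene using the per-column median
mydat.med <- aggregate(. ~ gene, data = mydat, FUN = median)

print(mydat.med)
```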

ADD REPLY • link 13.5 years ago by Neilfws 49k

1

@neilfws I agree on the data reduction and possibly the noise. But taking the highest value....?
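To make the concern concrete, here is a small sketch with hypothetical data (the probe names and tissue columns are made up). Taking the maximum per column can build a gene row out of values from different probes. One alternative, shown here as an assumption rather than something proposed in this thread, is to keep the whole row of a single probe per gene, for example the probe with the highest mean across tissues.

```r
# Hypothetical data: gene A has two probes whose ranking flips between tissues
probes <- data.frame(
  probe   = c("p1", "p2", "p3"),
  gene    = c("A", "A", "B"),
  tissue1 = c(10, 2, 5),
  tissue2 = c(1, 8, 6)
)
expr.cols <- c("tissue1", "tissue2")

# Per-column maximum: gene A gets 10 (from p1) and 8 (from p2),
# a profile that no single probe actually measured
col.max <- aggregate(cbind(tissue1, tissue2) ~ gene, data = probes, FUN = max)

# Alternative: keep one complete probe per gene, chosen by highest mean
probes$avg <- rowMeans(probes[, expr.cols])
ranked     <- probes[order(probes$gene, -probes$avg), ]
one.probe  <- ranked[!duplicated(ranked$gene), c("gene", "probe", expr.cols)]

print(col.max)
print(one.probe)
```

Which rule is appropriate depends on the downstream analysis; the sketch only shows how the two results differ.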

ADD REPLY • link 13.5 years ago by ALchEmiXt ★ 1.9k

0

Perhaps the OP meant to select the row with the lowest P-value (least likely to occur by chance)? An approach is suggested here.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 9.4 years ago by alexvpickering ▴ 60

score 1 · Answer 3 · 2012-02-02

1

13.5 years ago

Malachi Griffith 20k

If you have the .CEL file for your array and wish to summarize it from the probe level to the gene level, you might try Aroma, Expression Console, Affy Power Tools, RMAExpress, or one of many others.

Or perhaps you already have a processed file that contains gene expression values but still has some cases where the same gene has multiple values. In that case, many people script this kind of file manipulation in R, Perl, Awk, Python, etc. If that is what you mean, you can try posting a slice of your file and someone may provide examples...

ADD COMMENT • link 13.5 years ago by Malachi Griffith 20k

2

Perhaps a small snippet/example using any of the packages in your first paragraph would be more informative.

ADD REPLY • link 13.5 years ago by brentp 24k