A Question About Translating Transcript-Based Microarray Data Into Gene-Based Microarray Data
11.7 years ago
KCC ★ 4.1k

I have some data derived from a microarray. I don't know what chip it's from or anything like that. For each condition, I just have a list of transcript names and three numbers next to them representing three replicates. The numbers are supposed to be on a log scale. I assume it's base 2, but I have no idea.

I want to translate the information into information about genes. How do I take the numbers from multiple transcripts and combine them into a single score for a gene?

Update: The platform is Agilent. It was a custom chip made for the C. elegans genome. I have good information about the conditions it was collected under, such as the strain and growth stage.

I noticed that the columns say 'qnorm'.

Update 2: From googling, I'm guessing qnorm means quantile normalization. Can somebody confirm this? (I realize this might be completely vague information and ultimately not confirmable.) However, does this seem like a reasonable assumption?

microarray • 2.8k views

Are you even sure that it is microarray data (and not, for example, RNA-seq data)? What kind of transcript identifiers do you have? How are the samples described? What does the distribution of values look like? With this kind of information we may be able to make an educated guess at the platform. Until we feel confident about the platform, it may be inappropriate to summarize values to the gene level. For example, if they are actually Cufflinks-derived RNA-seq data, it might be more appropriate to sum transcript values for all isoforms of a gene. A safer bet may be to proceed with transcript-level analysis of the values you have, and then map to genes for downstream interpretation. What kind of analysis do you hope to perform with the gene-level values?
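
To look at the distribution of values, a rough check along these lines might help. This is only a sketch under assumptions: the function names and the toy numbers are made up for illustration, and the data are assumed to be already loaded as one list of values per replicate column. It reports the range of each column (log2 intensities from a 16-bit scanner would sit at roughly 16 or below) and tests whether all columns share the same sorted values, which is what plain quantile normalization produces. Implementations that average tied values can break exact equality, so a negative result is not conclusive.

```python
import statistics

def summarize_column(values):
    """Return (min, median, max) of one replicate column."""
    return min(values), statistics.median(values), max(values)

def looks_quantile_normalized(columns, tol=1e-6):
    """True if every column has the same sorted values (within tol)."""
    reference = sorted(columns[0])
    for col in columns[1:]:
        if len(col) != len(reference):
            return False
        # Compare the two distributions rank by rank.
        if any(abs(a - b) > tol for a, b in zip(sorted(col), reference)):
            return False
    return True

# Toy replicate columns (made-up numbers, not from the real data set).
rep1 = [5.0, 7.5, 9.0, 12.0]
rep2 = [7.5, 5.0, 12.0, 9.0]
rep3 = [9.0, 12.0, 5.0, 7.5]

print(summarize_column(rep1))
print(looks_quantile_normalized([rep1, rep2, rep3]))
```

If the three replicate columns of a condition have identical sorted values, that would be consistent with quantile normalization, though it does not prove which software or settings were used.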


@Obi: Thanks for the feedback. I will modify my question.

11.7 years ago
Michael 55k

Technically, and only technically, you can get a transcript ID to gene ID mapping from e.g. BioMart, then compute a central estimate (e.g. mean or median) for each gene. It is, however, questionable whether the result will be of much use to you, given that you know almost nothing about the data and that there are seemingly no replicates. Do you even know the experimental conditions? There is so much microarray data out there with full annotation, for example in ArrayExpress, that I would advise you to decline to analyse this data any further.

I see it also as an aspect of professionalism to judge whether or not a given data set warrants further analysis, and as you describe the case it is very likely that there are better data sets from similar experiments out there.
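
If you do go ahead anyway, the purely technical step (map transcripts to genes, then take a central estimate per gene) could look roughly like the sketch below. Everything named here is hypothetical: the identifiers, the values, and the function `summarize_to_genes` are made up for illustration, and the mapping table is assumed to have been exported beforehand (for instance from BioMart) and loaded into a dictionary.

```python
import statistics
from collections import defaultdict

# Hypothetical transcript -> gene mapping (e.g. loaded from an exported table).
transcript_to_gene = {"T1.a": "geneA", "T1.b": "geneA", "T2.a": "geneB"}

# Hypothetical data: transcript -> three replicate values on a log scale.
expression = {
    "T1.a": [8.0, 8.5, 7.5],
    "T1.b": [9.0, 9.5, 8.5],
    "T2.a": [5.0, 5.5, 5.25],
    "T3.a": [6.0, 6.1, 6.2],  # deliberately missing from the mapping
}

def summarize_to_genes(expression, transcript_to_gene, center=statistics.median):
    """Collapse transcript rows to one row per gene, replicate by replicate."""
    per_gene = defaultdict(list)
    unmapped = []
    for transcript, replicates in expression.items():
        gene = transcript_to_gene.get(transcript)
        if gene is None:
            unmapped.append(transcript)  # keep track instead of dropping silently
            continue
        per_gene[gene].append(replicates)
    # zip(*rows) groups the values of replicate 1, replicate 2, ... together.
    summary = {
        gene: [center(vals) for vals in zip(*rows)]
        for gene, rows in per_gene.items()
    }
    return summary, unmapped

summary, unmapped = summarize_to_genes(expression, transcript_to_gene)
print(summary)
print("unmapped transcripts:", unmapped)
```

Keeping the replicates separate, rather than pooling them into a single number, leaves the gene-level table usable for any later statistical test. Passing `center=statistics.mean` switches from the median to the mean.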


Yeah, seconded - I wouldn't trust data that I know nothing about (including even what scale the numbers are in).


Sorry. I gave the wrong impression in my question. There are replicates, 3 numbers for each transcript. I only meant to imply that I don't have much information about how the numbers were generated or even what they mean besides 'expression level'. I don't even 100% trust that they are log scale. (In a perfect world, I would refuse, yes. I have so far refused.)

11.7 years ago

For the transcript to gene bit, if you have the transcripts in the form genename-RA (I don't work on C. elegans, so I'm assuming it's a standard transcript naming convention), extract just the gene name bit. For example, for the transcript CG12538-RA in flies, CG12538 is the gene identifier. You'll then likely find that the genome annotation has changed since the experiment was performed, so you'll need to update the gene identifiers to the most current version. InterMine should be able to do this (upload your list to modMine), maybe WormBase does too?
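
A minimal sketch of the "extract just the gene name bit" step, assuming identifiers really do follow the fly-style genename-RA pattern from the example above. The regular expression and the function name are my own illustration; C. elegans transcript names may well follow a different pattern, in which case the expression would need adapting.

```python
import re

# Fly-style transcript suffix such as "-RA" or "-RB" at the end of the name.
TRANSCRIPT_SUFFIX = re.compile(r"-R[A-Z]+$")

def gene_from_transcript(transcript_id):
    """Strip the transcript suffix; IDs without one are returned unchanged."""
    return TRANSCRIPT_SUFFIX.sub("", transcript_id)

for tid in ["CG12538-RA", "CG12538-RB", "CG12538"]:
    print(tid, "->", gene_from_transcript(tid))
```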

Once you've updated the gene identifiers, for some genes you might find that there are multiple probes hitting the same transcripts, and/or that there are results for multiple transcripts from the same gene. I'd combine those using a simple average, but with some sanity checks in place - if you have very different values for different transcripts, or even for different probes within the same transcript, I'd be tempted to filter those from the final dataset (e.g. if you have one transcript that's really upregulated, and another that's really downregulated, it's not really possible to give a sensible value for the gene as a whole).
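
The averaging with a sanity check could be sketched like this. The threshold `max_spread` and the function name are hypothetical, and 1.0 log unit is an arbitrary illustration rather than a recommended cutoff. The input is assumed to be a list of replicate rows, one row per transcript or probe of the same gene.

```python
from statistics import mean

def combine_transcripts(rows, max_spread=1.0):
    """Average the rows of one gene replicate by replicate.

    Returns None when the per-row means differ by more than max_spread,
    i.e. when the transcripts/probes disagree too much to summarise.
    """
    row_means = [mean(r) for r in rows]
    if max(row_means) - min(row_means) > max_spread:
        return None
    # zip(*rows) groups the values of replicate 1, replicate 2, ... together.
    return [mean(vals) for vals in zip(*rows)]

# Made-up values: two transcripts that agree, and two that clearly do not.
print(combine_transcripts([[8.0, 8.5, 7.5], [8.5, 9.0, 8.0]]))
print(combine_transcripts([[4.0, 4.0, 4.0], [10.0, 10.0, 10.0]]))
```

Genes that come back as None are the ones I'd inspect by hand or leave out of the final dataset.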
