How to determine whether a gene is present or absent?
1
2
Entering edit mode
10.0 years ago
lovetl2002 ▴ 40

Suppose I have a level-3 expression matrix from TCGA (normalized), how to filter out absent genes before further processing like gene coexpression analysis? Currently what I do is:

  1. Set a lower threshold, like 5% quantile of all the expression level.
  2. Test whether a gene have enough portion of samples that pass this threshold. Like if 20% of samples have expression level of this gene larger than the threshold, then it can be considered as present.

I think my current method works to some extent. But is it robust enough?

microarray TCGA RNA-Seq • 4.6k views
ADD COMMENT
1
Entering edit mode
10.0 years ago

Keep in mind that some proteins are more efficient at their jobs than others. So a low transcription rate doesn't necessarily mean low or absent protein activity. You've also probably noticed that all or most of the lower quartile genes have a readcount of zero. There are two possible reasons for this:

  1. The gene is not transcribed at all, which is fine, and is exactly what you're trying to remove.
  2. The RNA-seq performed is insufficiently deep to capture the transcript expression. We can't recover these without sequencing more RNA.

Since TCGA RNA-seq is enriched for RNA with poly-A tails (more likely to be protein coding mRNA), I would go as far as to say that a single high quality uniquely mapped paired-end read across a transcript is sufficient evidence of transcription. But I'm a hopeless idealist, so a stricter cutoff from TCGA's RNA-seq working group is to study only genes with at least 3 reads in at least 70% of samples. Click here and here for related notes.

ADD COMMENT
0
Entering edit mode

Hi Cyriac,

When you say 3 reads at the end, do you mean 'paired-end read' as you defined in the beginning ? Wouldn't it be 3 fragments if its so ?

ADD REPLY
1
Entering edit mode
Yes. Some TCGA RNA-seq was done with single end reads. Hence the generalization.
ADD REPLY
0
Entering edit mode

Thanks. What if I use microarray data?

ADD REPLY
0
Entering edit mode

I only have rough experience with gene expression arrays, so I can't recommend any cutoffs. But hybridization bias and the noisiness of probe intensities, make it hard to reach the kind of determinism we get from RNA-seq - of whether a fragment of a specific mRNA was transcribed or not.

ADD REPLY
0
Entering edit mode

Hi

Your topic is related to my project I am working with TCGA colorectal gene expression

Could you help me about Create a Tissue Model for Convert the Affymetrix data to a format that can be used by the "createTissueSpecific()" Cobra Toolbox function we use biocLite("affy") but my data is unc_agilentg4502a_07 and it is ADF format how can convert it to present, absent AP.txt and EID.txt is value of gene expression. Which cutoff is best for it?

ADD REPLY
0
Entering edit mode

Your question is not related to this thread. Please open a new one... and use tags to notify appropriate watchers.

ADD REPLY

Login before adding your answer.

Traffic: 1830 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6