Question

Gene Expression In Rnaseq Data

7

14.2 years ago

Yahan ▴ 400

This question is rather basic.

When is a gene considered to be expressed in a comparative RNA-seq experiment?

I have RPKM values for each gene.

When this value is close to zero, is the gene considered to be expressed, or is there another explanation for why there is a trace of the transcript?

rna gene • 7.7k views

updated 14.2 years ago by Michael 55k • written 14.2 years ago by Yahan ▴ 400

It is not clear whether you are interested in differential analysis or simply in evidence of transcripts. For DE analysis you are better off using the raw counts, at least with DESeq or edgeR; these packages compute the normalization internally.

14.2 years ago by Michael 55k

I am interested in finding genes that are expressed in only one tissue and not in the others. So I would say it is differential, but in an absolute way: no up- or down-regulation.

14.2 years ago by Yahan ▴ 400

The problem is that you cannot conclude that something is not expressed just because you have no reads.

14.2 years ago by Michael 55k
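
As a rough illustration of the kind of screen discussed in these comments, here is a minimal Python sketch that flags genes with reads in exactly one tissue. The gene names, tissue names, counts and both thresholds (`min_reads`, `max_other`) are made up for the example and are not recommendations. Keep the caveat above in mind: zero reads in a tissue is absence of evidence, not proof that the gene is not expressed there.

```python
# Hypothetical raw read counts per gene in each tissue (illustrative only).
counts = {
    "geneA": {"liver": 250, "brain": 0, "heart": 0},
    "geneB": {"liver": 40, "brain": 35, "heart": 52},
    "geneC": {"liver": 1, "brain": 0, "heart": 0},
}


def tissue_specific_candidates(counts, min_reads=10, max_other=0):
    """Return {gene: tissue} for genes detected in exactly one tissue.

    min_reads: reads needed to call a gene 'detected' (assumed threshold).
    max_other: reads tolerated in every other tissue (assumed threshold).
    """
    candidates = {}
    for gene, per_tissue in counts.items():
        detected = [t for t, n in per_tissue.items() if n >= min_reads]
        if len(detected) != 1:
            continue  # detected nowhere, or in more than one tissue
        others = [n for t, n in per_tissue.items() if t != detected[0]]
        if all(n <= max_other for n in others):
            candidates[gene] = detected[0]
    return candidates


print(tissue_specific_candidates(counts))  # {'geneA': 'liver'}
```

Genes flagged this way are only candidates; they would still need replicates and a closer look at the alignments before drawing conclusions.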

Ram · Answer 1 · 2011-03-02

Edit after noticing that this is mainly about differential RNA-seq analysis:

First and foremost, to assess significance you need biological replicates; only replicates give you an estimate of variance. This has been discussed, for example, in this question:

Rna-Seq Biological Replicates...

Second, I would like to mention that you cannot prove that a gene is not expressed just because you haven't found evidence for it (a non-existence proof is not feasible here).

For computing p-values of differential expression I recommend the R packages DESeq or edgeR.

I have already explained some of this in another answer, which links to other materials and papers: What Metrics Are Best To Describe The "Coverage" Of Rna-Seq Data?

However, it is definitely a problem if a gene has very few or zero counts in one or more groups; the current methods might not be able to assign p-values properly, or at all, in these cases.

If I understand you correctly, you want to know if a very small number of reads (say at least one) in an RNA-seq experiment is evidence for the region being transcribed (not necessarily expressed).

Yes, every single sequence and its alignment is evidence in itself, provided the sequencer or protocol doesn't make up sequences! We have to agree on this point: the sequence doesn't lie, but of course there can be errors.

Of course you would like to have more evidence, so exons with very low coverage need to be studied more closely.

Where could the reads come from:

They could originate from a duplicated/highly similar or repetitive region
They could be poor alignments of reads with many sequencing errors
The sequences could be contamination from vectors or adaptors

To show that your gene is transcribed, you have to look at the individual alignments:

Filter out alignments with duplicate hits to the genome: do you still get coverage?
Look at the individual alignments: how good are they, and do they contain large indels?
Apply quality filtering (after removing duplicates, not before).
Look for protocol-specific contamination.
Look at where in the gene the alignments fall: are they all at one locus, or do they span exons/introns?
Re-align the reads against the genome using a more sensitive aligner (e.g. FASTA or SSEARCH). Do they still align to only a single position?
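
A few of the checks above can be roughed out in plain Python on SAM-formatted alignment lines. This is only a sketch under some assumptions: the thresholds `min_mapq` and `max_edit_distance` are arbitrary, the example records are made up, and it assumes your aligner writes the optional `NH` (number of reported alignments) and `NM` (edit distance) tags, which not every aligner does.

```python
def parse_tags(fields):
    """Parse optional SAM fields (TAG:TYPE:VALUE) into a dict."""
    tags = {}
    for field in fields:
        tag, typ, value = field.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return tags


def keep_alignment(sam_line, min_mapq=30, max_edit_distance=2):
    """Return True if a SAM record passes some simple sanity filters."""
    cols = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(cols[1]), int(cols[4])
    tags = parse_tags(cols[11:])
    # Drop unmapped (0x4), secondary (0x100), duplicate (0x400)
    # and supplementary (0x800) records first.
    if flag & (0x4 | 0x100 | 0x400 | 0x800):
        return False
    # Then apply the mapping-quality filter.
    if mapq < min_mapq:
        return False
    # Reads reported at more than one genomic position (if NH is present).
    if tags.get("NH", 1) > 1:
        return False
    # Poor alignments with many mismatches/indels (if NM is present).
    if tags.get("NM", 0) > max_edit_distance:
        return False
    return True


def make_sam_line(name, flag, mapq, nh, nm):
    """Build a minimal, made-up SAM record for demonstration."""
    return "\t".join([name, str(flag), "chr1", "100", str(mapq), "8M", "*",
                      "0", "0", "ACGTACGT", "IIIIIIII",
                      f"NH:i:{nh}", f"NM:i:{nm}"])


sam_lines = [
    make_sam_line("r1", 0, 60, 1, 0),     # unique, clean alignment
    make_sam_line("r2", 0, 3, 4, 0),      # maps to four positions
    make_sam_line("r3", 1024, 60, 1, 0),  # flagged as a duplicate
    make_sam_line("r4", 0, 60, 1, 5),     # many mismatches
]
kept = [line.split("\t")[0] for line in sam_lines if keep_alignment(line)]
print(kept)  # ['r1']
```

If a lowly covered gene still has reads after this kind of filtering, that is a better starting point for the manual inspection and re-alignment steps than the raw coverage.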

Hope this helps.

score 4 · Answer 2 · 2011-03-02

If you have an RPKM close to zero, the simplest explanation is either that the gene is not expressed or that you haven't achieved sufficient sequencing depth to detect it. When generating RNA-seq data, there is a certain margin of error in both base calling and read mapping. With any read mapping software, the user chooses the number of mismatched bases allowed in a mapped read. If you set this number too high, you're likely to end up with a number of improperly aligned reads, which could lead to falsely detected low-expression genes (however, the default settings are usually low enough to avoid this). If you think this might be the case, you could remap your reads with a stricter mismatch threshold and see if this changes the results.

If you're able, you should also have a look at the raw read counts for each gene. Since RPKM is normalized for (a) the number of mapped reads in a given sample and (b) the length of the transcript, you'll often find that genes with an RPKM close to zero have raw read counts that are a bit higher. I've used ERANGE in the past, which returns both RPKM and raw counts for each gene.
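
To make the relationship between raw counts and RPKM concrete, here is a small Python sketch of the standard definition (reads per kilobase of transcript per million mapped reads). The gene length, read count and library size in the example are made up.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads / (length in kb * mapped reads in millions)
         = reads * 1e9 / (length in bp * mapped reads)
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 5 reads on a 2,000 bp transcript,
# in a sample with 20 million mapped reads.
print(rpkm(5, 2000, 20_000_000))  # 0.125
```

So an RPKM that looks like "almost zero" can still correspond to several reads, which is why it is worth checking the raw counts as well.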

When it comes to biological interpretation, these low-expression genes are problematic. Take the example of a gene with an average of 2 reads mapped in 'control' samples and 4 reads mapped in 'treatment' samples. It's possible that differential expression analysis will yield a statistically significant result here, but the biological meaning is ambiguous.
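
To get a feel for how little information such low counts carry, here is a deliberately simplified Python sketch: an exact two-sided binomial test on a single pair of counts, assuming equal library sizes and a 50:50 split under the null. This is not what DESeq or edgeR do (they model variance across replicates), and the function name is my own; it only shows how the same two-fold ratio behaves at different depths.

```python
from math import comb


def binomial_two_sided_p(k_a, k_b):
    """Exact two-sided binomial p-value for a k_a vs k_b read split.

    Assumes both samples have the same library size, so that under the
    null each read is equally likely to come from either sample.
    """
    n = k_a + k_b
    if n == 0:
        return 1.0
    upper = sum(comb(n, i) for i in range(k_b, n + 1)) / 2**n  # P(X >= k_b)
    lower = sum(comb(n, i) for i in range(0, k_b + 1)) / 2**n  # P(X <= k_b)
    return min(1.0, 2 * min(upper, lower))


print(binomial_two_sided_p(2, 4))    # 0.6875
print(binomial_two_sided_p(20, 40))  # same ratio, ten times the reads
```

With 2 versus 4 reads the split is entirely compatible with chance, while the same ratio at ten times the depth is not. A "significant" call on a handful of reads therefore depends heavily on depth and on how the method pools information, which is part of why the biological meaning stays ambiguous.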