Question

RNA-Seq Raw Counts

1

13.2 years ago

Gregor Rot ▴ 550

Hi all,

When counting aligned reads that map to a gene (the exons), how do you compute the raw expression? Does an aligned read only need to overlap an exon to be counted, or does it need to map entirely inside an exon?

Example:

Exon location: 100-200
Alignment 1 : 130-170 (+1)
Alignment 2 : 90-120 (+1 yes or no?)

For simplicity I ignore strand here.

Looking at this:

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

The two approaches in question could be called strict (entire overlap) and non-strict (partial overlap).
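
To make the two rules concrete, here is a minimal sketch applied to the example above. It assumes 1-based inclusive coordinates and ignores strand; the function names are made up for illustration and are not taken from HTSeq.

```python
def overlaps(exon, aln):
    """Non-strict rule: the alignment shares at least one base with the exon."""
    exon_start, exon_end = exon
    aln_start, aln_end = aln
    return aln_start <= exon_end and aln_end >= exon_start


def contained(exon, aln):
    """Strict rule: the alignment lies entirely inside the exon."""
    exon_start, exon_end = exon
    aln_start, aln_end = aln
    return exon_start <= aln_start and aln_end <= exon_end


def count_reads(exon, alignments, strict=False):
    """Count alignments assigned to the exon under the chosen rule."""
    rule = contained if strict else overlaps
    return sum(1 for aln in alignments if rule(exon, aln))


exon = (100, 200)
alignments = [(130, 170), (90, 120)]  # Alignment 1 and Alignment 2

print(count_reads(exon, alignments, strict=True))   # 1: only Alignment 1
print(count_reads(exon, alignments, strict=False))  # 2: both alignments
```

So Alignment 2 is counted only under the non-strict rule.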

There are different ways to count reads; I am just wondering which approaches you use.

Thanks, Gregor

rna-seq expression • 6.1k views

ADD COMMENT • link updated 13.0 years ago by Ryan Dale 5.0k • written 13.2 years ago by Gregor Rot ▴ 550

0

Are you using an aligner that can deal with splicing? If so, that will need to factor into the answer.

ADD REPLY • link 13.2 years ago by Sean Davis 27k

score 1 · Answer 1 · 2012-06-15

It probably depends on the biological questions you're asking.

If you have reads aligned with a spliced aligner, you can use htseq-count to address questions at the gene level since it will count reads that partially overlap exons according to rules described on the page you linked to. Those rules are also discussed here in the context of "how much do you believe your annotations", which might be helpful to you in deciding what kind of counting to perform.
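
As a rough illustration of those rules, the sketch below resolves a read against a set of features in the three overlap modes described in the htseq-count documentation (union, intersection-strict, intersection-nonempty). It is a simplified per-base reimplementation under my own assumptions (1-based inclusive coordinates, unspliced reads, no strand), not HTSeq's actual code, and the names `features_at` and `assign` are hypothetical.

```python
def features_at(pos, features):
    """Return the names of all features covering a 1-based position."""
    return {name for name, (start, end) in features.items() if start <= pos <= end}


def assign(read, features, mode="union"):
    """Assign a read to a feature, or report 'no_feature' / 'ambiguous'."""
    start, end = read
    # One set of feature names per base of the read.
    per_base = [features_at(p, features) for p in range(start, end + 1)]
    if mode == "union":
        hits = set().union(*per_base)
    elif mode == "intersection-strict":
        hits = set.intersection(*per_base)
    elif mode == "intersection-nonempty":
        nonempty = [s for s in per_base if s]
        hits = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"
    return next(iter(hits))


features = {"geneA": (100, 200), "geneB": (180, 300)}  # made-up annotation
for mode in ("union", "intersection-strict", "intersection-nonempty"):
    # A read hanging off the start of geneA, and a read inside geneA
    # that partially overlaps geneB.
    print(mode, assign((90, 120), features, mode), assign((170, 190), features, mode))
```

In this toy case the read at 90-120 is counted for geneA in every mode except intersection-strict, while the read at 170-190 is ambiguous only under union.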

This will work even if you don't have reads from a spliced aligner. Keep in mind that the annotations you use (e.g., full genes vs just exons) can influence the results. Once you have the counts for each gene you can normalize to gene length and library size to get an RPKM-like value, which will be correlated with "raw expression".
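
For reference, the RPKM-like normalization amounts to dividing the count by the gene length in kilobases and by the library size in millions of mapped reads. A minimal sketch with made-up numbers (the gene names, counts, and lengths below are purely illustrative):

```python
def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count / ((gene_length_bp / 1e3) * (total_mapped_reads / 1e6))


counts = {"geneA": 500, "geneB": 500}     # raw counts per gene (illustrative)
lengths = {"geneA": 1000, "geneB": 4000}  # summed exon length in bp (illustrative)
library_size = 10_000_000                 # total mapped reads (illustrative)

for gene in counts:
    print(gene, rpkm(counts[gene], lengths[gene], library_size))
# geneA 50.0
# geneB 12.5
```

Both genes have the same raw count here, but the longer gene ends up with a quarter of the normalized value.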

However, if you would like to address differential isoform expression, then htseq-count might not be what you want since the way it deals with ambiguous or multimapping reads is to ignore them. Instead, check out Cufflinks or Scripture which put a lot of effort into assigning initially ambiguous reads to specific isoforms. These tools also perform the normalization so that results can be interpreted directly as expression.

score 0 · Answer 2 · 2012-06-14

0

13.2 years ago

Istvan Albert 103k

The counts will have to include reads that map to the exons plus reads that map to the exon junctions. Then you need to normalize that count to the length of the transcript and you'll end up with a value that correlates with the relative expression level of the transcript (relative to the other expression levels).

The tool that you link to most likely cannot do this.

0

How do you perform this?

ADD REPLY • link 13.2 years ago by Nicolas Rosewick 11k

0

One needs a mapper like MapSplice to map to junctions.

ADD REPLY • link 13.2 years ago by Istvan Albert 103k

0

Thanks for the help. The reads I had in mind map to exon-intron junctions; I was just wondering what you usually do with them.

ADD REPLY • link 13.2 years ago by Gregor Rot ▴ 550

0

What they map to are exon-exon junctions, and those do not necessarily join consecutive exons; middle exons may be skipped.
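
To illustrate, a spliced aligner typically reports such a read with an `N` operation in the SAM CIGAR string, so the read can be split into reference blocks and each block compared with the exons. The sketch below is a simplified, hypothetical parser (deletions are kept inside a block, coordinates are 1-based inclusive, and the exon table is made up); it is not how any particular counting tool is implemented.

```python
import re


def aligned_blocks(start, cigar):
    """Split a spliced alignment into reference blocks separated by N gaps."""
    blocks = []
    pos = start
    block_start = None
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=XD":            # consumes reference, stays within a block
            if block_start is None:
                block_start = pos
            pos += length
        elif op == "N":             # skipped region (intron): close the block
            if block_start is not None:
                blocks.append((block_start, pos - 1))
                block_start = None
            pos += length
        # I, S, H and P do not consume reference positions
    if block_start is not None:
        blocks.append((block_start, pos - 1))
    return blocks


exons = {"exon1": (100, 200), "exon2": (300, 400), "exon3": (501, 600)}
blocks = aligned_blocks(181, "20M300N30M")
print(blocks)  # [(181, 200), (501, 530)]

# Exons touched by at least one block of the read.
hit = [name for name, (s, e) in exons.items()
       if any(b_start <= e and b_end >= s for b_start, b_end in blocks)]
print(hit)     # ['exon1', 'exon3']
```

Here the read joins exon1 directly to exon3, so exon2 is skipped and receives no count from it.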

ADD REPLY • link 13.2 years ago by Istvan Albert 103k

0

If you're seeing exon-intron junctions, perhaps you are observing intron retention, an interesting phenomenon related to regulation (alternatively, it could be some contaminating genomic DNA). If you want to quantify the effect, i.e. count events that correspond to defined structures, then you have to count any reads that map cleanly to those structures. If you have a lot of reads that map outside the bounds of your structures, then I'd say either your structures are not sufficient, or your quantification of them will be a little bit fishy if you allow those reads into the expression count.

ADD REPLY • link 13.2 years ago by seidel 11k

0

Just to be clear, the normalization by length of transcript is orthogonal to the counting. Whether or not normalizing by gene length is important depends on the applications downstream of the counting. In some cases, normalizing by gene length is counterproductive.

ADD REPLY • link 13.2 years ago by Sean Davis 27k

0

True, I am just jumping ahead in trying to also answer what I think the OP is after.

ADD REPLY • link 13.2 years ago by Istvan Albert 103k