Question

Htseq Count To Get Rpkm Values

0

13.2 years ago

Nebo ▴ 80

I want to use htseq-count (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) to get gene counts (RPKM) to analyze DE genes in an RNA-seq experiment. I'm NOT interested in alternative splicing, only in getting RPKM values for downstream analysis with DEGseq, an R package. htseq-count requires a GFF file, but I only have my reference.fasta. Is there any way I can use the FASTA file directly, or convert it to GFF?

htseq gene rpkm • 12k views

ADD COMMENT • link updated 13.2 years ago by Urchgene ▴ 30 • written 13.2 years ago by Nebo ▴ 80

1

1) Is there any gene information in the FASTA headers? (Please post an example if so.) 2) What genome are you working with?

ADD REPLY • link 13.2 years ago by Ryan Dale 5.0k

0

The only gene information is the gene ID. That is all I need; I'm not looking for other features such as exons. I'm working with sugarcane, so there is no reference genome. As a reference I use either SAS (sugarcane assembled sequences) or the sorghum gene models.

ADD REPLY • link 13.2 years ago by Nebo ▴ 80

0

See my recipe here: Deg Analysis On 2 Mirna Library. Since it sounds like you are using a transcript FASTA file, the concept is the same.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 13.2 years ago by Jeremy Leipzig 22k

score 3 · Answer 1 · 2011-11-02

3

13.1 years ago

Urchgene ▴ 30

I do not think it's wise to use RPKM values for DGE in the edgeR or DEGseq R packages, because RPKM values are not raw counts; they have already been normalized.

If you want to do this, get the script from https://github.com/vsbuffalo/sam2counts and count the raw reads that map to features in a SAM file.
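To illustrate what "counting raw reads per feature" means when each reference sequence is itself a gene or transcript, here is a minimal Python sketch. It is not the sam2counts implementation; the function name `count_reads_per_reference` is hypothetical, and it assumes a plain-text SAM input where the reference name (column 3) is the gene ID. Paired-end mates are counted individually here.

```python
from collections import Counter


def count_reads_per_reference(sam_lines):
    """Count primary mapped alignments per reference name in SAM text.

    `sam_lines` is any iterable of SAM lines (e.g. an open file handle).
    Assumption: each reference sequence corresponds to one gene.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # SAM has 11 mandatory columns
            continue
        flag = int(fields[1])
        rname = fields[2]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records
        if flag & 0x4 or flag & 0x100 or flag & 0x800 or rname == "*":
            continue
        counts[rname] += 1
    return counts
```

Usage would look like `with open("aln.sam") as fh: counts = count_reads_per_reference(fh)`, where `aln.sam` is a placeholder file name.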

Then choose edgeR over DEGseq, because it lets you normalize these raw counts to account for library size and so on.
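For reference, the standard RPKM definition (reads per kilobase of feature per million mapped reads) shows why these values are already normalized for both feature length and library size. The sketch below is only illustrative; the function names `rpkm` and `rpkm_table` are hypothetical, and it assumes the library size is the sum of the counts passed in.

```python
def rpkm(count, length_bp, total_mapped_reads):
    """RPKM = count * 1e9 / (feature length in bp * total mapped reads)."""
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return count * 1e9 / (length_bp * total_mapped_reads)


def rpkm_table(counts, lengths):
    """Convert a {gene: raw count} dict into a {gene: RPKM} dict.

    Assumption: the library size is the sum of all counts in `counts`.
    `lengths` maps each gene to its length in base pairs.
    """
    total = sum(counts.values())
    return {gene: rpkm(c, lengths[gene], total) for gene, c in counts.items()}
```

Because the library size and gene length are divided out, the original counts cannot be recovered from RPKM alone, which is why count-based tools expect the raw counts instead.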

Good luck.

ADD COMMENT • link 13.1 years ago by Urchgene ▴ 30

0

I could not see the option to get the counts for each feature. Can we use a GTF file to get the counts?

ADD REPLY • link 12.9 years ago by Rm 8.3k

0

You are also supposed to normalize the raw counts with DESeq; that is one of the steps the tutorial tells you to do.

ADD REPLY • link 11.8 years ago by John St. John ★ 1.2k

score 2 · Answer 2 · 2011-09-14

2

13.2 years ago

Istvan Albert 101k

Each file type was invented to represent a certain type of information. A FASTA file is meant to store sequences; a GFF file is meant to represent genomic features (intervals). In general there is no way to directly convert between the two.

As daler points out above, if your FASTA file happens to store each gene separately and also lists extra information about the coordinates, then we could give you a parser that generates a GFF from it (post the header). Another option: if you know the gene sequences, you could align them to the genome and thus create your own annotations.
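For the simpler case where the FASTA holds one gene or transcript per record and the reads are aligned directly to those sequences, a GFF-style annotation can be derived by declaring each whole sequence as a single feature. The Python sketch below assumes exactly that; the function name `fasta_to_gff`, the `source` value, and the choice of `exon` / `gene_id` (picked to match what I understand to be htseq-count's default feature type and ID attribute) are assumptions to check against the htseq-count documentation.

```python
def fasta_to_gff(fasta_lines, source="custom", feature_type="exon"):
    """Yield one GFF/GTF-style line per FASTA record, spanning the full sequence.

    `fasta_lines` is any iterable of lines (e.g. an open file handle).
    Assumption: the first word of each header is the gene ID.
    """
    def make_line(name, length):
        attributes = f'gene_id "{name}";'
        return "\t".join(
            [name, source, feature_type, "1", str(length), ".", "+", ".", attributes]
        )

    name, length = None, 0
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:  # flush the previous record
                yield make_line(name, length)
            name = line[1:].split()[0]
            length = 0
        else:
            length += len(line)  # sequences may wrap over several lines
    if name is not None:  # flush the last record
        yield make_line(name, length)
```

A possible use is `with open("reference.fasta") as fh: print("\n".join(fasta_to_gff(fh)))`, redirecting the output to a file that is then passed to the counting tool.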

ADD COMMENT • link 13.2 years ago by Istvan Albert 101k

0

I do know the gene sequences and I've already aligned with novoalign; what I want to do now is get the expression values in RPKM. I can use the uniquely mapped genes as input to DEGseq, but I'd rather use RPKM. Should I use Cufflinks instead of htseq-count to get the RPKM values, so that there is no need for a GFF file?

ADD REPLY • link 13.2 years ago by Nebo ▴ 80