Htseq Count To Get Rpkm Values
13.2 years ago
Nebo ▴ 80

I want to use htseq-count (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) to get gene counts (RPKM) to analyze DE genes in an RNA-seq experiment. I'm NOT interested in alternative splicing, only in getting RPKM values for downstream analyses with DEGseq, an R package. htseq-count requires a GFF file, but I only have my reference.fasta. Is there any way I can use the FASTA file or convert it to GFF?

htseq gene rpkm • 12k views

1) Is there any gene information in the FASTA headers? (Please post an example if so.) 2) What genome are you working with?

The only gene information is the gene ID. This is all I need; I'm not looking for other features such as exons. I'm working with sugarcane, so there is no reference genome. As a reference I use SAS (sugarcane assembled sequences) or the sorghum gene models.

See my recipe here: Deg Analysis On 2 Mirna Library. Since it sounds like you are using a transcript FASTA file, the concept is the same.

13.1 years ago
Urchgene ▴ 30

I do not think it's wise to use RPKM values for DGE in the edgeR or DEGseq R packages, because they are not raw counts but have already been normalized.
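To make "already normalized" concrete, here is a minimal sketch of the usual RPKM definition (reads per kilobase of gene per million mapped reads). The function name and the numbers in the example are made up for illustration; they do not come from any of the tools mentioned in this thread.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2,000 bp gene,
# in a library with 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

Because the raw count is divided by both gene length and library size, the original count can no longer be recovered from the RPKM value alone, which is why count-based packages want the raw numbers.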

If you want to do this, get the script from https://github.com/vsbuffalo/sam2counts and count the raw reads that map to features in a SAM file.
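If the reads were aligned directly to gene or transcript sequences, counting raw reads can be as simple as tallying the reference name of each mapped record. The following is only a rough sketch of that idea in plain Python, not the sam2counts script itself; the function name and the toy records (`geneA`, `geneB`) are hypothetical.

```python
from collections import Counter


def count_reads_per_reference(sam_lines):
    """Count mapped primary alignments per reference name (SAM column 3)."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and header lines (they start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        rname = fields[2]
        # 0x4 = read unmapped; '*' = no reference assigned.
        if flag & 0x4 or rname == "*":
            continue
        # 0x100 = secondary, 0x800 = supplementary: count each read once.
        if flag & 0x100 or flag & 0x800:
            continue
        counts[rname] += 1
    return counts


# Toy SAM content with made-up reads and gene IDs.
toy_sam = [
    "@SQ\tSN:geneA\tLN:1000",
    "r1\t0\tgeneA\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tgeneA\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tgeneB\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r5\t256\tgeneB\t90\t0\t4M\t*\t0\t0\tACGT\tIIII",
]
for gene, n in sorted(count_reads_per_reference(toy_sam).items()):
    print(gene, n)
```

With a real file you would pass an open file handle instead of the list, for example `count_reads_per_reference(open("aligned.sam"))`, where the file name is again just a placeholder.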

Then choose edgeR over DEGseq, because you can normalize these raw counts to account for library size and so on.

Good luck.

I could not see the option to get the counts for each feature. Can we use a GTF file to get the counts?

You are also supposed to normalize the raw counts with DESeq; that is one of the steps described in the tutorial.
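For intuition about what that normalization step does, here is a small sketch of the median-of-ratios idea behind DESeq's size factors, written in plain Python. It is an illustration only, not the DESeq R code; the function name and the toy counts are invented.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene ID -> list of raw counts, one per sample.
    Returns one size factor per sample.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        # Genes with a zero count in any sample have a geometric mean
        # of zero and are left out of the calculation.
        if any(c <= 0 for c in row):
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    return [math.exp(median(r)) for r in log_ratios]


# Toy data: the second sample was sequenced twice as deeply as the first.
toy_counts = {"g1": [10, 20], "g2": [100, 200], "g3": [5, 10]}
factors = size_factors(toy_counts)
normalized = {g: [c / f for c, f in zip(row, factors)]
              for g, row in toy_counts.items()}
print(factors)
print(normalized)
```

Dividing each sample's raw counts by its size factor puts the samples on a common scale, which is the part that RPKM-style preprocessing would otherwise interfere with.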

13.2 years ago

Each file type was invented to represent a certain type of information. A FASTA file is meant to store sequences; a GFF file is meant to represent genomic features (intervals). In general there is no way to directly convert between the two.

As daler points out above, if your FASTA file happens to store each gene separately and also lists extra information about the coordinates, then we could give you a parser that generates a GFF from it (post the header). Another option: if you know the gene sequences, you could align them to the genome and thus create your own annotations.
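One special case where a simple parser is enough: if the reads were aligned directly to the gene or transcript sequences in the FASTA file, each sequence can be treated as its own "chromosome" with a single feature spanning its full length. Below is a minimal sketch under that assumption. The function names and the source label `fasta2gff` are made up; the feature type `exon` and the `gene_id` attribute are chosen because, as far as I know, those are what htseq-count looks for by default, so check them against the htseq-count options you actually use.

```python
def fasta_lengths(lines):
    """Yield (sequence_id, length) for each record in FASTA-formatted lines."""
    seq_id, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, length
            # The ID is the first whitespace-delimited token of the header.
            seq_id = line[1:].split()[0]
            length = 0
        elif seq_id is not None:
            length += len(line)
    if seq_id is not None:
        yield seq_id, length


def fasta_to_gff(lines, source="fasta2gff", feature_type="exon"):
    """Return one GTF-style line per sequence, covering positions 1..length."""
    rows = []
    for seq_id, length in fasta_lengths(lines):
        if length == 0:
            continue  # an empty record cannot form a valid interval
        attributes = f'gene_id "{seq_id}";'
        rows.append("\t".join([
            seq_id, source, feature_type,
            "1", str(length),   # 1-based, inclusive coordinates
            ".", "+", ".",      # score, strand, frame
            attributes,
        ]))
    return rows


# Toy FASTA with hypothetical gene IDs.
toy_fasta = [">geneA some description", "ACGT", "AC", ">geneB", "GGG"]
for row in fasta_to_gff(toy_fasta):
    print(row)
```

This does not apply when the reads were aligned to a genome; in that case the coordinates have to come from a real annotation or from aligning the gene sequences to the genome as described above.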

I do know the gene sequences and I've already aligned with novoalign; what I want to do now is get the expression values in RPKM. I can use the uniquely mapped genes as input to DEGseq, but I'd rather use RPKM. Should I use Cufflinks instead of htseq-count to get the RPKM values, so that there is no need for a GFF file?
