I have aligned my RNAseq reads to gene models, and I want to count the reads and determine RPKM for differential expression. I used Novoalign to align the reads, and I want to use HTseqcount to assign coverage and expression. All of the programs I've found need the reference to be in .gff format. My gene models are in fasta format. Do I need to align them to my reference genome and convert them to gff format, or is there a way to use these or other programs using the fasta reference input?
I think there is a confusion: GFF is an annotation format. So it is OK if your reference sequence is in FASTA. What genome are you using? May be the GFF (or GTF) file is already available
I assume you did an alignment against the genes only, maybe because there is no reference genome available. The multi-fasta file you used as reference will contain sequence ids of your gene models and the sequence. Then each gene or exon gets an entry in the GFF file starting at 1 and ending with it's length.
From that FASTA file it should be easy to generate a GFF whith a little script that looks like so:
That is the most simple case, if your gene models are more complex, containing introns and exons, then you also have to use the Parent= tag. It depends on how your gene-models are encoded in the fasta file, so I cannot tell you how to do this without an example.
The procedure you are using is not optimal imo. I recommend you rather align against the complete reference genome and use an annotation file e.g. from NCBI genomes that is genome based. But in case that isn't available, you can go ahead as described. If you need help with the script, you will have to provide an example from the FASTA file.
Do you have a reference genome, in addition to the gene models? If yes, it's pretty straightforward to use blast or blat to map the genes back to the genome, and convert to gff3 with scripts like blast2gff.pl and blat2gff.pl (google and you'll find them). If you only have gene models, I think you can make a gff3 from the scratch, basically each gene is one feature. You do need to pay attention to the right gff3 specs and specific needs of those programs like HTseqcount. This can be easily done with awk if you're familiar with Unix.
I think there is a confusion: GFF is an annotation format. So it is OK if your reference sequence is in FASTA. What genome are you using? May be the GFF (or GTF) file is already available