Question

Gff Format And Rna Seq Alignment

0

Entering edit mode

13.2 years ago

User 2889 • 0

I have aligned my RNAseq reads to gene models, and I want to count the reads and determine RPKM for differential expression. I used Novoalign to align the reads, and I want to use HTseqcount to assign coverage and expression. All of the programs I've found need the reference to be in .gff format. My gene models are in fasta format. Do I need to align them to my reference genome and convert them to gff format, or is there a way to use these or other programs using the fasta reference input?

rna gff • 6.8k views

ADD COMMENT • link updated 5.1 years ago by Biostar 20 • written 13.2 years ago by User 2889 • 0

1

Entering edit mode

I think there is a confusion: GFF is an annotation format. So it is OK if your reference sequence is in FASTA. What genome are you using? May be the GFF (or GTF) file is already available

ADD REPLY • link 13.2 years ago by Pablo ★ 1.9k

score 1 · Answer 1 · 2011-10-02

I assume you did an alignment against the genes only, maybe because there is no reference genome available. The multi-fasta file you used as reference will contain sequence ids of your gene models and the sequence. Then each gene or exon gets an entry in the GFF file starting at 1 and ending with it's length.

From that FASTA file it should be easy to generate a GFF whith a little script that looks like so:

# Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame>
Gene_1 PredSoftware gene 1 <length of gene1> . + . ID=Gene_1
Gene_2 PredSoftware gene 1 <length of gene2> . - . ID=Gene_2
...

That is the most simple case, if your gene models are more complex, containing introns and exons, then you also have to use the Parent= tag. It depends on how your gene-models are encoded in the fasta file, so I cannot tell you how to do this without an example.

The procedure you are using is not optimal imo. I recommend you rather align against the complete reference genome and use an annotation file e.g. from NCBI genomes that is genome based. But in case that isn't available, you can go ahead as described. If you need help with the script, you will have to provide an example from the FASTA file.

score 0 · Answer 2 · 2011-09-08

Do you have a reference genome, in addition to the gene models? If yes, it's pretty straightforward to use blast or blat to map the genes back to the genome, and convert to gff3 with scripts like blast2gff.pl and blat2gff.pl (google and you'll find them). If you only have gene models, I think you can make a gff3 from the scratch, basically each gene is one feature. You do need to pay attention to the right gff3 specs and specific needs of those programs like HTseqcount. This can be easily done with awk if you're familiar with Unix.