Gff Format And Rna Seq Alignment
2
0
Entering edit mode
13.2 years ago
User 2889 • 0

I have aligned my RNAseq reads to gene models, and I want to count the reads and determine RPKM for differential expression. I used Novoalign to align the reads, and I want to use HTseqcount to assign coverage and expression. All of the programs I've found need the reference to be in .gff format. My gene models are in fasta format. Do I need to align them to my reference genome and convert them to gff format, or is there a way to use these or other programs using the fasta reference input?

rna gff • 6.8k views
ADD COMMENT
1
Entering edit mode

I think there is a confusion: GFF is an annotation format. So it is OK if your reference sequence is in FASTA. What genome are you using? May be the GFF (or GTF) file is already available

ADD REPLY
1
Entering edit mode
13.2 years ago
Michael 55k

I assume you did an alignment against the genes only, maybe because there is no reference genome available. The multi-fasta file you used as reference will contain sequence ids of your gene models and the sequence. Then each gene or exon gets an entry in the GFF file starting at 1 and ending with it's length.

From that FASTA file it should be easy to generate a GFF whith a little script that looks like so:

# Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame>
Gene_1 PredSoftware gene 1 <length of gene1> . + . ID=Gene_1
Gene_2 PredSoftware gene 1 <length of gene2> . - . ID=Gene_2
...

That is the most simple case, if your gene models are more complex, containing introns and exons, then you also have to use the Parent= tag. It depends on how your gene-models are encoded in the fasta file, so I cannot tell you how to do this without an example.

The procedure you are using is not optimal imo. I recommend you rather align against the complete reference genome and use an annotation file e.g. from NCBI genomes that is genome based. But in case that isn't available, you can go ahead as described. If you need help with the script, you will have to provide an example from the FASTA file.

ADD COMMENT
0
Entering edit mode
13.2 years ago
Vitis ★ 2.6k

Do you have a reference genome, in addition to the gene models? If yes, it's pretty straightforward to use blast or blat to map the genes back to the genome, and convert to gff3 with scripts like blast2gff.pl and blat2gff.pl (google and you'll find them). If you only have gene models, I think you can make a gff3 from the scratch, basically each gene is one feature. You do need to pay attention to the right gff3 specs and specific needs of those programs like HTseqcount. This can be easily done with awk if you're familiar with Unix.

ADD COMMENT

Login before adding your answer.

Traffic: 1595 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6