Translate Contigs With Gff-Formatted Gene Predictions
13.3 years ago

Hello everyone,

This is my first post and I'm hoping I can get some good input. I've read this forum for a while and have been impressed with everyone's knowledge and professionalism.

I have a genome assembly and I'm trying to predict the number of genes in it. I have multiple sources of evidence, including BLAT alignments of unigenes from a closely related species and ab initio predictions from GenemarkHMM, Genscan, and Fgenesh+. All of these predictions are in GFF3 format. I'm currently wrapping up combining them into a "final" gene set.

The next thing I want to do is take these gene predictions and start doing comparative genomics with them. One of the things I'd like to do is identify the proteins likely to be secreted (the secretome). To do this, I need to translate the gene predictions. Is there software that is commonly used for this? Or will I have to write something myself in Perl?

Thanks in advance,

Wyatt

gene gff translation • 5.9k views

Are the DNA sequences contained in the gff file?


As Michael says, you've told us everything except the most important thing: what format your "gene predictions" are in. If you have some form of nucleotide sequences, translation to protein could not be easier, but we need to know precisely what you have.

13.3 years ago
Michael 55k

Most tools require a FASTA file containing the DNA or amino acid sequences as input. To start with, we are going to create a FASTA file containing the DNA sequence of the regions annotated in the GFF3 file ("chr1.gff"), given a FASTA file of the genome ("chr1.fna"). One would expect there to be many tools that take a GFF file and a FASTA file and give you the sequences, but I haven't found a script that worked, though I'm not claiming there is none. Maybe the reason is that it is quite a simple task. One could also use BioPerl and work with the Bio::Tools::GFF module, but in this example I use R/Bioconductor, which is maybe a bit slower but requires much less code. If you don't like this example, I have others. You need the packages rtracklayer and Biostrings installed.

Annotation software like Artemis might also be able to do this.

Extracting regions into a FASTA file in R

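 library(rtracklayer)
 library(Biostrings)
 # Read the annotation and the genome sequence
 gff = import.gff3("chr1.gff")
 fasta = readDNAStringSet("chr1.fna")
 # Cut each annotated region out of the first sequence in the FASTA file
 myview = Views(fasta[[1]], start=start(gff), end=end(gff),
                names=make.names(gff$ID, unique=TRUE))
 # Convert the views to a DNAStringSet, then reverse-complement minus-strand features
 seqs = DNAStringSet(myview)
 minus = as.vector(strand(gff) == "-")
 seqs[minus] = reverseComplement(seqs[minus])
 writeXStringSet(seqs, filepath="out.fas")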

One important point is that you need good names in your FASTA file, so make sure your ID field contains something meaningful, or build the names from more fields of the GFF file.
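For readers without R, the same extraction step can be sketched in plain Python. This is only an illustration of the idea above, not a replacement for the Bioconductor code: the function names (`read_fasta`, `extract_features`) are made up for this example, it assumes tab-separated GFF3 with 9 columns and 1-based inclusive coordinates, and it falls back to a `seqid:start-end` name when a feature has no ID attribute.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N in either case)."""
    comp = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(comp)[::-1]

def read_fasta(lines):
    """Parse FASTA text lines into {sequence_id: sequence}."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Use the first word of the header as the sequence ID
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs

def extract_features(gff_lines, genome):
    """Yield (feature_id, sequence) for every 9-column GFF3 feature line."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        # GFF3 coordinates are 1-based and inclusive
        seq = genome[seqid][start - 1:end]
        if strand == "-":
            seq = reverse_complement(seq)
        yield attrs.get("ID", f"{seqid}:{start}-{end}"), seq
```

With real files one would pass open file handles, for example `extract_features(open("chr1.gff"), read_fasta(open("chr1.fna")))`, and write each pair out as a `>name` line followed by the sequence.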

Translating sequences

You can use transeq from the EMBOSS tools to translate in all 6 reading frames, for example if you want to search Pfam for functional domains in your translated sequences. Some tools (like BLASTX) do the translation themselves, so you won't need to translate first. For BLAST and other software that handles translation, it is better to let that software do this step.

Why do I recommend translating in all frames? Because your annotation of gene starts and exon/intron boundaries might not be perfect yet, which can put you in the wrong reading frame. It is therefore safer to check all 3 forward frames, or all 6 including the reverse strand, which transeq does with -frame=6 (its default is frame 1 only). The extra frames won't hurt.
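To make the idea of six-frame translation concrete, here is a minimal Python sketch. It is not how transeq is implemented; it assumes the standard genetic code (NCBI translation table 1), writes stop codons as `*` and codons containing anything other than A, C, G, T as `X`, and the frame labels `+1` to `-3` are just a naming choice for this example.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Return {frame_label: protein} for the 3 forward and 3 reverse frames."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

For a predicted gene whose start and phase are trusted, only frame `+1` of the extracted (already strand-corrected) sequence is needed. The other frames are there to catch cases where the annotation is off.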

Hope this helps.


Hi Michael,

I am also trying to turn a GFF3 annotation file into a gene set in FASTA format (using the genome). I was so happy to find your R script, but when I run it I get the following error:

Error in .Primitive("c")(<S4 object of class "XStringViews">, <S4 object of class "XStringViews">, : missing 'c' method for Vector class XStringViews

Do you know what it is about or how I could solve it? Thank you so much in advance!


Hard to say what's wrong here. First, at which point does the error occur? Also, please give the output of traceback() and sessionInfo(). If you also make your files available, we have a reproducible case. Maybe the error also occurs with a shortened version of your GFF file?

13.3 years ago

Thanks so much, guys!

I apologize for not giving you all the information. I have the predictions in GFF format, and the coordinates refer to FASTA-formatted sequence files that I have. This script of Michael's should work beautifully! Thanks so much!!

Wyatt
