Hello,
I've done a rather large transcriptome experiment with a non-model organism. Although it's not published yet, I did have access to a reference genome and gff annotation file. I'm at the point where I have lists of DE genes for different contrasts and want to find out what these genes are/do. Because the genome isn't published, I can't just use the gene IDs in BLAST. Instead I have to use them to find the protein sequences they correspond to in the gff file, make a fasta file with this sequence info, and then BLAST it. Across all the contrasts I've done I probably have upwards of 5,000 genes to BLAST, and I'm wondering what the most efficient way to do this is.
Because the protein sequence info in the gff file isn't really listed in its own cell, I can't figure out a way to just pull out the sequences for the genes I want. The best I can do is convert the gff into a text file and use FIND to locate the gene IDs, then cut/paste the sequences as I make the fasta file that will eventually get BLASTed. With this many genes, it will take a very long time to do this, and I just want to know if there's an easier way, ideally without having to do any coding (which I unfortunately am not proficient in). I've tried to use R to pull the rows I want, but like I said, the protein sequences don't even show up when I convert the gff file to a table. I'll do it by hand if I have to, but I just want to make sure I'm not overlooking a faster way.
thanks, carrie
It would help if you could show a few lines of the gff file.
Sure. Here's a link to a screenshot of the gff file in Galaxy. http://imgur.com/TQu4qZK
I'm not sure I understand you completely. Is this a reference based transcriptome assembly (e.g. cufflinks)?
The GFF shouldn't have sequences in it. Typically they're used to mark features of genes in a genome (UTRs, exons, etc). http://www.sequenceontology.org/gff3.shtml
The GFF will have the locations of features in your genome. You'll have to use the transcript or CDS features to extract the transcript sequences from the genome, and from there you'll have to predict your protein sequences from the transcripts. There are several scripts posted on here and other places for extracting sequences from a GFF. For the peptide predictions I would use TransDecoder.
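To make the extraction step concrete, here is a minimal Python sketch of splicing CDS segments out of a genome FASTA using GFF coordinates. It is not one of the posted scripts; the function names `read_fasta` and `cds_sequences` are made up for illustration, and it assumes GFF3-style `Parent=` attributes on `CDS` rows, with all CDS segments of a transcript on the same strand. Check it against a few known genes before trusting it on your annotation.

```python
from collections import defaultdict

# Translation table used for reverse-complementing minus-strand features.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(lines):
    """Parse FASTA lines into {id: sequence}; the id is the first word of the header."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {key: "".join(parts) for key, parts in seqs.items()}


def cds_sequences(gff_lines, genome):
    """Return {parent_id: spliced CDS sequence} from GFF3 lines and a genome dict."""
    segments = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Column 9 holds key=value pairs separated by semicolons.
        attrs = dict(
            field.strip().split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                segments[parent].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))

    spliced = {}
    for parent, segs in segments.items():
        segs.sort(key=lambda seg: seg[1])  # order segments by start coordinate
        # GFF coordinates are 1-based and inclusive, hence start - 1.
        seq = "".join(genome[chrom][start - 1:end] for chrom, start, end, _ in segs)
        if segs[0][3] == "-":
            seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
        spliced[parent] = seq
    return spliced
```

You would call `read_fasta` on the open genome file and pass the result, together with the open GFF file, to `cds_sequences`. The output is nucleotide CDS, so it would still need translating (or you could BLAST it with blastx).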
I used TopHat with a reference genome. It's just that the genome is very new and the lab that provided me with it hasn't published it yet. I guess my issue is, now that I have done my DE analysis and have lists of DE gene IDs, how do I assemble the sequences that correspond to these IDs into a fasta file to BLAST them (without having to build the fasta files by hand, e.g. cutting/pasting the sequences one by one for each DE gene)?
thanks, carrie
There's a tool distributed with Cufflinks (from the same group as TopHat) just for this purpose:
http://cole-trapnell-lab.github.io/cufflinks/file_formats/
See the gffread utility. Not sure how it will treat the protein sequences in the GFF file. Where did these protein sequences come from? Did the lab you got the genome from provide you with annotations for the genome?
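If coding is the sticking point, the whole job could be roughly two steps: let gffread write a protein fasta for every annotated transcript, then keep only the records whose IDs are in your DE list. Below is a small Python sketch of that, under some assumptions: the file names (`annotation.gff`, `genome.fa`, `proteins.fa`, `de_gene_ids.txt`, `de_proteins.fa`) are placeholders, the helper names are made up, and the IDs in your DE list must match the first word of the fasta headers gffread writes (these are usually transcript IDs, so gene IDs may need mapping first).

```python
import subprocess


def gffread_protein_command(gff_path, genome_path, out_path):
    """Build a gffread call: -g gives the genome FASTA, -y writes translated CDS."""
    return ["gffread", "-y", out_path, "-g", genome_path, gff_path]


def subset_fasta(fasta_lines, wanted_ids):
    """Yield only the FASTA records whose ID (first word of the header) is wanted."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in wanted_ids
        if keep:
            yield line


def main():
    # Placeholder file names; replace them with your own paths.
    subprocess.run(
        gffread_protein_command("annotation.gff", "genome.fa", "proteins.fa"),
        check=True,
    )
    with open("de_gene_ids.txt") as handle:
        wanted = {row.strip() for row in handle if row.strip()}
    with open("proteins.fa") as src, open("de_proteins.fa", "w") as dst:
        dst.writelines(subset_fasta(src, wanted))
```

Nothing runs until you call `main()`, and that call needs gffread installed and on your PATH. The resulting `de_proteins.fa` is what you would hand to BLAST. If the IDs don't match, compare one line of your DE list with one fasta header to see how they differ.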