I have a .gff3 file for an insect genome, which looks like the following:
> Scaffold1 I5K gene 30267 30861 . + . ID=OFAS000001;Name=OfasTmpM000001;
> Scaffold1 I5K mRNA 30267 30861 . + . ID=OFAS000001-RA;Name=OfasTmpM000001-RA;Parent=OFAS000001;
> Scaffold1 I5K exon 30267 30428 . + . ID=OFAS000001-RA-EXON01;Parent=OFAS000001-RA;
> Scaffold1 I5K exon 30763 30861 . + . ID=OFAS000001-RA-EXON02;Parent=OFAS000001-RA;
> Scaffold1 I5K CDS 30267 30428 . + 0 ID=OFAS000001-RA-CDS01;Parent=OFAS000001-RA;
> Scaffold1 I5K CDS 30763 30861 . + 0 ID=OFAS000001-RA-CDS02;Parent=OFAS000001-RA;
> Scaffold1 I5K gene 401906 407738 . + . ID=OFAS000002;Name=OfasTmpA000002;
> Scaffold1 I5K mRNA 401906 407738 . + . ID=OFAS000002-RA;Name=OfasTmpA000002-RA;Parent=OFAS000002;Dbxref=Coils:Coil;
> Scaffold1 I5K exon 401906 401926 . + . ID=OFAS000002-RA-EXON01;Parent=OFAS000002-RA;
> Scaffold1 I5K exon 402939 407738 . + . ID=OFAS000002-RA-EXON02;Parent=OFAS000002-RA;
I have the corresponding genome and CDS fasta files. What I need from all of this is the sequence of each exon in one fasta file. I only want individual exons, nothing else. I've seen this: Extract Cds Fastas From A Gff Annotation + Reference Sequence which seeks to extract CDS sequences only, but I have no clue how to adapt this for my purpose (I'm not skilled in scripting). I've also searched for some scripts for my purpose, but everything I've seen deals with obtaining CDS sequences from .gff files. Any help would be most appreciated.
Here is an idea of how to do this: Extract the lines (which say exon) and then extract the Scaffold name, start, stop positions from those lines in a BED file. Then you can use
getfasta
from BEDTools to get the sequence of the exons.If someone does not provide code for you by later today I will take a look.
Edit: @igor and I posted basically the same idea at the same time. You can follow his since he has the code in there.
Thank y'all!
How can I also get the ID names included in the sequence headings? I stumbled upon Galaxy to get exon sequences after filtering exons using Excel sorting function. However, the Galaxy output headings for the sequences read as "Scaffold[some number]_[start position]_[end position]_[strand direction]", for example:
While this is great, having the ID of the corresponding sequence would be more helpful. This is what my .gff3 file now looks like:
bedtools getfasta
has a-name
flag you can add to "use the “name” column in the BED file for the FASTA headers in the output FASTA file". I am not sure what it will do in your case, but it's worth a try.Bedtools worked but took some reformatting of my gff file for it to work, which Excel is very useful to do!