Question

Extract Cds Fastas From A Gff Annotation + Reference Sequence

19

Entering edit mode

12.5 years ago

Federico Giorgi ▴ 740

Dear all,

my task is very simple, yet I was wondering if some biop* library already exists to carry it out. I have a gff annotation file, e.g.

chr10    JIGSAWGAZE    gene    17991    30631    .    +    .    ID=VIT_10s0116g00040
chr10    JIGSAWGAZE    mRNA    17991    30631    .    +    .    ID=VIT_10s0116g00040.t01;Parent=VIT_10s0116g00040
chr10    JIGSAWGAZE    UTR    17991    18031    .    +    .    Parent=VIT_10s0116g00040.t01
chr10    JIGSAWGAZE    CDS    18032    18049    .    +    .    Parent=VIT_10s0116g00040.t01
chr10    JIGSAWGAZE    CDS    18300    18378    .    +    .    Parent=VIT_10s0116g00040.t01
chr10    JIGSAWGAZE    CDS    30600    30631    .    +    .    Parent=VIT_10s0116g00040.t01
chr10    JIGSAWGAZE    gene    31111    32814    .    -    .    ID=VIT_10s0116g00060
chr10    JIGSAWGAZE    mRNA    31111    32814    .    -    .    ID=VIT_10s0116g00060.t01;Parent=VIT_10s0116g00060
chr10    JIGSAWGAZE    UTR    31111    31290    .    -    .    Parent=VIT_10s0116g00060.t01
chr10    JIGSAWGAZE    CDS    31291    31835    .    -    .    Parent=VIT_10s0116g00060.t01
chr10    JIGSAWGAZE    CDS    31897    31968    .    -    .    Parent=VIT_10s0116g00060.t01
chr10    JIGSAWGAZE    CDS    32106    32814    .    -    .    Parent=VIT_10s0116g00060.t01

And a genome fasta, containing the chromosome sequences. I need to extract the full length, intron-free CDSs, one per gene. The header of the fasta should be the gene id itself. Is there an automated script that does this? Thank you! :-)

Edit: this is the best solution so far:

gff2fasta.pl

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::Fasta;

$| = 1;    # Flush output
my $outfile_cds = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].cds.fasta" );
my $outfile_pep = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].pep.fasta" );
my $outfile_cdna = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].cdna.fasta" );
my $outfile_gene = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].gene.fasta" );
my $outfile_upstream3000 = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].upstream3000.fasta" );

###### Output type description ######
# cds - translated sequence (starting with ATG and ending with a stop codon included)
# cdna - transcribed sequence (devoid of introns, but containing untranslated exons)
# protein - cds translated (includes a * as the stop codon)
# gene - the entire gene sequence (including UTRs and introns)
# upstream3000 - the 3000 upstream region of the gene (likely including the promoter)

### First, index the genome
my $file_fasta = $ARGV[0];
my $db = Bio::DB::Fasta->new($file_fasta);
print ("Genome fasta parsed\n");



### Second, parse the GFF3
my %CDS;
my %CDNA;
my $mRNA_name;
my $frame;
open GFF, "<$ARGV[1]" or die $!;
while ( my $line = <GFF> ) {
    chomp $line;
    my @array = split( "\t", $line );
    my $type = $array[2];

    if ($type eq 'gene') {
        my @attrs = split( ";", $array[8] );
        $attrs[0] =~ s/ID=//;
        my $gene_name = $attrs[0];
        my $gene_start = $array[3];
        my $gene_end = $array[4];
        my $gene_seq = $db->seq( $array[0], $gene_start, $gene_end );
        my $output_gene = Bio::Seq->new(
            -seq        => $gene_seq,
            -id         => $gene_name,
            -display_id => $gene_name,
            -alphabet   => 'dna',
        );


        # The upstream 3000
        my $upstream_start;
        my $upstream_end;
        if($array[6] eq '+') {
            $upstream_start=$gene_start-3000;
            $upstream_end=$gene_start-1;
        }
        elsif ($array[6] eq '-') {
            $upstream_start=$gene_end+1;
            $upstream_end=$gene_end+3000;
        }
        my $upstream_seq = $db->seq( $array[0], $upstream_start, $upstream_end );
        my $output_upstream3000 = Bio::Seq->new(
            -seq        => $upstream_seq,
            -id         => $gene_name."_upstream3000",
            -display_id => $gene_name."_upstream3000",
            -alphabet   => 'dna',
        );

        # Reverse Complement if the frame is minus
        if($array[6] eq '+') {
        }
        elsif ($array[6] eq '-') {
            $output_gene = $output_gene->revcom();
            $output_upstream3000 = $output_upstream3000->revcom();
        }
        else {
            die "Unknown frame! At line $. of the GFF\n";
        }
        $outfile_gene->write_seq($output_gene);
        $outfile_upstream3000->write_seq($output_upstream3000);
    }


    if ( ( $type eq 'mRNA' ) and ( $. > 2 ) ) {
        # CDS: Collect CDSs and extract sequence of the previous mRNA
        my $mergedCDS_seq;
        # WARNING we must sort by $cds_coord[1]


        foreach my $key (sort {$a <=> $b} keys %CDS) { # Ascending numeric sort of the starting coordinate
            my $coord = $CDS{$key};
            my @cds_coord = split( " ", $coord );
            my $cds_seq = $db->seq( $cds_coord[0], $cds_coord[1], $cds_coord[2] );
            $mergedCDS_seq .= $cds_seq;
        }

        my $output_cds = Bio::Seq->new(
            -seq        => $mergedCDS_seq,
            -id         => $mRNA_name,
            -display_id => $mRNA_name,
            -alphabet   => 'dna',
        );
        if ($frame eq '-') {
            $output_cds = $output_cds->revcom();
        }
        my $output_pep = $output_cds->translate();
        $outfile_cds->write_seq($output_cds);
        $outfile_pep->write_seq($output_pep);


        # CDNA: Collect UTRs and CDSs and extract sequence of the previous mRNA
        my $mergedCDNA_seq;
        foreach my $key (sort {$a <=> $b} keys %CDNA) { # Ascending numeric sort of the starting coordinate
            my $coord = $CDNA{$key};
            my @cds_coord = split( " ", $coord );
            my $cds_seq = $db->seq( $cds_coord[0], $cds_coord[1], $cds_coord[2] );
            $mergedCDNA_seq .= $cds_seq;
        }

        my $output_cdna = Bio::Seq->new(
            -seq        => $mergedCDNA_seq,
            -id         => $mRNA_name,
            -display_id => $mRNA_name,
            -alphabet   => 'dna',
        );
        if ($frame eq '-') {
            $output_cdna = $output_cdna->revcom();
        }
        $outfile_cdna->write_seq($output_cdna);



        # Now initialize the next mRNA
        my @attrs = split( ";", $array[8] );
        $attrs[0] =~ s/ID=//;
        $mRNA_name = $attrs[0];
        $frame=$array[6];
        %CDS = (); %CDNA = (); # Empty the chunk arrays
    }
    elsif ( $type eq 'mRNA' ) {    # First mRNA
        my @attrs = split( ";", $array[8] );
        $attrs[0] =~ s/ID=//;
        $mRNA_name = $attrs[0];
        $frame=$array[6];
    }
    elsif ( $type eq 'CDS' ) {
        my $cds_coord = $array[0] . " " . $array[3] . " " . $array[4];
        $CDS{$array[3]}=$cds_coord;
        $CDNA{$array[3]}=$cds_coord;
    }
    elsif ($type eq 'UTR' ) {
        my $utr_coord = $array[0] . " " . $array[3] . " " . $array[4];
        $CDNA{$array[3]}=$utr_coord;
    }

}

close GFF;

Federico

gff CDS • 43k views

ADD COMMENT • link updated 21 months ago by Ram 44k • written 12.5 years ago by Federico Giorgi ▴ 740

0

Entering edit mode

The script is good, but needs adjustment when the 'phase' (8-th column of GFF3) is non-zero. http://www.sequenceontology.org/gff3.shtml

I found a script gff3_file_to_proteins.pl in the EVidenceModeler package.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.8 years ago by fumihiko • 0

0

Entering edit mode

I just want to ask a question, I have the same task as yours, but I can not use Bioperl, so if you see this comment, could you tell me how to deal with this task?

ADD REPLY • link updated 21 months ago by Ram 44k • written 9.6 years ago by lili3122300 • 0

0

Entering edit mode

Re-post this as a comment to original question and use bedtools getfasta.

ADD REPLY • link updated 21 months ago by Ram 44k • written 9.6 years ago by PoGibas 5.1k

0

Entering edit mode

The gff2fasta.pl has a bug: the last gene record in gff3 file won't be output. Because it doesn't have the next gene.

ADD REPLY • link 9.2 years ago by kevinchjp ▴ 10

0

Entering edit mode

I did a workaround by adding a fake gene entry on the bottom of the gff3 file. Is there some coding option to solve this, too? I'm not very familiar to perl.

ADD REPLY • link 3.8 years ago by pascal ▴ 10

0

Entering edit mode

This is great. I have created a github version of this gff2fasta.pl script here and made some modifications to write only sequences that have non-zero length. I will be adding an output for exons as well.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by severin • 0

Ram · Answer 1 · 2012-06-05

BioPerl's Bio::DB::SeqFeatureStore module can parse the GFF into hierarchical gene models, provide random access to the fasta (which is indexed on demand).

What remains is to write a function to splice an mRNA's CDS.

This yields a substantially shorter script, one whose logic has (hopefully) been tested/proved during its development, and that can take an Bio::DB::SeqFeatureStore compatible format in addition to gff and fasta.

The new script, called like this (I downloaded chromosome 10 of grape genome for testing against your GFF)

gff2cdsfa.pl -fasta t/VV10.gdna -gff t/t1.gff

prints output like:

>VIT_10s0116g00040.t01
ATGCCGTTTTCGTTTTTGATTGGGGTTGGAATTTTGCTCATGAATTGGTTTGACATTTGG
AGACAGACGTTGAATAAAGCTTGGGAATCTGATCTTGGATTGTTGCATCTTCAGGACATC
TTATCTTAAATGCCGTTTTCGTTTTTG
>VIT_10s0116g00060.t01
TAATAAATCGATATTCAAAATCATTCAATTCGAGATCTCTCGCTCTATCAAAGCATCGGA
TTTCTAAATTCTCATACGGCCCTCCCGGAATTCCTTTCAGAGCCTGTTGAATAATTTTTA
TAAATGCCATCATTTCACCAATTCGTACTAAATAACGAGATAATGAATCTCCTTCTTTTT
GCCATTGGACTTCCCAATCAAATTCGTCATAACACTCATAATGATCAACTTTAAGAAGAT
CCCATTGTATTCCGGAAGCTTGTAGCATTGGTCCTGACAAACCCCAATTTATTGCTTCTT
CTACACCAATAATGCCTACTCCCTCAACTCGTTCTAAAAAAATAGGATTACGCATAATCA
GTTTTTGATATTCAGCAACCCCTGTTAAAAAATAATCGCAGAAATCCAAACATTTATCTA
TCCATCCATGCGGTAGATCAACAACTACTCCTCCTATACGAAAATAATTATGCATCATTC
TCATACCGGTGGTAGCTTCGAATAGATCATATACTAACTCTCTTTCTCTGAAAATATAGA
AGAAAGGAGTCTGTACACCAATATCTGCCATAAAAGGGCCAAGCCATAATAAATGAGAAG
CTATACGACTCAACTCCAACATAATCACTCTAATATAGCTGGCTCTTTTAGGTACTTGAA
TATTTCCCAATTGCTCTGGTGCATTTGCGGTTATTGCTTCTGTGAACATCTAGTATCGTC
ATAATGTCAGCCAATTTCATTCTTTTAACTAACTGAGGAAGAATTTGCAAATTGATAAAA
CTCAATGAATCGCATGTAGAGATATTGATAACATGTATAAAGTTAATGGTATTTCATAAC
TAATTGATTGAGCAGCAACTCGTAGACCACCTAAAAAGGAATATTTATTATTTGATCCAT
ATCCTAACATAAGAAATCCAATGGGAGCAATACATGAAATGGCGATCCATAAAAAAACAC
CGATATTGAGATCGGATAGAACAAGGTTAGAGCCAAAAGGAATTATTAAATAACTTAGTA
GAATTGATATGACTGCTATGGATGGTCCGATACTGAATAAATGAGTATCTCCTCTAGATG
GAAGAAGATTCTCTTTGAAAAGTAGTTTTGTGCCATCTGCTAGAGCTTGAAGAATTCCCA
AAGGACCGGCATATTCAGGTCCAATACGTTGTTGTATCCCTGCGGATATTTCTCTTTCTA
ACCACACAATTGATAGTACACCTATTATGATTCCCAATACAGGAGTAAAAATAGGTACAA
GCATCCATATGATCCCATAAACCTCTTTTAAGTATTCCAATTGGGAAAAAGAATTGATAG
CTTGTA
`

Here's the code:

#!/usr/bin/env perl
# PURPOSE: for each mRNA declared in -gff, write to STDOUT its spliced CDS sequence taken from -fasta
# USAGE: gff2cdsfa.pl -fasta t/VV10.fa -gff t/t1.gff
# AUTHOR: malcolm.cook@gmail.com

use strict;
use warnings;

use Bio::SeqIO;
use Bio::DB::SeqFeature::Store;
use Bio::SeqFeatureI;
sub Bio::SeqFeatureI::spliced_cds {
    ## PURPOSE: for a (presumed) mRNA SeqFeature, create new Bio::Seq
    ## holding its spliced CDS, reverse complemented as needed, ided
    ## after the display_id, if available, else the load_id.
    my ($self,$db) = @_;
    my @CDS = $self->CDS;
    @CDS=sort {$a cmp $b} @CDS;    # don't depend upon input GFF lines being sorted 
    my $seqstr;
    $seqstr .= $_->seq->seq for @CDS;    
    my $seq = Bio::Seq->new( -id => $self->display_id || $self->load_id,
                 -seq => $seqstr);
    $self->strand == -1 ? $seq->revcom : $seq;
}

my %ARGV=@ARGV;
$ARGV{-adaptor}||='memory';
my $db = Bio::DB::SeqFeature::Store->new(%ARGV) or die $!;

my $ss = $db->get_seq_stream(-type =>  'mRNA');
my $seqout = Bio::SeqIO->new(-format=>'fasta');
while (my $mRNA = $ss->next_seq) {
    $seqout->write_seq($mRNA->spliced_cds)
}

0;

score 6 · Answer 2 · 2012-06-04

And here's the answer!

Usage:

perl gff2fasta.pl file.fasta file.gff3 output

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::Fasta;

$| = 1;    # Flush output
my $outfile_cds = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].cds.fasta" );
my $outfile_pep = Bio::SeqIO->new( -format => 'fasta', -file => ">$ARGV[2].pep.fasta" );



### First, index the genome
my $file_fasta = $ARGV[0];
my $db                   = Bio::DB::Fasta->new($file_fasta);
print ("Genome fasta parsed\n");



### Second, parse the GFF3
my @mRNA;
my $mRNA_name;
my $frame;
open GFF, "<$ARGV[1]" or die $!;
while ( my $line = <GFF> ) {
    chomp $line;
    my @array = split( "\t", $line );
    my $type = $array[2];
    next if ( $type eq 'gene' or $type eq 'UTR' );

    if ( ( $type eq 'mRNA' ) and ( $. > 2 ) ) {
        # Collect CDSs and extract sequence of the previous mRNA
        my $mRNA_seq;
        foreach my $coord (@mRNA) {
            my @cds_coord = split( " ", $coord );
            my $cds_seq = $db->seq( $cds_coord[0], $cds_coord[1], $cds_coord[2] );
            $mRNA_seq .= $cds_seq;
        }

        my $output_nucleotide = Bio::Seq->new(
            -seq        => $mRNA_seq,
            -id         => $mRNA_name,
            -display_id => $mRNA_name,
            -alphabet   => 'dna',
        );
        if ($frame eq '-') {
            $output_nucleotide = $output_nucleotide->revcom();
        }
        my $output_protein = $output_nucleotide->translate();
        $outfile_cds->write_seq($output_nucleotide);
        $outfile_pep->write_seq($output_protein);

        # Now initialize the next mRNA
        my @attrs = split( ";", $array[8] );
        $attrs[0] =~ s/ID=//;
        $mRNA_name = $attrs[0];
        $frame=$array[6];
        @mRNA = (); # Empty the mRNA
    }
    elsif ( $type eq 'mRNA' ) {    # First mRNA
        my @attrs = split( ";", $array[8] );
        $attrs[0] =~ s/ID=//;
        $mRNA_name = $attrs[0];
        $frame=$array[6];
    }
    elsif ( $type eq 'CDS' ) {
        my $cds_coord = $array[0] . " " . $array[3] . " " . $array[4];
        push( @mRNA, $cds_coord );
    }
}

close GFF;

EDIT: somehow the forum parser changes the code. I mean < GFF > at line 26, not [HTML] of course.

score 4 · Answer 3 · 2012-06-08

I hoped to squeeze this problem into only a few lines of code using Bio::DB::SeqFeature::Store and Bio::DB::SeqFeature:

use Bio::DB::SeqFeature::Store;
my $db = Bio::DB::SeqFeature::Store->new(-adaptor => 'memory', -gff => '/dir/containing/both/gff/and/fasta/');
foreach my $tr ($db->get_features_by_type('mRNA')) { 
    print ">",$tr->load_id,"\n",$tr->spliced_seq->seq,"\n"; 
}

Produces the following output:

------------- EXCEPTION: Bio::Root::NotImplemented -------------
MSG: Abstract method "Bio::SeqFeatureI::entire_seq" is not implemented by package Bio::DB::SeqFeature.
This is not your fault - author of Bio::DB::SeqFeature should be blamed!

STACK: Error::throw
STACK: Bio::Root::Root::throw /Bio/Root/Root.pm:472
STACK: Bio::Root::RootI::throw_not_implemented /Bio/Root/RootI.pm:748
STACK: Bio::SeqFeatureI::entire_seq /Bio/SeqFeatureI.pm:324
STACK: Bio::SeqFeatureI::spliced_seq /Bio/SeqFeatureI.pm:457
----------------------------------------------------------------

So unfortunately Bio::DB::SeqFeature::spliced_seq() was never properly implemented. A BioPerl developer discussion around this issue between Chris Fields and Jason Stajich dates back two years ago and can be found here:

http://bioperl.org/pipermail/bioperl-l/2010-March/032608.html

score 4 · Answer 4 · 2013-11-28

Bumping an old thread because I just had the same problem.

The reasons the grape sequences didn't look right in Malcolm's example is the CDS segments weren't being sorted properly, resulting in a concatenation of incorrectly ordered CDS segments.

I made my own fasta and gff for testing. Gff file:

Test    blat    gene    1    280    1.0000    +    .    ID=Test0001;Name=Test0001
Test    blat    mRNA    79    235    1.0000    +    .    ID=Test0001;Name=Test0001
Test    blat    CDS    79    96    100    +    .    Parent=Test0001
Test    blat    CDS    118    134    100    +    .    Parent=Test0001
Test    blat    CDS    163    188    100    +    .    Parent=Test0001
Test    blat    CDS    218    235    100    +    .    Parent=Test0001
Test    blat    gene    1    280    1.0000    -    .    ID=Test0002;Name=Test0002
Test    blat    mRNA    79    235    1.0000    -    .    ID=Test0002;Name=Test0002
Test    blat    CDS    79    96    100    -    .    Parent=Test0002
Test    blat    CDS    118    134    100    -    .    Parent=Test0002
Test    blat    CDS    163    188    100    -    .    Parent=Test0002
Test    blat    CDS    218    235    100    -    .    Parent=Test0002

Fasta:

>Test
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

I cannibalized Malcolm's script. The result is a bit messy, but I found this worked better on my test pieces:

#!/usr/bin/perl -w
# PURPOSE: for each gene declared in -gff, write its spliced CDS sequence taken from -fasta
# USAGE: gff2cdsv2.pl file.fa file.gff

use strict;
use warnings;

use File::Basename;
use Bio::SeqIO;
use Bio::Seq;
use Bio::DB::SeqFeature::Store;
use Bio::SeqFeatureI;

my $fasta = $ARGV[0];
my $outfile = basename($fasta, ".fa") . "_cds.fa"; #formats name of infile for outfile

my $gff = $ARGV[1];

open (OUT, ">$outfile") or die "Can't open file for writing: $!"; 

my $db = Bio::DB::SeqFeature::Store->new(    -adaptor => 'memory',
                                            -fasta => $fasta,
                                            -gff => $gff) or die $!; # This is where gff and fasta are loaded

my $ss = $db->get_seq_stream(-type => 'gene'); #looks at gff, find genes and their associated information

while (my $gene = $ss->next_seq) #while looking at each gene
{
    print OUT ">" , $gene -> display_name , "\n"; #fasta id for CDS

    #print OUT $gene -> start , "\t", $gene -> end , "\t", $gene -> strand, "\n"; #test to see if script was reading gff correctly

    my @segments = $gene -> segments(-type => 'CDS');

    my %START;
    my %END;
    foreach my $segment(@segments)
    {
        $START{$segment} = $segment -> start; #create hash of CDS id and start nucleotide
        $END{$segment} = $segment -> end; #create hash of CDS id and end nucleotide
    }

    my @CDS;
    foreach my $segment (sort{$START{$a} <=> $START{$b}} keys %START) #sorts CDS segments by order of starting nucleotide
    {
        my $cds_for_seq = $db->fetch_sequence($gene -> seq_id , $START{$segment}, $END{$segment}); #extracts CDS segment sequence
        push(@CDS, $cds_for_seq); #puts CDS segment sequence to end of array

        #print OUT $segment , "\t" , $START{$segment} , "\t", $END{$segment} ,"\n"; #test to see if CDS segments are ordered correctly
        #print OUT $cds_for_seq , "\n";
    }

    my $CDS_seq = join('', @CDS); #concatenate array of CDS segment sequences

    my $for_obj = Bio::Seq -> new(-seq => $CDS_seq , -alphabet => 'dna'); #load sequence string as sequence object
    my $rev_obj = $for_obj -> revcom; #reverse complement sequence object
    my $rev_seq = $rev_obj -> seq; #convert sequence object to sequence string

    if ($gene -> strand == +1) #check if gene was on plus strand
    {
        print OUT $CDS_seq , "\n";
    }
    elsif ($gene -> strand == -1) #check if gene was on minus strand
    {
        print OUT $rev_seq , "\n";
    }
}

exit;

I haven't tested this on gff files with multiple mRNA predictions per gene. Switching the annotation filter to 'mRNA', rather than 'gene' works on the Brapa genome I've looked at, although it doesn't seem to work on my test gff file.

I still have a few bugs to work out.

score 0 · Answer 5 · 2016-10-06

0

Entering edit mode

8.1 years ago

sutturka ▴ 190

Try bedtools getfasta : http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

ADD COMMENT • link 8.1 years ago by sutturka ▴ 190