Question

What is GFF3 and how to extract genomic sequences from it?

6

Entering edit mode

10.6 years ago

MAPK ★ 2.1k

Hi everybody,

I am new in bioinformatics and recently I am learning all these new things which is very exciting. I now have another challenge in working with the genome data. I have Tribolium genome from beetle base. I also have this GFF3 data from the same FTP site. I need to extract full length gene (intron, exons, noncoding regions...) using these two data. Could any of you please enlighten me on getting this done? Thank you for your help!

sequence gene Gff3 genomic • 15k views

ADD COMMENT • link updated 3.2 years ago by Ram 44k • written 10.6 years ago by MAPK ★ 2.1k

0

Entering edit mode

Given links doesn't seem to be working.

ADD REPLY • link 10.6 years ago by PoGibas 5.1k

0

Entering edit mode

Thanks for replying, I just corrected the link.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.6 years ago by MAPK ★ 2.1k

Ram · Answer 1 · 2014-04-23

Generally, you could use conversion tools in genomic toolkits (like gff2bed) to extract data quickly:

$ gff2bed < annotations.gff | grep -w 'gene' | cut -f1-4 > genes.bed

For example:

$ gff2bed < ChLG2_official_gene.gff3 | grep -w 'gene' | cut -f1-4
ChLG2    4835    6220    TCOGS2:TC004531
ChLG2    10132    20674    TCOGS2:TC004850
ChLG2    24354    25482    TCOGS2:TC004532
ChLG2    25740    30300    TCOGS2:TC004849
ChLG2    30769    31550    TCOGS2:TC004533
ChLG2    35291    36284    TCOGS2:TC004534
ChLG2    36824    39385    TCOGS2:TC004535
ChLG2    64940    65360    TCOGS2:TC004848
ChLG2    76047    76227    TCOGS2:TC004847
ChLG2    77616    90710    TCOGS2:TC004846
...

Once you have these coordinate ranges for your genes, you can query each one against an indexed FASTA file. One approach is discussed in another answer in this thread.

You could do something like the following to get a cleaned-up FASTA file and index:

$ wget ftp://ftp.bioinformatics.ksu.edu//pub/BeetleBase/3.0/Tribolium_genome_sequence.fasta -O - \
   | sed '/^$/d' \
   > Tribolium_genome_sequence.fasta

$ samtools faidx Tribolium_genome_sequence.fasta

(It looks like there are some blank lines in the FASTA file, which causes samtools to segfault. The sed statement strips those lines.)

Once cleaned and indexed, you could do lookups of gene sequences, appending results to the fifth column of the genes.bed file:

$ awk ' \
    { \
        range = $1":"$2"-"$3; \
        system("samtools faidx Tribolium_genome_sequence.fasta "$range); \
    }' genes.bed \
    > genes.fa

$ awk ' \
    { \
        if (/^>/) { \
            printf("\n"); \
        } \
        else { \
            printf("%s", $0); \
        } \
    }' genes.fa \
    | tail -n +2 - \
    | paste genes.bed - \
    > answer.bed

The file answer.bed could be fairly large. It might be preferable to extract subsets of genes of interest from genes.bed using tools like BEDOPS bedops or bedmap, and then do a lookup on the subset, instead of the entire set. (On the other hand, if you build your answer file once, you can always just do repeated lookups on that end product.)

score 3 · Answer 2 · 2014-04-23

3

Entering edit mode

10.6 years ago

Devon Ryan 104k

For some of that the path of least resistance is to simply use gffread from cufflinks, which can directly accept an annotation and a genome sequence in a fasta file and be told to spit out full sequences. I don't recall gffread being able to output intronic sequences (maybe it can, though I doubt it since that'd normally not be useful). For that, one could simply tweak the R (my post) or perl (Alejandro Reyes' post) scripts in this post on seqanswers.

ADD COMMENT • link 10.6 years ago by Devon Ryan 104k

0

Entering edit mode

Seems the gffread link is now broken, just a head's up

ADD REPLY • link 7.7 years ago by a1ultima ▴ 850

0

Entering edit mode

Here are updated links to the program description and it's corresponding GitHub repository.

ADD REPLY • link 6.4 years ago by mmfansler ▴ 460

Ram · Answer 3 · 2014-04-23

You can get start and end coordinates of a gene from gff3 file.

For example:

ChLG10  TCOGS2  gene    11381335    11383395    .   -   .   ID=TCOGS2:TC004355;Name=TC004355;Alias=GLEAN_04355

The start and end coordinates for this gene are 11381335 and 11383395 respectively.

Now you can extract the sequence from the reference fasta file using these coordinates. You can use samtools faidx command. See below for the faidx description from samtools.

samtools faidx <ref.fasta> [region1 [...]]
Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.faion the disk. If regions are speficified, the subsequences will be retrieved and printed to stdout in the FASTA format. The input file can be compressed in the RAZF format.

score 0 · Answer 4 · 2014-04-24

0

Entering edit mode

10.6 years ago

MAPK ★ 2.1k

I found this which is very useful:

https://usegalaxy.org/root/index

it has feature to fetch sequences using gff3 file.

ADD COMMENT • link 10.6 years ago by MAPK ★ 2.1k

Ram · Answer 5 · 2014-09-09

The AEGeAn Toolkit includes the xtractore program, designed precisely for this task.

[standage@lappy AEGeAn] bin/xtractore -h

xtractore: extract sequences corresponding to annotated features from the
           given sequence file

Usage: xtractore [options] features.gff3 sequences.fasta
  Options:
    -h|--help             print this help message and exit
    -i|--idfile: FILE     file containing a list of feature IDs (1 per line
                          with no spaces); if provided, only features with
                          IDs in this file will be extracted
    -o|--outfile: FILE    file to which output sequences will be written;
                          default is terminal (stdout)
    -t|--type: STRING     feature type to extract; can be used multiple
                          times to extract features of multiple types
    -v|--verbose          print verbose warning and error messages
    -w|--width: INT       width of each line of sequence in the Fasta
                          output; default is 80; set to 0 for no
                          formatting