Question

How to get fasta sequences of features from a GFF3 file?

1

Entering edit mode

7.3 years ago

Lina F ▴ 200

Hi all,

I am interested in pulling out "CDS" fasta sequences from a GFF3 file. My GFF3 file lists one or more fasta sequences at the end.

I looked into using the gffutils python package but according to this documentation, it does not deal with fasta sequences listed at the end of the GFF3 file.

Am I stuck writing my own parser, or is there a better solution?

Thanks for any suggestions!

EDITED to add a snippet of my GFF file:

##gff-version 3
##sequence-region Consensus_10_consensus_sequence_1 1 645439
Consensus_10_consensus_sequence_1   feature gene    551 2104    .   +   .   name=gene
Consensus_10_consensus_sequence_1   feature CDS 551 2104    .   +   0   name=YALI0C06512p CDS
...
Consensus_9_consensus_sequence_4    feature mRNA    9625    9891    .   +   .   name=mRNA
Consensus_9_consensus_sequence_4    feature CDS 9625    9891    .   +   0   name=YALI0A21351p CDS
##FASTA
>Consensus_10_consensus_sequence_1 <unknown description>
TGCCACCTCCAAATTAACTCTCGCTTATTTCTTGTACCTGTCATATCACGTGATGTAGCT
TCCCAATCAAGAGCGGATCCTGCCTGTTTGGCTGCGTGGGTTTGCGTCTTCTTTCCGTTT
GAAGCAGTGGTATTATTCCCCCATTGTGCCAAAAGTAATGCTGAAAAGATGCCCACGAAT
>Consensus_10_consensus_sequence_2 <unknown description>
CCGAAACCACAGCCATGATCAGAGTCACTCCTATTCCAACCCCGCCAACTGCCGTGGGGT
ACACGTTATATTGGGTAGGTGTGTATCCTTGAGACTTGAGCCAGGAGATCATGGAAGGTT
GGGTGCTGGCAATACAGGTGTTATTGTAGCAAAGAAAGATGAGAGAGAAGAAATAAATGT
GCCAGGTTTTCAAAGAGCGTTTCAAAACTTGCAAGAACGGCTCTTCTTTGTTACCGGTAT
CGCCGATTTGTGCTCTTCTCTCCTTGGCTATAGCAATGTCTTCCTTGGTGAAGTACCAGC
TGGTAGTTGTCTCCGGTGTGTTTGGGTTGACGAACATGGTGTAAAGTGCCACTGGAAATG
>Consensus_10_consensus_sequence_2 <unknown description>
CACTCTCAAGCATTTAGGAACTTGTCAAGAGGTTCAAAGGTTGGAACTTGCAACTGAACT
GATCGCAACATAATCACCGCTATTAAACCCTGATTACAAGTGCTTCTGATTGCATCCACA
GTTCATTTCCATGGGCTAGGCTATACGAAAATACAAGGATTAGAAACTATATACAATTGA
CTCTGCAATCTTTCCCGCTAAACGGTGGTGTGGTTATGACCTGGCTCGTGTTCATGGCCG
...

gff3 fasta parser • 8.6k views

ADD COMMENT • link updated 7.3 years ago by Matt Shirley 10k • written 7.3 years ago by Lina F ▴ 200

0

Entering edit mode

Please post a snippet of your GFF3 file.

ADD REPLY • link 7.3 years ago by Alex Reynolds 36k

0

Entering edit mode

7.3 years ago

prasundutta87 ▴ 670

As you have the coordinates of your sequence of interest, you can use samtools faidx.

You first have to make an index of your reference genome using samtools faidx itself after which you can then use the same command to extract your sequence of interest..the usage example available from the command line of samtools faidx is helpful..

ADD COMMENT • link 7.3 years ago by prasundutta87 ▴ 670

0

Entering edit mode

If the FASTA sequence is already in the attributes field, it may be a simple parsing step to excise that info.

ADD REPLY • link 7.3 years ago by Alex Reynolds 36k

0

Entering edit mode

7.3 years ago

bartosovic.marek ▴ 30

bedtools getfasta can do that

use -s to force strandness

http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

bedtools getfasta -fi yourgenome.fa -fo out.fa -bed yourgff.bed -s

ADD COMMENT • link 7.3 years ago by bartosovic.marek ▴ 30

score 5 · Accepted Answer · 2017-08-01

5

Entering edit mode

7.3 years ago

Matt Shirley 10k

You should just split your file into two separate files, a GFF and FASTA. Then you can do:

ADD COMMENT • link 7.3 years ago by Matt Shirley 10k

0

Entering edit mode

this approach seemed easiest -- thanks!

ADD REPLY • link 7.3 years ago by Lina F ▴ 200

0

Entering edit mode

How would you translate it to the amino acid sequence?

ADD REPLY • link 5.3 years ago by Ric ▴ 440

0

Entering edit mode

Biopython? They have plethora of methods mimicking the central dogma essentially

ADD REPLY • link 5.1 years ago by manaswwm ▴ 550