Question

Extract fasta sequence from gff3 file

0

Entering edit mode

4.2 years ago

genomes_and_MGEs ▴ 10

Hi everyone,

I have a lot of .gff3 files with the CDS features and below with the fasta sequence. This sequence is separated from the CDS features like this:

##FASTA
>NZ_NZ_LR130533.1

I would like to extract all the fasta sequence into new fasta files, using as output for the fasta file the same name as the input gff3 file. I tried to use bedtools getfasta but that didn't work since I have multiple gff3 files for each input fasta.

Thanks in advance for your help!

sequence • 9.0k views

ADD COMMENT • link updated 17 months ago by aleferna ▴ 10 • written 4.2 years ago by genomes_and_MGEs ▴ 10

1

Entering edit mode

Use gffread utility (LINK).

ADD REPLY • link 4.2 years ago by GenoMax 151k

0

Entering edit mode

Thanks for the answer, but if I understood correctly, this doesn't differ much from bedtools getfasta and as so won't be of much help here. To explain better, I need to create a loop to extract only the fasta sequences inside the gff3 files, and output them into individual fasta files with the same name as the input gff3 files.

ADD REPLY • link 4.2 years ago by genomes_and_MGEs ▴ 10

0

Entering edit mode

Ah I see you have GFF3 files that have sequence in them (not common). Do you know if the sequence ends with a special block (like the ##FASTA before the start)?

ADD REPLY • link 4.2 years ago by GenoMax 151k

0

Entering edit mode

Exactly, it's not common. I just remembered I could use bioawk -c fastx; if you have a better option, please let me know!

ADD REPLY • link 4.2 years ago by genomes_and_MGEs ▴ 10

1

Entering edit mode

If bioawk works then by all means. Otherwise this may need a small parsing program.

ADD REPLY • link 4.2 years ago by GenoMax 151k

0

Entering edit mode

can you post an example line or two?

ADD REPLY • link 4.2 years ago by cpad0112 21k

0

Entering edit mode

There is no example needed. GFF3 sequence section is described in specifications.

GFF3 files can also include sequence in FASTA format at the end of the file.

genomes_and_MGEs : Do you have individual GFF files that contain just one entry/sequence?

ADD REPLY • link 4.2 years ago by GenoMax 151k

0

Entering edit mode

Yes, each individual gff3 file has only a single fasta sequence at the end, right after the CDS features

ADD REPLY • link 4.2 years ago by genomes_and_MGEs ▴ 10

0

Entering edit mode

Hi: To get protein fasta: cufflinks-2.2.1.Linux_x86_64/gffread genome.gff3 -g genome.fa -y genome_pro.fasta

To get cds fasta: cufflinks-2.2.1.Linux_x86_64/gffread genome.gff3 -g genome.fa -x genome_cds.fasta

ADD REPLY • link 4.2 years ago by 956120746 • 0

0

Entering edit mode

This answer is not relevant so I will move it to a comment. OP has a unusual format GFF3 file that contains sequence inside it.

ADD REPLY • link 4.2 years ago by GenoMax 151k

score 1 · Answer 1 · 2021-03-11

Yes, each individual gff3 file has only a single fasta sequence at the end, right after the CDS features

Save this code in a file (file.py).

import sys

togg = False

ou =[]
fi = open(sys.argv[1])
ou = sys.argv[1].split('.')
w = open(ou[0]+".fa",'w')

for line in fi:
    if ">" in line:
        w.write(line)
        togg = True
    elif togg:
        w.write(line)
    else:
        continue

fi.close()
w.close()

Run as follows:

python3 file.py my.gff

Result should be a my.fa with the sequence.

For multiple files at the same time:

for i in *.gff; do echo ${i}; python3 file.py ${i}; done

score 1 · Answer 2 · 2023-12-13

1

Entering edit mode

17 months ago

aleferna ▴ 10

use sed to ignore everything befor the ##FASTA

zcat $gff | sed -n '/^##FASTA/,$p'

ADD COMMENT • link 17 months ago by aleferna ▴ 10