Extract fasta sequence from gff3 file
2
0
Entering edit mode
3.7 years ago

Hi everyone,

I have a lot of .gff3 files with the CDS features and below with the fasta sequence. This sequence is separated from the CDS features like this:

##FASTA
>NZ_NZ_LR130533.1

I would like to extract all the fasta sequence into new fasta files, using as output for the fasta file the same name as the input gff3 file. I tried to use bedtools getfasta but that didn't work since I have multiple gff3 files for each input fasta.

Thanks in advance for your help!

sequence • 7.7k views
ADD COMMENT
1
Entering edit mode

Use gffread utility (LINK).

ADD REPLY
0
Entering edit mode

Thanks for the answer, but if I understood correctly, this doesn't differ much from bedtools getfasta and as so won't be of much help here. To explain better, I need to create a loop to extract only the fasta sequences inside the gff3 files, and output them into individual fasta files with the same name as the input gff3 files.

ADD REPLY
0
Entering edit mode

Ah I see you have GFF3 files that have sequence in them (not common). Do you know if the sequence ends with a special block (like the ##FASTA before the start)?

ADD REPLY
0
Entering edit mode

Exactly, it's not common. I just remembered I could use bioawk -c fastx; if you have a better option, please let me know!

ADD REPLY
1
Entering edit mode

If bioawk works then by all means. Otherwise this may need a small parsing program.

ADD REPLY
0
Entering edit mode

can you post an example line or two?

ADD REPLY
0
Entering edit mode

There is no example needed. GFF3 sequence section is described in specifications.

GFF3 files can also include sequence in FASTA format at the end of the file.

genomes_and_MGEs : Do you have individual GFF files that contain just one entry/sequence?

ADD REPLY
0
Entering edit mode

Yes, each individual gff3 file has only a single fasta sequence at the end, right after the CDS features

ADD REPLY
0
Entering edit mode

Hi: To get protein fasta: cufflinks-2.2.1.Linux_x86_64/gffread genome.gff3 -g genome.fa -y genome_pro.fasta

To get cds fasta: cufflinks-2.2.1.Linux_x86_64/gffread genome.gff3 -g genome.fa -x genome_cds.fasta

ADD REPLY
0
Entering edit mode

This answer is not relevant so I will move it to a comment. OP has a unusual format GFF3 file that contains sequence inside it.

ADD REPLY
1
Entering edit mode
3.7 years ago
GenoMax 147k

Yes, each individual gff3 file has only a single fasta sequence at the end, right after the CDS features

Save this code in a file (file.py).

import sys

togg = False

ou =[]
fi = open(sys.argv[1])
ou = sys.argv[1].split('.')
w = open(ou[0]+".fa",'w')

for line in fi:
    if ">" in line:
        w.write(line)
        togg = True
    elif togg:
        w.write(line)
    else:
        continue

fi.close()
w.close()

Run as follows:

python3 file.py my.gff

Result should be a my.fa with the sequence.

For multiple files at the same time:

for i in *.gff; do echo ${i}; python3 file.py ${i}; done
ADD COMMENT
1
Entering edit mode
11 months ago
aleferna ▴ 10

use sed to ignore everything befor the ##FASTA

zcat $gff | sed -n '/^##FASTA/,$p'
ADD COMMENT

Login before adding your answer.

Traffic: 1629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6