Hi everyone,
I have a lot of .gff3 files that contain the CDS features followed by the fasta sequence. The sequence is separated from the CDS features like this:
##FASTA
>NZ_NZ_LR130533.1
I would like to extract all the fasta sequences into new fasta files, naming each output fasta file after its input gff3 file. I tried to use bedtools getfasta
but that didn't work, since I have multiple gff3 files for each input fasta.
Thanks in advance for your help!
Use the gffread utility (LINK).
Thanks for the answer, but if I understood correctly, this doesn't differ much from
bedtools getfasta
and so won't be of much help here. To explain better, I need a loop that extracts only the fasta sequences inside the gff3 files and writes each one to an individual fasta file with the same name as its input gff3 file.
Ah, I see: you have GFF3 files that have sequence in them (not common). Do you know if the sequence ends with a special block (like the
##FASTA
before the start)?
Exactly, it's not common. I just remembered I could use
bioawk -c fastx
; if you have a better option, please let me know!
If
bioawk
works, then by all means use it. Otherwise this may need a small parsing program.
Can you post an example line or two?
No example is needed: the GFF3 sequence section is described in the specification.
genomes_and_MGEs: Do you have individual GFF files that contain just one entry/sequence?
Yes, each individual gff3 file has only a single fasta sequence at the end, right after the CDS features
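In that case, the "small parsing program" mentioned above could be as short as the Python sketch below. It assumes the sequence section starts at a line reading exactly ##FASTA and runs to the end of the file, as in the example in the question. The function names (extract_fasta, extract_all), the .fasta output extension and the directory names in the usage comment are placeholders of mine, not anything from a tool discussed in this thread.

```python
from pathlib import Path


def extract_fasta(gff_path, out_path):
    """Write everything after the ##FASTA line of gff_path to out_path.

    Returns True if a ##FASTA section was found, False otherwise
    (in which case no output file is created).
    """
    with open(gff_path) as src:
        # Skip the annotation lines until the ##FASTA directive.
        for line in src:
            if line.strip() == "##FASTA":
                break
        else:
            return False  # reached end of file without seeing ##FASTA
        # The file iterator now sits on the first sequence line.
        with open(out_path, "w") as dst:
            dst.writelines(src)
    return True


def extract_all(gff_dir, out_dir):
    """Run extract_fasta on every *.gff3 file in gff_dir.

    Each output keeps the input's base name, with a .fasta extension
    (an assumed naming choice). Returns the list of files written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for gff in sorted(Path(gff_dir).glob("*.gff3")):
        target = out_dir / (gff.stem + ".fasta")
        if extract_fasta(gff, target):
            written.append(target)
    return written


# Example call (directory names are placeholders):
# extract_all("gff_files", "fasta_out")
```

Files without a ##FASTA line are skipped rather than producing empty outputs, so the returned list also tells you which inputs actually contained a sequence.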
Hi: To get protein fasta: cufflinks-2.2.1.Linux_x86_64/gffread genome.gff3 -g genome.fa -y genome_pro.fasta
To get cds fasta: cufflinks-2.2.1.Linux_x86_64/gffread genome.gff3 -g genome.fa -x genome_cds.fasta
This answer is not relevant, so I will move it to a comment. OP has an unusual GFF3 file format that contains the sequence inside it.