Question

How can I extract protein sequences from a GFF file?

1

3.7 years ago

amal.elzemrany ▴ 30

I have a GFF file containing genes predicted by AUGUSTUS. The file already contains CDS, exons, and protein sequences. I need to extract the protein sequences from the file using bash.

sequence bash gff • 2.0k views

updated 3.7 years ago by drew.b.ferrell ▴ 140 • written 3.7 years ago by amal.elzemrany ▴ 30

0

You can use awk. Replace the 1 in $1 with the number of the column that holds your sequences.

awk '{print $1}' my.GFF

3.7 years ago by drew.b.ferrell ▴ 140

1

I am a beginner, and I don't understand why there is already a sequence in the GFF file. However, I need to extract the sequence itself to map it. You can check a screenshot: https://drive.google.com/file/d/1OEw1g0Ayjr7a7yOyPVG0Fsz7rdjmDEPu/view?usp=sharing

3.7 years ago by amal.elzemrany ▴ 30

0

Please post the data itself as text rather than an image of it.

3.7 years ago by cpad0112 21k

score 4 · Accepted Answer · 2021-06-12

Oh, I see. Strange, and more complicated output. Not impossible though. I'm assuming you want the hash (#) and space removed from the beginning of each protein sequence as well.

Let me know if this does the trick for you:

 awk '/# protein sequence/{a=1}/# Evidence/{a=0}a' Genes.gff | sed 's/^# //'

/# protein sequence/ and /# Evidence/ are patterns that match lines containing that text.
/# protein sequence/{a=1} sets the flag a when a line contains # protein sequence.
/# Evidence/{a=0} clears the flag when a line contains # Evidence.
The final a is a pattern with no action, so awk applies the default action, print $0: the line is printed whenever the flag equals 1. Because the flag is cleared before that final pattern is evaluated, the # Evidence line itself is not printed.
Finally, sed removes the hash and space from the beginning of each line.
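
If you would rather end up with FASTA records, the same flag idea can be stretched a little. This is only a sketch under assumptions: the sample below is made up to imitate the comment lines in your screenshot, I am assuming every gene block has a "# start gene <id>" line and a "# Evidence" line after the protein, and the variable names (gene, grab, seq) are my own. If your file has no "# Evidence" lines, the stop pattern needs to change.

```shell
# Build a tiny, made-up sample that imitates AUGUSTUS comment lines (not real data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
# start gene g1
chr1 AUGUSTUS gene 1 90 0.5 + . g1
# protein sequence = [MSTAAKL
# QRSTV]
# Evidence for and against this transcript:
# end gene g1
# start gene g2
chr1 AUGUSTUS gene 200 260 0.9 - . g2
# protein sequence = [MKKLV]
# Evidence for and against this transcript:
# end gene g2
EOF

# Same flag technique as above, but join wrapped lines and print one FASTA record per protein.
fasta=$(awk '
  /^# start gene /      { gene = $4 }               # remember the gene ID (4th field)
  /^# protein sequence/ { grab = 1; seq = "" }      # set the flag, start a new sequence
  /^# Evidence/         { if (grab) print ">" gene "\n" seq; grab = 0 }  # emit, clear flag
  grab {
    line = $0
    sub(/^# (protein sequence = \[)?/, "", line)    # strip leading hash, space, and label
    sub(/\]$/, "", line)                             # strip the closing bracket
    seq = seq line                                   # append this chunk of residues
  }
' "$sample")

printf '%s\n' "$fasta"
rm -f "$sample"
```

To run it on your own data, replace "$sample" with Genes.gff and drop the heredoc and the rm line. The header uses the gene ID only, so a gene with several transcripts would give several records with the same name.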