I have a GFF file containing genes predicted by AUGUSTUS; the file already contains CDS, exons, and protein sequences.
I need to extract the protein sequences from the file using bash.
Oh, I see. Strange, and more complicated output. Not impossible though. I'm assuming you want the hash (#) and space removed from the beginning of each protein sequence as well.
Let me know if this does the trick for you:
awk '/# protein sequence/{a=1}/# Evidence/{a=0}a' Genes.gff | sed 's/^# //'
/# protein sequence/ and /# Evidence/ are patterns; each matches lines containing that text.
/# protein sequence/{a=1} sets the flag when the text # protein sequence is found.
/# Evidence/{a=0} unsets the flag when the text # Evidence is found.
The final a is a pattern with the default action, which is to print $0: if the flag equals 1, the line is printed. The # Evidence line itself is not printed, because the flag is cleared before the final pattern is tested.
Finally, sed removes the hash and space from the beginning of each line.
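To make the flag logic concrete, here is a small sketch run against a made-up file. The file name sample.gff, its contents, and the function name extract_protein_lines are all assumptions for illustration; real AUGUSTUS output may be laid out differently (for example, a run without hints may have no # Evidence line, in which case a different end marker would be needed). The sed pattern is anchored with ^ so that only a leading "# " is removed.

```shell
# Hypothetical sample imitating the comment block AUGUSTUS writes after each gene.
cat > sample.gff <<'EOF'
chr1 AUGUSTUS gene 1 300 0.9 + . g1
chr1 AUGUSTUS CDS 1 300 0.9 + 0 transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MSEGNAAGEPSTPGGPRPLLTGARGLIGRRPAPP
# LTPGRLPSIRSRDLTLGGVKK]
# Evidence for and against this transcript:
# % of transcript supported by hints (any source): 0
# end gene g1
EOF

# Illustrative helper (not part of any tool): print the lines between the
# "# protein sequence" marker (inclusive) and the "# Evidence" marker
# (exclusive), then strip the leading "# " from each printed line.
extract_protein_lines() {
  awk '/# protein sequence/{a=1} /# Evidence/{a=0} a' "$1" | sed 's/^# //'
}

extract_protein_lines sample.gff
```

On this sample, only the two sequence lines should come out, still carrying the "protein sequence = [" prefix and the closing "]", so a further clean-up step is needed if you want the bare residues.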
You can use awk. Replace 1 with the column number where your sequences are.
I am a beginner, and I don't understand why there is already a sequence in the GFF file. However, I need to extract the sequence itself in order to map it. You can check a screenshot: https://drive.google.com/file/d/1OEw1g0Ayjr7a7yOyPVG0Fsz7rdjmDEPu/view?usp=sharing
Please do not post images of the data; post the data itself as text.