I am looking at the yeast reference annotation (in gff3 format) downloaded from either SGD or Ensembl fungi. In both cases, the gff3 file appears to contain weird characters in the attributes field, which cause me a world of trouble downstream. Example:
chrXVI SGD gene 174343 174756 . - . ID=YPL197C;Name=YPL197C;Ontology_term=GO:0003674,GO:0005575,GO:0008150;Note=Dubious%20open%20reading%20frame%3B%20unlikely%20to%20encode%20a%20functional%20protein%2C%20based%20on%20available%20experimental%20and%20comparative%20sequence%20data%3B%20partially%20overlaps%20the%20ribosomal%20gene%20RPB7B;display=Dubious%20open%20reading%20frame;dbxref=SGD:S000006118;orf_classification=Dubious
See the "%20" and "%3B" characters?
As far as I understand, these are UTF-8 hex representations of certain characters, but why are they included this way, and how can I get rid of them?
To view the full gff file, download the genome release, extract the tar.gz, and look at the file saccharomyces_cerevisiae_R64-2-1_20150113.gff
%20 represents a space, and you could replace it with _ using sed.
This looks like HTML/JavaScript (URL) encoding of, for instance, a space (%20), and so on.
You could look up what they encode and 'translate' them back. However, white space might also cause problems downstream, so it is perhaps better to translate them to something else (_ for instance).
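For background: this is URL-style percent-encoding, which the GFF3 specification requires for characters that have a reserved meaning in the attributes column (such as ';', '=', '&' and ','). A minimal Python sketch of 'translating them back' follows. The function name parse_attributes and its space_replacement parameter are made up for illustration, and the example string is a shortened version of the record above. The field is split on ';' before decoding, so a decoded ';' inside a value cannot be mistaken for a separator. Multi-valued attributes (comma-separated, like Ontology_term) are kept as one string here.

```python
from urllib.parse import unquote


def parse_attributes(field, space_replacement=None):
    """Split a GFF3 attributes field (column 9) into a dict of decoded values.

    Hypothetical helper: splits on ';' first, then decodes each key and value.
    """
    attributes = {}
    for pair in field.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing ';'
        key, _, value = pair.partition("=")
        decoded = unquote(value)  # '%20' -> ' ', '%3B' -> ';', '%2C' -> ','
        if space_replacement is not None:
            decoded = decoded.replace(" ", space_replacement)
        attributes[unquote(key)] = decoded
    return attributes


# Shortened version of the attributes field from the record above.
example = (
    "ID=YPL197C;"
    "Note=Dubious%20open%20reading%20frame%3B%20unlikely%20to%20encode;"
    "display=Dubious%20open%20reading%20frame"
)

attrs = parse_attributes(example)
print(attrs["Note"])

# Same parse, but with spaces turned into underscores as suggested above.
print(parse_attributes(example, space_replacement="_")["display"])
```

Decoding only after the split is the important part; decoding the whole line first is what turns "%3B" into a stray field separator.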
What type of analysis are you doing where those characters cause trouble?
I had issues working with the gff using the Python package gffutils, since it decodes the percent-encoding, so if I have e.g. "%3B" (the encoding of ';'), it creates an invalid feature record. In any case, I removed all occurrences of "%3B" using sed and this seems to solve the issue.
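For anyone who prefers to stay in Python, here is a rough equivalent of that sed clean-up. It is only a sketch: the function name drop_encoded_semicolons is invented for this example, and it assumes the usual nine tab-separated GFF3 columns. It touches only the attributes column and leaves comment lines and anything that is not a nine-column record (e.g. an embedded FASTA section) unchanged.

```python
def drop_encoded_semicolons(line, replacement=""):
    """Remove (or replace) percent-encoded semicolons in column 9 of a GFF3 line.

    Hypothetical helper, assuming tab-separated columns as in standard GFF3.
    """
    if line.startswith("#"):
        return line  # directives and comments are left alone
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return line  # not a feature record (e.g. FASTA sequence lines)
    # Percent-encoding hex digits may be upper or lower case.
    columns[8] = columns[8].replace("%3B", replacement).replace("%3b", replacement)
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(columns) + newline


record = (
    "chrXVI\tSGD\tgene\t174343\t174756\t.\t-\t.\t"
    "ID=YPL197C;Note=Dubious%20open%20reading%20frame%3B%20unlikely\n"
)
print(drop_encoded_semicolons(record), end="")
```

To clean a whole file, apply the function to each line while reading the input and write the result to a new file. Passing e.g. replacement="%2C" would keep a (still encoded) comma in place of each semicolon instead of deleting it.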