I have been having trouble converting my GFF3 file for the strawberry genome (Fragaria vesca) to BED12 format, which is required for annotating differentially methylated bases. I have read through several suggested solutions but have not found one that works for my data. However, I came across a script on GitHub (https://github.com/pzross/iver/blob/master/R/bioinfo.R) that requires downloading the gff3ToGenePred and genePredToBed tools from UCSC and running the script in R in order to generate the BED12 file. At the point of generating the genePred file with gff3ToGenePred, I get the following error message:
/tmp/tmp.gff:0: empty GFF file, must have header
/tmp/tmp.gff:0: invalid GFF3 header
GFF3: 2 parser errors
My GFF3 file looks fine (it has 9 columns).
Please, I need help.
Thank you in advance.
If you already have the 2 tools from UCSC, did you try them without R?
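For reference, running the two UCSC tools directly (outside R) could be sketched as below. This is only an illustration: it assumes both executables are on the PATH, the -maxParseErrors option is the one that appears later in this thread, the positional input/output order for genePredToBed is my understanding of its usage message, and all file names are hypothetical.

```python
import subprocess


def build_commands(gff3_path, genepred_path, bed_path, max_parse_errors=None):
    """Return the two command lines (as argument lists) for GFF3 -> genePred -> BED."""
    gff3_cmd = ["gff3ToGenePred"]
    if max_parse_errors is not None:
        # Option spelled as in the command quoted in this thread.
        gff3_cmd.append(f"-maxParseErrors={max_parse_errors}")
    gff3_cmd += [gff3_path, genepred_path]
    # Assumed usage: genePredToBed in.genePred out.bed
    bed_cmd = ["genePredToBed", genepred_path, bed_path]
    return [gff3_cmd, bed_cmd]


def run_conversion(gff3_path, genepred_path, bed_path, max_parse_errors=None):
    """Run both steps in order; raises CalledProcessError if a tool fails."""
    for cmd in build_commands(gff3_path, genepred_path, bed_path, max_parse_errors):
        subprocess.run(cmd, check=True)
```

Calling run_conversion("annotation.gff3", "annotation.gp", "annotation.bed", max_parse_errors=50) would then run both steps, so any parser errors come straight from the tools rather than through the R wrapper.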
michael.ante, I did download the gff3ToGenePred and genePredToBed tools from UCSC through the Anaconda software package. However, when I run the following command in the Anaconda Navigator terminal, I get errors. Below are the command I ran and excerpts from the start and end of the output:
End:
/Users/bruker/Desktop/CpG Rdata/Fragaria_vesca_v4.0.a1.transcripts.gff3:405476: invalid attribute tag, must start with an alphabetic character and be composed of alphanumeric, dash, or underscore characters: _eAED
/Users/bruker/Desktop/CpG Rdata/Fragaria_vesca_v4.0.a1.transcripts.gff3:405476: invalid attribute tag, must start with an alphabetic character and be composed of alphanumeric, dash, or underscore characters: _QI
GFF3: 85764 parser errors
It says that an attribute tag (like ID, Parent, or Name) must start with an alphabetic character. In the second line of your gff3 file, the attributes are:
Thus, _AED is not allowed since it doesn't start with an alphabetic character. You can run a sed command to change it accordingly:
All attribute tags that start with an underscore will then have an x inserted before the underscore.
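The same renaming can also be sketched in Python. This is not the sed command referred to above, just one possible equivalent; it assumes a tab-separated file whose ninth column holds tag=value pairs separated by semicolons, and the function names are made up for illustration. The prefix is deliberately a lower-case x.

```python
def fix_attributes(attributes, prefix="x"):
    """Prefix every attribute tag that starts with an underscore."""
    fixed = []
    for field in attributes.split(";"):
        tag, sep, value = field.partition("=")
        if tag.startswith("_"):
            tag = prefix + tag  # e.g. _AED -> x_AED
        fixed.append(tag + sep + value)
    return ";".join(fixed)


def fix_line(line, prefix="x"):
    """Return a GFF3 line with underscore-led tags renamed in column 9."""
    if line.startswith("#"):
        return line  # directives and comments are left untouched
    body = line.rstrip("\n")
    ending = line[len(body):]  # preserve the original line ending
    columns = body.split("\t")
    if len(columns) != 9:
        return line  # not a feature line; leave as-is
    columns[8] = fix_attributes(columns[8], prefix)
    return "\t".join(columns) + ending


def fix_file(in_path, out_path, prefix="x"):
    """Write a copy of the GFF3 file with the renamed tags."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(fix_line(line, prefix))
```

Writing to a new file rather than editing in place keeps the original annotation available for comparison.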
Interesting,
gff3ToGenePred imposes a peculiarity on the expected gff3 format that does not exist in the official definition of the format.
Maybe it's a requirement for genePred (although not mentioned here)?
michael.ante, I do appreciate your help so far. I was able to introduce an "x" before the underscore. However, I have encountered another challenge in which the converted gff3 file still generates errors. Below is an excerpt of the message:
Command used: gff3ToGenePred -maxParseErrors=50 /Users/bruker/anaconda2/envs/wgbs-cpg/edited.transcripts.gff3 Fragariavesca.GP
Parsing error message:
/Users/bruker/anaconda2/envs/wgbs-cpg/edited.transcripts.gff3:4: unknown standard attribute, user defined attributes must start with a lower-case letter:X_AED
/Users/bruker/anaconda2/envs/wgbs-cpg/edited.transcripts.gff3:4: unknown standard attribute, user defined attributes must start with a lower-case letter:X_eAED
/Users/bruker/anaconda2/envs/wgbs-cpg/edited.transcripts.gff3:4: unknown standard attribute, user defined attributes must start with a lower-case letter:X_QI
I looked into the converted file and realized that the "x" introduced before the underscore was in upper-case despite the fact that I used the lower-case "x". How can I fix this?
I am out of options. Please help.
Why not use the command I suggested, inserting an x instead of an X? Please provide us with a few lines from the beginning of your gff3 file.
Below is the beginning of the gff3 file:
According to the specs, the header should start with 2 '#':
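In other words, the first line should read ##gff-version 3. A quick way to check this before running the converter might look like the sketch below; the function names are hypothetical, and the pattern also accepts a dotted version such as 3.1.26, which is my reading of the GFF3 specification.

```python
import re

# "##gff-version 3", optionally followed by dotted minor/patch numbers.
_HEADER = re.compile(r"^##gff-version\s+3(\.\d+)*\s*$")


def has_gff3_header(first_line):
    """Return True if the line is a valid-looking GFF3 version directive."""
    return bool(_HEADER.match(first_line))


def file_has_gff3_header(path):
    """Check only the first line of the file at `path`."""
    with open(path) as handle:
        return has_gff3_header(handle.readline())
```

A header written with a single # fails this check, which matches the "invalid GFF3 header" error reported at the top of the thread.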
Michael.ante, you are right. I mistakenly omitted one of the # characters when copying the file. The original file header is ##gff-version 3. Thank you for pointing out the error.
zx8754, thanks for editing my gff3 file. It really looks more like the original version.
So, your file looks perfectly fine. It's the most comprehensive gff3 file you can have. Either you are not providing the proper file to your tool (check the path), or the tool expects a particular gff-like file. Maybe the tool doesn't handle the ### lines and sees them as an empty header? You could also try providing only the first record together with the ##gff-version 3 header.
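A small test file for that last suggestion could be produced along these lines. This is a sketch only: it treats "the first record" as everything up to the first ### line, which is an assumption about what is meant, and the function names are hypothetical.

```python
def first_record(lines):
    """Keep the header and feature lines up to (not including) the first ### line."""
    kept = []
    for line in lines:
        if line.strip() == "###":
            break
        kept.append(line)
    return kept


def write_first_record(in_path, out_path):
    """Write the header plus first record of a GFF3 file to a new file."""
    with open(in_path) as src:
        kept = first_record(src)
    with open(out_path, "w") as dst:
        dst.writelines(kept)
```

If the converter accepts this reduced file but rejects the full one, the ### lines or a later record are the more likely culprit; if it still reports an empty file, the path being passed to the tool is worth checking first.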
If you are using R already, package rtracklayer should be able to do the same.
I did use rtracklayer as one of the packages for this conversion process but the problem arose when I was running a script to create an intermediate genepred file.
Otherwise, I have a Perl script that should do the work. It's called gff2bed.pl and is in the GAAS repository.