I have downloaded a list of coordinates of yeast genes from Xu et al., 2009 (see table S3). Unfortunately, it is not in a standard format, so it does not appear to be compatible with the programs I would like to use, i.e. HOMER, bedops or bedtools. Could anyone help me convert it to GFF format using Unix tools or R (other languages are also welcome if the code can simply be copied and pasted)? I tried to recreate the format shown on the Ensembl website, but these programs still did not recognize the result as GFF. Here is the beginning of the file (there are actually ~7K lines):
ID chr strand start end type name commonName endConfidence source
ST0001 1 + 9369 9601 SUTs SUT001 SUT001 bothEndsMapped Manual
ST0002 1 + 30073 30905 CUTs CUT001 CUT001 bothEndsMapped Automatic
ST0003 1 + 31153 32985 ORF-T YAL062W GDH3 bothEndsMapped Manual
ST0004 1 + 33361 34897 ORF-T YAL061W BDH2 bothEndsMapped Manual
ST0005 1 + 35097 36393 ORF-T YAL060W BDH1 bothEndsMapped Manual
ST0006 1 + 36545 37329 ORF-T YAL059W ECM1 bothEndsMapped Manual
ST0007 1 + 37409 39033 ORF-T YAL058W CNE1 bothEndsMapped Manual
ST0008 1 + 39217 41969 ORF-T YAL056W GPB2 bothEndsMapped Manual
ST0009 1 + 42161 42833 ORF-T YAL055W PEX22 bothEndsMapped Manual
I am very new to bioinformatics, but it looks to me like this task could be done simply by using a regular expression to extract the information you need and reformat it. Separate the chromosome name (chrX), start and end with tabs (\t).
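A rough sketch of that idea in Python, in case it helps. It splits each row on whitespace instead of using a regular expression and writes the nine tab-separated GFF3 columns. Several things here are my assumptions, not something taken from the paper: the function name `table_to_gff3`, the `chr` prefix on chromosome names, the source label `Xu2009`, the generic feature type `transcript`, the custom `transcript_type` attribute, and that the coordinates are already 1-based and inclusive as GFF expects. It also assumes every row has at least the first eight columns.

```python
def table_to_gff3(lines, chrom_prefix="chr", source="Xu2009"):
    """Yield GFF3 lines from whitespace-separated rows of the table.

    chrom_prefix and source are illustrative defaults (assumptions);
    change them to match the genome and naming you actually use.
    """
    yield "##gff-version 3"
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header row, which starts with "ID".
        if not fields or fields[0] == "ID":
            continue
        tid, chrom, strand, start, end, ttype, name, common = fields[:8]
        # Keep the original category (SUTs, CUTs, ORF-T) as an attribute.
        attributes = (
            f"ID={tid};Name={name};Alias={common};transcript_type={ttype}"
        )
        # GFF3 columns: seqid, source, type, start, end, score,
        # strand, phase, attributes.
        yield "\t".join([
            chrom_prefix + chrom, source, "transcript",
            start, end, ".", strand, ".", attributes,
        ])


# Small demonstration using two rows from the question.
sample = [
    "ID chr strand start end type name commonName endConfidence source",
    "ST0001 1 + 9369 9601 SUTs SUT001 SUT001 bothEndsMapped Manual",
    "ST0003 1 + 31153 32985 ORF-T YAL062W GDH3 bothEndsMapped Manual",
]
for gff_line in table_to_gff3(sample):
    print(gff_line)
```

To run it on the real table, pass an open file handle instead of `sample` and write the output lines to a new file. One thing to check before using the result with bedtools or HOMER: the chromosome names must match those in your genome files. UCSC-style yeast assemblies use Roman numerals (chrI to chrXVI), so `chr1` may need to be mapped accordingly.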