I have this gff file that doesn't seem to conform to the format I expect. The problem is with the last column (although it does seem some row have more tabs if I split by tab). here is what I have:
scaffold10x_1000_pilon AUGUSTUS gene 12711 22079 0.67 - . g1
scaffold10x_1000_pilon AUGUSTUS transcript 12711 22079 0.47 - . g1.t1
scaffold10x_1000_pilon AUGUSTUS stop_codon 12711 12713 . - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS intron 13044 13486 0.89 - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS intron 13936 21904 0.5 - . transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS CDS 12711 13043 0.99 - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS CDS 13487 13935 0.64 - 2 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS CDS 21905 22079 0.67 - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS start_codon 22077 22079 . - 0 transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS transcript 12711 14150 0.2 - . g1.t2
scaffold10x_1000_pilon AUGUSTUS stop_codon 12711 12713 . - 0 transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS intron 13044 13486 0.91 - . transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS intron 13936 14128 0.2 - . transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS CDS 12711 13043 0.96 - 0 transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS CDS 13487 13935 0.45 - 2 transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS CDS 14129 14150 0.21 - 0 transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS start_codon 14148 14150 . - 0 transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon AUGUSTUS gene 41722 42102 0.32 + . g2
scaffold10x_1000_pilon AUGUSTUS transcript 41722 42102 0.32 + . g2.t1
scaffold10x_1000_pilon AUGUSTUS start_codon 41722 41724 . + 0 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1000_pilon AUGUSTUS CDS 41722 42102 0.32 + 0 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1000_pilon AUGUSTUS stop_codon 42100 42102 . + 0 transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1000_pilon AUGUSTUS gene 106074 106640 1 + . g3
scaffold10x_1000_pilon AUGUSTUS transcript 106074 106640 1 + . g3.t1
scaffold10x_1000_pilon AUGUSTUS start_codon 106074 106076 . + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold10x_1000_pilon AUGUSTUS CDS 106074 106640 1 + 0 transcript_id "g3.t1"; gene_id "g3";
scaffold10x_1000_pilon AUGUSTUS stop_codon 106638 106640 . + 0 transcript_id "g3.t1"; gene_id "g3";
and here is what I would like to have:
scaffold10x_1000_pilon AUGUSTUS gene 12711 22079 0.67 - . ID=g1
scaffold10x_1000_pilon AUGUSTUS transcript 12711 22079 0.47 - . ID=g1.t1;Parent=g1
scaffold10x_1000_pilon AUGUSTUS stop_codon 12711 12713 . - 0 Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS intron 13044 13486 0.89 - . Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS intron 13936 21904 0.5 - . Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS CDS 12711 13043 0.99 - 0 ID=g1.t1.cds;Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS CDS 13487 13935 0.64 - 2 ID=g1.t1.cds;Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS CDS 21905 22079 0.67 - 0 ID=g1.t1.cds;Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS start_codon 22077 22079 . - 0 Parent=g1.t1
scaffold10x_1000_pilon AUGUSTUS transcript 12711 14150 0.2 - . ID=g1.t2;Parent=g1
scaffold10x_1000_pilon AUGUSTUS stop_codon 12711 12713 . - 0 Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS intron 13044 13486 0.91 - . Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS intron 13936 14128 0.2 - . Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS CDS 12711 13043 0.96 - 0 ID=g1.t2.cds;Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS CDS 13487 13935 0.45 - 2 ID=g1.t2.cds;Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS CDS 14129 14150 0.21 - 0 ID=g1.t2.cds;Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS start_codon 14148 14150 . - 0 Parent=g1.t2
scaffold10x_1000_pilon AUGUSTUS gene 41722 42102 0.32 + . ID=g2
scaffold10x_1000_pilon AUGUSTUS transcript 41722 42102 0.32 + . ID=g2.t1;Parent=g2
scaffold10x_1000_pilon AUGUSTUS start_codon 41722 41724 . + 0 Parent=g2.t1
scaffold10x_1000_pilon AUGUSTUS CDS 41722 42102 0.32 + 0 ID=g2.t1.cds;Parent=g6.t1
scaffold10x_1000_pilon AUGUSTUS stop_codon 42100 42102 . + 0 Parent=g2.t1
scaffold10x_1000_pilon AUGUSTUS gene 106074 106640 1 + . ID=g3
scaffold10x_1000_pilon AUGUSTUS transcript 106074 106640 1 + . ID=g3.t1;Parent=g3
scaffold10x_1000_pilon AUGUSTUS start_codon 106074 106076 . + 0 Parent=g3.t1
scaffold10x_1000_pilon AUGUSTUS CDS 106074 106640 1 + 0 ID=g3.t1.cds;Parent=g3.t1
scaffold10x_1000_pilon AUGUSTUS stop_codon 106638 106640 . + 0 Parent=g3.t1
I have tried to use grep and sed commands in linux, could seem to get it done.then resolved to python( I'm new to all these)
speaking of python, I'm trying to read the file in tabs, then kind of use index base on columns, work with column 3 and 9, which I could index as data[2] and data[8].
Here is what I wrote, I know it may not work like that though, just my idea, I'm a bit new to python too.
data = open("my my_bad_gff", 'r')
new_file = ''
for line in data:
columns = line.rstrip("\n").split("\t")
scaffold = columns[0]
source = columns[1]
feature = columns[2]
start = columns[3]
end = columns[4]
score = columns[5]
strand = columns[6]
frame = columns[7]
attribute = columns[8]
if feature == 'gene': #Im trying to take the row called gene, and assign its content as x, which is g1 in this case
a = str(columns[8])
b = 'ID='+ a #which I think should give me ID=g1
if feature == 'transcript':
columns[8] = a + '.t1' + ';Parent=' + a # hopint it gives me ID=g1.t1;Parent=g1, but how can i make sure '.t1' is not fixes, since transcript number can ncrease for each gene
if feature == 'intron' and 'start_codon' and 'stop_codon':
columns [8] = 'Parent=' + a + '.t1'# should give me Parent=g1.t1
d = columns [8]
if feature == 'CDS':
columns[8] = a + '.t1' + 'cds;' + d #hoping this gives me ID=g1.t1.cds;Parent=g1.t1
new_file.append(data)
would appreciate any idea that could help me. thanks
There is no way to activate the option --gff3=yes of augustus ? That would be convenient...
Otherwise as mentioned in the other post (C: GFF3 file incompatibilities (source: BRAD)) you could use the gxf_to_gff3.pl script from here: https://github.com/NBISweden/GAAS/tree/master/annotation/Tools/bin but you will have to install bioperl and some perl modules. Here are the explanations: https://github.com/NBISweden/GAAS/tree/master/annotation/Tools/Util/gff
Otherwise I'm sure someone will give you a nice bash command to do that in a brute approach.
if I run Augustus as a stand alone with the --gff3=on or =yes, not sure which now, that works, in fact, that's what im trying to confirm my output in this question to. But in Braker ( uses Genemark and Augustus), not sure I can do the --gff3 on option
Truth is, I'm just still in the active learning phase regarding bioinformatics. I have checked the link above. is it possible to clone directly into the GAAS-maser using a wget command line.
I would try out the Perl option too
to clone the repository: git clone https://github.com/NBISweden/GAAS.git