Question

How can I use rewrite the contents of the attribute column in a gff output from AUGUSTUS through BRAKER

0

Entering edit mode

7.3 years ago

Kay • 0

I have this gff file that doesn't seem to conform to the format I expect. The problem is with the last column (although it does seem some row have more tabs if I split by tab). here is what I have:

scaffold10x_1000_pilon  AUGUSTUS        gene    12711   22079   0.67    -       .       g1
scaffold10x_1000_pilon  AUGUSTUS        transcript      12711   22079   0.47    -       .       g1.t1
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      12711   12713   .       -       0       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        intron  13044   13486   0.89    -       .       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        intron  13936   21904   0.5     -       .       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        CDS     12711   13043   0.99    -       0       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        CDS     13487   13935   0.64    -       2       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        CDS     21905   22079   0.67    -       0       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        start_codon     22077   22079   .       -       0       transcript_id "g1.t1"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        transcript      12711   14150   0.2     -       .       g1.t2
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      12711   12713   .       -       0       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        intron  13044   13486   0.91    -       .       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        intron  13936   14128   0.2     -       .       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        CDS     12711   13043   0.96    -       0       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        CDS     13487   13935   0.45    -       2       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        CDS     14129   14150   0.21    -       0       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        start_codon     14148   14150   .       -       0       transcript_id "g1.t2"; gene_id "g1";
scaffold10x_1000_pilon  AUGUSTUS        gene    41722   42102   0.32    +       .       g2
scaffold10x_1000_pilon  AUGUSTUS        transcript      41722   42102   0.32    +       .       g2.t1
scaffold10x_1000_pilon  AUGUSTUS        start_codon     41722   41724   .       +       0       transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1000_pilon  AUGUSTUS        CDS     41722   42102   0.32    +       0       transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      42100   42102   .       +       0       transcript_id "g2.t1"; gene_id "g2";
scaffold10x_1000_pilon  AUGUSTUS        gene    106074  106640  1       +       .       g3
scaffold10x_1000_pilon  AUGUSTUS        transcript      106074  106640  1       +       .       g3.t1
scaffold10x_1000_pilon  AUGUSTUS        start_codon     106074  106076  .       +       0       transcript_id "g3.t1"; gene_id "g3";
scaffold10x_1000_pilon  AUGUSTUS        CDS     106074  106640  1       +       0       transcript_id "g3.t1"; gene_id "g3";
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      106638  106640  .       +       0       transcript_id "g3.t1"; gene_id "g3";

and here is what I would like to have:

scaffold10x_1000_pilon  AUGUSTUS        gene    12711   22079   0.67    -       .       ID=g1
scaffold10x_1000_pilon  AUGUSTUS        transcript      12711   22079   0.47    -       .       ID=g1.t1;Parent=g1
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      12711   12713   .       -       0       Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        intron  13044   13486   0.89    -       .       Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        intron  13936   21904   0.5     -       .       Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        CDS     12711   13043   0.99    -       0       ID=g1.t1.cds;Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        CDS     13487   13935   0.64    -       2       ID=g1.t1.cds;Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        CDS     21905   22079   0.67    -       0       ID=g1.t1.cds;Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        start_codon     22077   22079   .       -       0       Parent=g1.t1
scaffold10x_1000_pilon  AUGUSTUS        transcript      12711   14150   0.2     -       .       ID=g1.t2;Parent=g1
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      12711   12713   .       -       0       Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        intron  13044   13486   0.91    -       .       Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        intron  13936   14128   0.2     -       .       Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        CDS     12711   13043   0.96    -       0       ID=g1.t2.cds;Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        CDS     13487   13935   0.45    -       2       ID=g1.t2.cds;Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        CDS     14129   14150   0.21    -       0       ID=g1.t2.cds;Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        start_codon     14148   14150   .       -       0       Parent=g1.t2
scaffold10x_1000_pilon  AUGUSTUS        gene    41722   42102   0.32    +       .       ID=g2
scaffold10x_1000_pilon  AUGUSTUS        transcript      41722   42102   0.32    +       .       ID=g2.t1;Parent=g2
scaffold10x_1000_pilon  AUGUSTUS        start_codon     41722   41724   .       +       0       Parent=g2.t1
scaffold10x_1000_pilon  AUGUSTUS        CDS     41722   42102   0.32    +       0       ID=g2.t1.cds;Parent=g6.t1
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      42100   42102   .       +       0       Parent=g2.t1
scaffold10x_1000_pilon  AUGUSTUS        gene    106074  106640  1       +       .       ID=g3
scaffold10x_1000_pilon  AUGUSTUS        transcript      106074  106640  1       +       .       ID=g3.t1;Parent=g3
scaffold10x_1000_pilon  AUGUSTUS        start_codon     106074  106076  .       +       0       Parent=g3.t1
scaffold10x_1000_pilon  AUGUSTUS        CDS     106074  106640  1       +       0       ID=g3.t1.cds;Parent=g3.t1
scaffold10x_1000_pilon  AUGUSTUS        stop_codon      106638  106640  .       +       0       Parent=g3.t1

I have tried to use grep and sed commands in linux, could seem to get it done.then resolved to python( I'm new to all these)

speaking of python, I'm trying to read the file in tabs, then kind of use index base on columns, work with column 3 and 9, which I could index as data[2] and data[8].

Here is what I wrote, I know it may not work like that though, just my idea, I'm a bit new to python too.

data = open("my my_bad_gff", 'r')
new_file = ''
for line in data:

    columns = line.rstrip("\n").split("\t")

    scaffold = columns[0]
    source = columns[1]
    feature = columns[2]
    start = columns[3]
    end = columns[4]
    score = columns[5]
    strand = columns[6]
    frame = columns[7]
    attribute = columns[8]

    if feature == 'gene':  #Im trying to take the row called gene, and assign its content as x, which is g1 in this case
        a = str(columns[8])
        b = 'ID='+ a #which I think should give me ID=g1
    if feature == 'transcript':
        columns[8] = a + '.t1' + ';Parent=' + a  # hopint it gives me ID=g1.t1;Parent=g1, but how can i make sure '.t1' is not fixes, since transcript number can ncrease for each gene

    if feature == 'intron' and 'start_codon' and 'stop_codon':
        columns [8] = 'Parent=' + a + '.t1'# should give me Parent=g1.t1
        d = columns [8]
    if feature == 'CDS':
        columns[8] = a + '.t1' + 'cds;' + d #hoping this gives me ID=g1.t1.cds;Parent=g1.t1
        new_file.append(data)

would appreciate any idea that could help me. thanks

python augustus braker • 2.1k views

ADD COMMENT • link 7.3 years ago by Kay • 0

0

Entering edit mode

There is no way to activate the option --gff3=yes of augustus ? That would be convenient...

Otherwise as mentioned in the other post (C: GFF3 file incompatibilities (source: BRAD)) you could use the gxf_to_gff3.pl script from here: https://github.com/NBISweden/GAAS/tree/master/annotation/Tools/bin but you will have to install bioperl and some perl modules. Here are the explanations: https://github.com/NBISweden/GAAS/tree/master/annotation/Tools/Util/gff

Otherwise I'm sure someone will give you a nice bash command to do that in a brute approach.

ADD REPLY • link 7.3 years ago by Juke34 9.3k

0

Entering edit mode

if I run Augustus as a stand alone with the --gff3=on or =yes, not sure which now, that works, in fact, that's what im trying to confirm my output in this question to. But in Braker ( uses Genemark and Augustus), not sure I can do the --gff3 on option

ADD REPLY • link 7.3 years ago by Kay • 0

0

Entering edit mode

Truth is, I'm just still in the active learning phase regarding bioinformatics. I have checked the link above. is it possible to clone directly into the GAAS-maser using a wget command line.

I would try out the Perl option too

ADD REPLY • link 7.3 years ago by Kay • 0

0

Entering edit mode

to clone the repository: git clone https://github.com/NBISweden/GAAS.git

ADD REPLY • link 7.3 years ago by Juke34 9.3k