Hi, I tried to lift the below TAIR10 annotation with flo:
> head TAIR10_GFF3_genes.gff
Chr1 TAIR10 chromosome 1 30427671 . . . ID=Chr1;Name=Chr1
Chr1 TAIR10 gene 3631 5899 . + . ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1 TAIR10 mRNA 3631 5899 . + . ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1 TAIR10 protein 3760 5630 . + . ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1 TAIR10 exon 3631 3913 . + . Parent=AT1G01010.1
Chr1 TAIR10 five_prime_UTR 3631 3759 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 3760 3913 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR10 exon 3996 4276 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 3996 4276 . + 2 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR10 exon 4486 4605 . + . Parent=AT1G01010.1
Next, I did
> gff_remove_feats.rb chromosome TAIR10_GFF3_genes.gff > TAIR10_GFF3_genes-fix1.gff |head
Chr1 TAIR10 gene 3631 5899 . + . ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1 TAIR10 mRNA 3631 5899 . + . ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1 TAIR10 protein 3760 5630 . + . ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1 TAIR10 exon 3631 3913 . + . Parent=AT1G01010.1
Chr1 TAIR10 five_prime_UTR 3631 3759 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 3760 3913 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR10 exon 3996 4276 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 3996 4276 . + 2 Parent=AT1G01010.1,AT1G01010.1-Protein;
Chr1 TAIR10 exon 4486 4605 . + . Parent=AT1G01010.1
Chr1 TAIR10 CDS 4486 4605 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
While running flo I got:
> mkdir run/TAIR10_GFF3_genes-fix1
liftOver -gff /QRISdata/Q0231/flo/tair10/TAIR10_GFF3_genes-fix1.gff run/liftover.chn run/TAIR10_GFF3_genes-fix1/lifted.gff3 run/TAIR10_GFF3_genes-fix1/unlifted.gff3
Reading liftover chains
Mapping coordinates
WARNING: -gff is not recommended.
Use 'ldHgGene -out=<file.gp>' and then 'liftOver -genePred <file.gp>'
/QRISdata/Q0231/apps/flo/gff_recover.rb run/TAIR10_GFF3_genes-fix1/lifted.gff3 2> unprocessed.gff | gt gff3 -tidy -sort -addids -retainids - > run/TAIR10_GFF3_genes-fix1/lifted_cleaned.gff
warning: line 1 in file "-" does not begin with "##gff-version" or "##gvf-version", create "##gff-version 3" line automatically
gt gff3: error: Parent "AT1G64130.1-Protein" on line 3 in file "-" was not defined (via "ID=")
rake aborted!
Additionally, I tried:
> /QRISdata/Q0231/apps/flo/gff_recover.rb run/TAIR10_GFF3_genes-fix1/lifted.gff3 | head
NbV1Ch05 TAIR10 tRNA 45037019 45037091 . + . ID=AT1G01890.1;Parent=AT1G01890;Name=AT1G01890.1;Index=1
NbV1Ch11 TAIR10 tRNA 93127111 93127183 . - . ID=AT1G02480.1;Parent=AT1G02480;Name=AT1G02480.1;Index=1
NbV1Ch02 TAIR10 tRNA 81869336 81869407 . + . ID=AT1G02600.1;Parent=AT1G02600;Name=AT1G02600.1;Index=1
NbV1Ch05 TAIR10 tRNA 97695952 97696024 . + . ID=AT1G02760.1;Parent=AT1G02760;Name=AT1G02760.1;Index=1
NbV1Ch05 TAIR10 tRNA 9913146 9913217 . - . ID=AT1G03515.1;Parent=AT1G03515;Name=AT1G03515.1;Index=1
NbV1Ch13 TAIR10 tRNA 170955340 170955411 . - . ID=AT1G03570.1;Parent=AT1G03570;Name=AT1G03570.1;Index=1
NbV1Ch15 TAIR10 tRNA 91988482 91988554 . + . ID=AT1G03640.1;Parent=AT1G03640;Name=AT1G03640.1;Index=1
NbV1Ch19 TAIR10 tRNA 17742781 17742849 . + . ID=AT1G04320.1;Parent=AT1G04320;Name=AT1G04320.1;Index=1
NbV1Ch15 TAIR10 tRNA 50880103 50880176 . + . ID=AT1G06480.1;Parent=AT1G06480;Name=AT1G06480.1;Index=1
NbV1Ch19 TAIR10 tRNA 5563896 5563968 . + . ID=AT1G06610.1;Parent=AT1G06610;Name=AT1G06610.1;Index=1
...
NbV1Ch05 TAIR10 tRNA 48760991 48761061 . + . ID=AT5G66817.1;Parent=AT5G66817;Name=AT5G66817.1;Index=1
NbV1Ch13 TAIR10 tRNA 39812638 39812709 . + . ID=AT5G67455.1;Parent=AT5G67455;Name=AT5G67455.1;Index=1
NbV1Ch06 TAIR10 gene 24097055 24097111 . + . ID=AT1G64130.1
NbV1Ch06 TAIR10 exon 24097055 24097111 . + . Parent=AT1G64130.1
NbV1Ch06 TAIR10 CDS 24097055 24097111 . + 0 Parent=AT1G64130.1,AT1G64130.1-Protein
NbV1Ch17 TAIR10 gene 18625243 18625301 . - . ID=AT2G07768.1
NbV1Ch17 TAIR10 exon 18625243 18625301 . - . Parent=AT2G07768.1
NbV1Ch17 TAIR10 CDS 18625243 18625301 . - 0 Parent=AT2G07768.1,AT2G07768.1-Protein
NbV1Ch17 TAIR10 gene 70101151 70101187 . - . ID=AT5G20570.1
NbV1Ch17 TAIR10 CDS 70101151 70101187 . - 2 Parent=AT5G20570.1,AT5G20570.1-Protein
NbV1Ch17 TAIR10 gene 70101151 70101187 . - . ID=AT5G20570.2
NbV1Ch17 TAIR10 CDS 70101151 70101187 . - 2 Parent=AT5G20570.2,AT5G20570.2-Protein
UPDATE I used this Python script provided by Philipp Bayer below:
python3 rm_rubbish.py > TAIR10_GFF3_genes-fix1.rm_rubbish.gff
and rerun flo again but I got this error:
warning: line 1 in file "-" does not begin with "##gff-version" or "##gvf-version", create "##gff-version 3" line automatically
gt gff3: error: Parent "AT1G64130.1-Protein" on line 3 in file "-" was not defined (via "ID=")
rake aborted!
Command failed with status (1): [/QRISdata/Q0231/apps/flo/gff_recover.rb ru...]
/QRISdata/Q0231/apps/flo/Rakefile:60:in `block (2 levels) in <top (required)>'
/QRISdata/Q0231/apps/flo/Rakefile:40:in `each'
/QRISdata/Q0231/apps/flo/Rakefile:40:in `block in <top (required)>'
Tasks: TOP => default
(See full trace by running task with --trace)
Additionally, I did:
/QRISdata/Q0231/apps/flo/gff_recover.rb run/TAIR10_GFF3_genes-fix1.rm_rubbish/lifted.gff3 | head
NbV1Ch03 TAIR10 three_prime_UTR 74669255 74669261 . - . Parent=AT3G06895.1
NbV1Ch02 TAIR10 five_prime_UTR 136562516 136562579 . - . Parent=AT3G13857.1
NbV1Ch06 TAIR10 gene 24097055 24097111 . + . ID=AT1G64130.1
NbV1Ch06 TAIR10 exon 24097055 24097111 . + . Parent=AT1G64130.1
NbV1Ch06 TAIR10 CDS 24097055 24097111 . + 0 Parent=AT1G64130.1,AT1G64130.1-Protein
NbV1Ch17 TAIR10 gene 18625243 18625301 . - . ID=AT2G07768.1
NbV1Ch17 TAIR10 exon 18625243 18625301 . - . Parent=AT2G07768.1
NbV1Ch17 TAIR10 CDS 18625243 18625301 . - 0 Parent=AT2G07768.1,AT2G07768.1-Protein
NbV1Ch17 TAIR10 gene 70101151 70101187 . - . ID=AT5G20570.1
NbV1Ch17 TAIR10 CDS 70101151 70101187 . - 2 Parent=AT5G20570.1,AT5G20570.1-Protein
NbV1Ch17 TAIR10 gene 70101151 70101187 . - . ID=AT5G20570.2
NbV1Ch17 TAIR10 CDS 70101151 70101187 . - 2 Parent=AT5G20570.2,AT5G20570.2-Protein
What did I miss?
Thank you in advance,
What does the output of
say?
I've just added the output into my question.
One error is that
TAIR10_GFF3_genes.gff
does not seem to be a validGFF3
file as it is missing the header, see https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md . Not sure what to say about the errorgt gff3: error: Parent "AT1G64130.1-Protein" on line 3 in file "-" was not defined (via "ID=")
. Where didTAIR10_GFF3_genes.gff
come from?The header missing is not an error here, it's just a warning, the tool add one automatically when it is missing.