Hi,
How can I accurately generate a GFF file from a protein (or CDS) FASTA file and a genome assembly FASTA file? I'm wondering if this is even possible. The genome has around 33,000 genes.
My protein fasta file:
>Lup000001.1 locus=Scaffold_1:18368:20288:- [translate_table: standard]
METEERNQRGLKGSEPELFLQWGNRKRLRCVRLKDPRISSRLNGGIRKKL
TVAPSGVTVLEKEGSHLHHQQQPNRFTRNSDGSVHRSAAVDNRKSTSPEK
EDRYYTTRGSSVVADESHSKLTGDREERALVWPKLYITLSSKEKEEDFLA
MKGCKLPHRPKKRAKIIQRSLLLVSPGAWLTDMCQERYEVREKKSNKKRP
RGLKAMGSMESDSE
>Lup000002.1 locus=Scaffold_1:58782:58961:+ [translate_table: standard]
MEALNMKVFLALMVAMLVMAATSVSAAEAPAPSPTSDATTLFIPTAFASL
IALAFGLLF
My CDS fasta file:
>Lup000001.1 locus=Scaffold_1:18368:20288:-
ATGGAAACAGAAGAGAGGAACCAGAGAGGGTTAAAAGGCTCAGAGCCAGA
GCTTTTCTTGCAGTGGGGAAACAGAAAGAGACTGAGATGTGTGAGGCTTA
AGGACCCTCGGATTTCATCAAGACTCAACGGTGGGATCAGAAAAAAGCTC
ACTGTTGCTCCTTCTGGAGTTACTGTTTTGGAGAAAGAAGGTTCTCATCT
TCACCATCAACAACAACCTAATCGTTTCACAAGGAATTCTGATGGTTCTG
TTCACCGGTCAGCCGCTGTAGATAATCGGAAATCAACTTCACCGGAGAAG
GAAGACCGGTACTACACCACAAGGGGATCGTCGGTGGTAGCGGATGAGAG
CCACAGCAAACTCACTGGTGACAGAGAAGAAAGAGCGCTTGTGTGGCCAA
AGCTTTACATCACCCTTTCAAGCAAGGAGAAAGAAGAAGATTTTCTTGCC
ATGAAAGGTTGCAAGCTTCCCCATAGACCCAAAAAGAGGGCCAAAATTAT
CCAAAGAAGCTTACTTTTGGTGAGTCCTGGAGCATGGTTAACTGATATGT
GCCAAGAGAGATATGAAGTTAGGGAGAAGAAAAGTAACAAGAAGAGGCCA
AGAGGATTGAAGGCAATGGGGAGTATGGAAAGTGATTCTGAATGA
>Lup000002.1 locus=Scaffold_1:58782:58961:+
ATGGAGGCATTGAACATGAAGGTTTTCTTGGCTTTGATGGTAGCCATGTT
GGTGATGGCAGCAACAAGTGTGTCAGCTGCTGAGGCACCAGCTCCAAGCC
CTACATCTGATGCTACCACTCTTTTCATTCCAACTGCTTTTGCTTCTCTC
ATTGCTCTTGCATTTGGGCTTCTCTTTTGA
My genome assembly fasta file:
>NLL-01
TACTGGTCCGAAAGGGCATGGGTTCGAATCCCATTCTTGACATTACATTTTATTTTCTAAATCAAAAACATTGCTATCCATGTTACATTGACTTGTTTGA
CAAAATTGTCAGTTGCTCTATTTCAAAATAATTTTCTAGTTAGACAATAAAAAATTGTTTGAAAATATTTTCGTTACAATATAAAATGATAAACTTTTAA
ATTTTAATAATTTCTTATGAAGATAATAAATTGTGGAACGTGCATGGAAAAGTGAAATGGATGGATGAGGATTTAATGTTTTATTAAATGCATGGAAGGA
GGGCGTTGAATTCATAATCTGACACAGTACTGTGAAAACTGAAAAGCCTATGCAAGTTAGTAGATTCGACGCATTTATGACAAATAAATTCCTTCAACTT
TACCAAGTACATGGAATAAAAAAAGTAATAAAAGTAAATAATGAATAATATAATATATAATATAAATGTTTATAGTAAAAAAATAGGGAATGGATAGAAT
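The protein and CDS headers above already carry a locus (scaffold, start, end, strand), so a gene-level GFF3 line could in principle be derived from them directly. Exon and CDS structure would still need a spliced alignment against the genome. Below is a minimal Python sketch, assuming the header layout shown above and assuming the coordinates are 1-based and inclusive (not stated anywhere in the files). The function name `header_to_gff3` and the source label `header_locus` are hypothetical, chosen only for illustration.

```python
import re

# Matches headers like:
#   >Lup000001.1 locus=Scaffold_1:18368:20288:- [translate_table: standard]
HEADER_RE = re.compile(r"^>(\S+)\s+locus=([^:\s]+):(\d+):(\d+):([+-])")


def header_to_gff3(header, source="header_locus"):
    """Turn one FASTA header into a gene-level GFF3 line, or None if no locus.

    Assumption: coordinates in the header are 1-based and inclusive,
    which is what GFF3 expects. Adjust if the files use another convention.
    """
    match = HEADER_RE.match(header)
    if match is None:
        return None
    seq_id, scaffold, start, end, strand = match.groups()
    # GFF3 requires start <= end regardless of strand.
    start, end = sorted((int(start), int(end)))
    # Columns: seqid, source, type, start, end, score, strand, phase, attributes
    fields = [scaffold, source, "gene", str(start), str(end),
              ".", strand, ".", f"ID={seq_id}"]
    return "\t".join(fields)


def fasta_headers_to_gff3(lines):
    """Yield GFF3 lines for every header in an iterable of FASTA lines."""
    yield "##gff-version 3"
    for line in lines:
        if line.startswith(">"):
            record = header_to_gff3(line.strip())
            if record is not None:
                yield record
```

Note that this only works if the scaffold names in the headers (`Scaffold_1`) match the sequence names in the assembly; the sample shown here starts with `>NLL-01`, so the names may need to be reconciled first.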
Thank you,
Rom
That's wonderful! I am wondering how accurate Miniprot would be on a plant genome with many duplicated regions, where around 16% of genes are duplicated into copies with similar but not identical sequences. Thanks for your help.