Hello everyone,
I have several proteomes from BRAKER3 annotations and want to run OrthoFinder on them. One suggestion for making the analysis faster and more accurate is to keep only the longest transcript variant per gene. OrthoFinder provides a script for this, but it only works on files downloaded from Ensembl. Does anyone know a tool that can do this for BRAKER output?
Basically I have these files for each species:
braker.aa
braker.codingseq
braker.gff3
My protein file looks like this:
head -n 2000 braker.aa
>g176.t1
MTKLTKRLELQMESSRLGLLRSHSRARSSKLASSQSKADGPPAEPEDAAKDATKDAAKEE
KQEKSRSWFSPRRKDSVASTILRSSKSFRRLSAVSSAASSQAAISTPTTPSFSRGSYESD
AAFHHRKLSSVESGEAPLSEPAPTDPPSIPYPPHRPERPQGDLFEGTPLEKANQLEKANQ
FEKANQLEKANQPSQRPSRSRSILRPFSPFPGPSIRDLRRLSQHSEHKPQANNAADTATT
TASKANRLSEEAEQVHEPADVAMTPASPKPSTYSVLSAPAPIVPNRRASLRPSSKGSETG
LARKSSIASSFHFSIHRRPSHVTAEIESRKHARSSRWTLTENMAEMFKGQHIKTDKQAMT
PSQIEAIWNGQDNGGAAAAAAKQKQKKSKERRMKSASDTSASIPRSFGGPLTEQQTNMFS
EPFQWPDKMSSPVPMSRADIKMRALPFEITVPPPPSTILVSPEIIHGDVSPKSPTKIDRR
HMERQSVPQLVLPDTEDAAIEDDDDAASGSSIPIPPKNPARFVARAPTMPLLPPILEGFR
SPPGSNRSSNISSHRRSRSCNHASEVKDDMITFTSTPYTMANPSFRHGPIVLSDAGSRDS
VVTAEVVEEEPRDVDWTAFQTAILGGSNGDLEGLFPEEPIPADEEEGKMAEDVTTWFEGF
GFETHGELIASSEKSSEKKSDRSTQRDSAGSMTSAASTPSTVQTEAEAELQTPVTIPQHQ
NIFDTIKQLRDRCESTYSASIYTTDSAEGPWAAAGVDGEDGRKEIDELAPTSVASKQGME
HGLEAFLGFTIDDSY*
>g176.t2
MERQSVPQLVLPDTEDAAIEDDDDAASGSSIPIPPKNPARFVARAPTMPLLPPILEGFRS
PPGSNRSSNISSHRRSRSCNHASEVKDDMITFTSTPYTMANPSFRHGPIVLSDAGSRDSV
VTAEVVEEEPRDVDWTAFQTAILGGSNGDLEGLFPEEPIPADEEEGKMAEDVTTWFEGFG
FETHGELIASSEKSSEKKSDRSTQRDSAGSMTSAASTPSTVQTEAEAELQTPVTIPQHQN
IFDTIKQLRDRCESTYSASIYTTDSAEGPWAAAGVDGEDGRKEIDELAPTSVASKQGMEH
GLEAFLGFTIDDSY*
How to extract the longest isoform from multi fasta file
How do you extract longest transcript from fasta file using length and header IDs?
Thanks so much
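For reference, here is a minimal Python sketch of what I am trying to achieve, in case it helps frame the question. It assumes that every header in braker.aa follows the `gene.tN` pattern seen above (e.g. `g176.t1` and `g176.t2` both belong to gene `g176`) and that "longest" means the most amino acids after dropping the trailing `*`. The function names (`read_fasta`, `gene_id`, `longest_isoforms`, `write_fasta`) are my own, not part of OrthoFinder or BRAKER.

```python
import sys
from typing import Dict, Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            # Keep only the first word of the header as the ID.
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(transcript_id: str) -> str:
    """Assumption: IDs look like 'g176.t1'; the gene is the part before '.t'."""
    return transcript_id.rsplit(".t", 1)[0]


def longest_isoforms(records: Iterable[Tuple[str, str]]) -> Dict[str, Tuple[str, str]]:
    """Map each gene to its longest (transcript_id, sequence); first one wins ties."""
    best: Dict[str, Tuple[str, str]] = {}
    for tid, seq in records:
        seq = seq.rstrip("*")  # drop the stop-codon marker before comparing lengths
        gid = gene_id(tid)
        if gid not in best or len(seq) > len(best[gid][1]):
            best[gid] = (tid, seq)
    return best


def write_fasta(best: Dict[str, Tuple[str, str]], out: TextIO, width: int = 60) -> None:
    """Write the selected isoforms as FASTA, wrapped at `width` characters."""
    for tid, seq in best.values():
        out.write(f">{tid}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")


# Usage: python longest_isoform.py braker.aa braker.longest.aa
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fin, open(sys.argv[2], "w") as fout:
        write_fasta(longest_isoforms(read_fasta(fin)), fout)
```

On the two records shown above this would keep g176.t1 and drop g176.t2. The same set of kept IDs could then be used to filter braker.codingseq, provided its headers use the same transcript IDs.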