Longest transcript variant per gene
1
0
Entering edit mode
7 months ago
sansan96 ▴ 140

Hello everyone,

I am running orthofinder on several proteomes that I obtained after an annotation with BRAKER3 and I want to perform an analysis with orthofinder. A suggestion to make the analysis faster and more precise is to extract the longest transcript variant per gene. Orthofinder provides a script for this but it only applies to files downloaded from Ensembl. Does anyone know a tool that can help me with this?

Basically I have these files for each species:

braker.aa
braker.codingseq
braker.gff3

My protein file looks like this:

head -n 2000 braker.aa
>g176.t1
MTKLTKRLELQMESSRLGLLRSHSRARSSKLASSQSKADGPPAEPEDAAKDATKDAAKEE
KQEKSRSWFSPRRKDSVASTILRSSKSFRRLSAVSSAASSQAAISTPTTPSFSRGSYESD
AAFHHRKLSSVESGEAPLSEPAPTDPPSIPYPPHRPERPQGDLFEGTPLEKANQLEKANQ
FEKANQLEKANQPSQRPSRSRSILRPFSPFPGPSIRDLRRLSQHSEHKPQANNAADTATT
TASKANRLSEEAEQVHEPADVAMTPASPKPSTYSVLSAPAPIVPNRRASLRPSSKGSETG
LARKSSIASSFHFSIHRRPSHVTAEIESRKHARSSRWTLTENMAEMFKGQHIKTDKQAMT
PSQIEAIWNGQDNGGAAAAAAKQKQKKSKERRMKSASDTSASIPRSFGGPLTEQQTNMFS
EPFQWPDKMSSPVPMSRADIKMRALPFEITVPPPPSTILVSPEIIHGDVSPKSPTKIDRR
HMERQSVPQLVLPDTEDAAIEDDDDAASGSSIPIPPKNPARFVARAPTMPLLPPILEGFR
SPPGSNRSSNISSHRRSRSCNHASEVKDDMITFTSTPYTMANPSFRHGPIVLSDAGSRDS
VVTAEVVEEEPRDVDWTAFQTAILGGSNGDLEGLFPEEPIPADEEEGKMAEDVTTWFEGF
GFETHGELIASSEKSSEKKSDRSTQRDSAGSMTSAASTPSTVQTEAEAELQTPVTIPQHQ
NIFDTIKQLRDRCESTYSASIYTTDSAEGPWAAAGVDGEDGRKEIDELAPTSVASKQGME
HGLEAFLGFTIDDSY*


>g176.t2
MERQSVPQLVLPDTEDAAIEDDDDAASGSSIPIPPKNPARFVARAPTMPLLPPILEGFRS
PPGSNRSSNISSHRRSRSCNHASEVKDDMITFTSTPYTMANPSFRHGPIVLSDAGSRDSV
VTAEVVEEEPRDVDWTAFQTAILGGSNGDLEGLFPEEPIPADEEEGKMAEDVTTWFEGFG
FETHGELIASSEKSSEKKSDRSTQRDSAGSMTSAASTPSTVQTEAEAELQTPVTIPQHQN
IFDTIKQLRDRCESTYSASIYTTDSAEGPWAAAGVDGEDGRKEIDELAPTSVASKQGMEH
GLEAFLGFTIDDSY*
transcript longest variant orthofinder • 937 views
ADD COMMENT
3
Entering edit mode
7 months ago
Mathew ▴ 180

Hi, why not write code to extract the longest variant?

I copied these two in a txt file called "input.txt", and made three dummy variants of a gene "d163.t1" through "d163.t3", where d163.t3 is the longest of the transcripts. Thus, my output file should contain only g176.t1 (which is the longest of the two you provided), and d163.t3 (Which is the dummy gene I made that is a longer transcript). Here is some code I quickly wrote to solve this:

def extract_longest_transcript(input_file, output_file):
   gene_transcripts = {}
   current_gene = None
   current_transcript = ''
   current_variant = ''
   with open(input_file, 'r') as f:
       for line in f:
           line = line.strip()
           if line.startswith('>'):
               if current_gene:
                   if current_gene in gene_transcripts:
                       existing_transcript, _, _ = gene_transcripts[current_gene]
                       if len(current_transcript) > len(existing_transcript):
                           gene_transcripts[current_gene] = (current_transcript, len(current_transcript), current_variant)
                   else:
                       gene_transcripts[current_gene] = (current_transcript, len(current_transcript), current_variant)
               current_gene = line[1:].split('.')[0]  # Extracting the gene identifier
               current_variant = line[1:].split('.')[1] # Extracting the variant
               current_transcript = ''
           else:
               current_transcript += line

       # Last transcript
       if current_gene:
           if current_gene in gene_transcripts:
               existing_transcript, _, _ = gene_transcripts[current_gene]
               if len(current_transcript) > len(existing_transcript):
                   gene_transcripts[current_gene] = (current_transcript, len(current_transcript), current_variant)
           else:
               gene_transcripts[current_gene] = (current_transcript, len(current_transcript), current_variant)

   # Write longest transcripts to file
   with open(output_file, 'w') as out:
       for gene, (transcript, _, variant) in gene_transcripts.items():
           out.write(f'>{gene}.{variant}\n{transcript}\n')

This results in an output file containing only the longest variants for each. Now of course you have different file types and this code may not translate exactly with the rest of the formatting, but this is just to maybe spark some ideas for you.

Please also see GenoMax's post, who has linked several answers to very similar questions before.

ADD COMMENT
1
Entering edit mode

Hello again,

It works very well, thank you very much for this solution.

ADD REPLY
1
Entering edit mode

I'm glad it worked well for you! Good luck with the rest of your project. I wish you the best.

ADD REPLY
0
Entering edit mode

Thank you very much, success in your projects.

ADD REPLY
0
Entering edit mode

Hello Mathew,

Thanks for your answer, I'll try it and come back.

ADD REPLY

Login before adding your answer.

Traffic: 1773 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6