I have the reference annotation file for Arabidopsis thaliana, and I would like to identify the transcript that codes for the longest protein isoform and then extract the coordinates of that transcript. For example, gene AT1G01020 contains 6 transcripts (AT1G01020.1, AT1G01020.2, AT1G01020.3, AT1G01020.4, AT1G01020.5, AT1G01020.6). How can I identify the transcript that codes for the longest protein and extract its coordinates?
Does it depend on the number of exons, the CDS regions, or the length of the exons?
I am able to select the protein-coding transcripts, but how can I select the transcript that codes for the longest protein? Secondly, if multiple transcripts of variable length code for proteins of the same length, which transcript should I select? For example, gene AT2G27490 contains 4 transcripts of variable length, but all of them code for a protein of 232 aa, so which one should I select?
You select the largest one from the table. If you need to decide automatically, then you need to write some code to query and filter your selection. Which one to use when several have the same length is a question you have to answer yourself, based on what you are trying to do with that information.
The longest transcript doesn't necessarily code for the longest protein, as it can also contain retained introns or parts of introns. How can I find the transcript with the longest protein-coding region?
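By CDS (coding sequence) length: sum the lengths of all CDS features that belong to each transcript and compare the totals per gene.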
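To make the "by CDS length" suggestion concrete, here is a minimal Python sketch. It assumes a GFF3 file in which transcript features are typed `mRNA` with `ID` and `Parent` attributes and CDS features point to their transcript through `Parent`; check this against your own annotation file. The function name `longest_cds_transcripts` and the tie-break rule (keep the transcript whose ID sorts first) are illustrative choices of mine, not part of any existing tool.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def longest_cds_transcripts(gff_lines, transcript_types=("mRNA",)):
    """Return, per gene, the transcript with the largest summed CDS length.

    Assumption: ties are broken arbitrarily by keeping the transcript
    whose ID sorts first as a string; replace this with a rule that
    suits your analysis.
    """
    transcripts = {}               # transcript ID -> (gene, seqid, start, end, strand)
    cds_length = defaultdict(int)  # transcript ID -> total CDS length in nt

    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        seqid, _, ftype, start, end, _, strand, _, attr_field = cols
        start, end = int(start), int(end)
        attrs = parse_attributes(attr_field)

        if ftype in transcript_types and "ID" in attrs:
            gene = attrs.get("Parent", attrs["ID"]).split(",")[0]
            transcripts[attrs["ID"]] = (gene, seqid, start, end, strand)
        elif ftype == "CDS":
            # A CDS line may list several parents separated by commas.
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    cds_length[parent] += end - start + 1  # GFF is 1-based, inclusive

    best = {}
    for tid in sorted(transcripts):
        gene, seqid, start, end, strand = transcripts[tid]
        length = cds_length.get(tid, 0)
        if length == 0:
            continue  # no CDS recorded for this transcript
        if gene not in best or length > best[gene]["cds_length"]:
            best[gene] = {
                "transcript": tid,
                "seqid": seqid,
                "start": start,
                "end": end,
                "strand": strand,
                "cds_length": length,
            }
    return best


if __name__ == "__main__":
    import sys

    # Usage: python longest_cds.py annotation.gff3
    with open(sys.argv[1]) as handle:
        for gene, hit in sorted(longest_cds_transcripts(handle).items()):
            print(gene, hit["transcript"], hit["seqid"], hit["start"],
                  hit["end"], hit["strand"], hit["cds_length"], sep="\t")
```

Dividing the summed CDS length by 3 gives the approximate protein length in codons; whether the stop codon is counted depends on the annotation format. For a case like AT2G27490, where several transcripts have the same CDS length, the sketch simply keeps the first ID, so swap in your own criterion if that matters for your analysis.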