I have a large file containing genome annotation information for Arabidopsis thaliana, obtained via BioMart. For each gene, I want to extract the transcripts (and their corresponding information) that have the highest CDS length, that is, the transcripts that code for the longest proteins. The file contains 34 columns and n rows; for simplicity, I am pasting only the first 7 columns and the rows for just 2 genes:
Gene stable ID Transcript stable ID Protein stable ID CDS Length Chromosome Gene start Gene end
AT1G01030 AT1G01030.1 AT1G01030.1 1077 1 11649 13714
AT1G01030 AT1G01030.1 AT1G01030.1 1077 1 11649 13714
AT1G01030 AT1G01030.2 AT1G01030.2 1008 1 11649 13714
AT1G01030 AT1G01030.2 AT1G01030.2 1008 1 11649 13714
AT1G01030 AT1G01030.2 AT1G01030.2 1008 1 11649 13714
AT1G01110 AT1G01110.1 AT1G01110.1 1095 1 51953 54737
AT1G01110 AT1G01110.1 AT1G01110.1 1095 1 51953 54737
AT1G01110 AT1G01110.1 AT1G01110.1 1095 1 51953 54737
AT1G01110 AT1G01110.1 AT1G01110.1 1095 1 51953 54737
AT1G01110 AT1G01110.2 AT1G01110.2 1584 1 51953 54737
AT1G01110 AT1G01110.2 AT1G01110.2 1584 1 51953 54737
AT1G01110 AT1G01110.2 AT1G01110.2 1584 1 51953 54737
AT1G01110 AT1G01110.2 AT1G01110.2 1584 1 51953 54737
AT1G01110 AT1G01110.2 AT1G01110.2 1584 1 51953 54737
I am interested only in the transcripts that have the maximum CDS length for their gene, so the desired output should be:
Gene stable ID Transcript stable ID Protein stable ID CDS Length Chromosome Gene start Gene end
AT1G01030 AT1G01030.1 AT1G01030.1 1077 1 11649 13714
AT1G01030 AT1G01030.1 AT1G01030.1 1077 1 11649 13714
AT1G01110 AT1G01110.2 AT1G01110.2 1584 1 51953 54737
AT1G01110 AT1G01110.2 AT1G01110.2 1584 1 51953 54737
AT1G01110 AT1G01110.2 AT1G01110.2 1584 1 51953 54737
AT1G01110 AT1G01110.2 AT1G01110.2 1584 1 51953 54737
AT1G01110 AT1G01110.2 AT1G01110.2 1584 1 51953 54737
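
One way to express the selection described above is a two-pass filter: first find the maximum CDS length per gene, then keep every row whose CDS length equals that maximum. The following is only a minimal sketch using the Python standard library. It assumes the export is tab-separated and that the column names match the header shown above; the function name `longest_cds_rows` and the in-memory sample are hypothetical and stand in for the real 34-column file. Rows with an empty CDS length (for example non-coding transcripts) are skipped, which is also an assumption.

```python
import csv

def longest_cds_rows(lines, gene_col="Gene stable ID",
                     cds_col="CDS Length", delimiter="\t"):
    """Keep, for each gene, only the rows whose CDS length is the gene's maximum.

    `lines` is any iterable of text lines (an open file works too).
    Assumption: the first line is a header containing gene_col and cds_col.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    # Assumption: rows without a CDS length are dropped.
    rows = [row for row in reader if (row.get(cds_col) or "").strip()]

    # Pass 1: maximum CDS length seen for each gene.
    max_cds = {}
    for row in rows:
        gene = row[gene_col]
        length = int(row[cds_col])
        if length > max_cds.get(gene, -1):
            max_cds[gene] = length

    # Pass 2: keep rows matching their gene's maximum, in the original order.
    kept = [row for row in rows if int(row[cds_col]) == max_cds[row[gene_col]]]
    return reader.fieldnames, kept

# Sample rebuilt from the rows pasted in the question (first 7 columns only).
columns = ["Gene stable ID", "Transcript stable ID", "Protein stable ID",
           "CDS Length", "Chromosome", "Gene start", "Gene end"]
records = (
    [["AT1G01030", "AT1G01030.1", "AT1G01030.1", "1077", "1", "11649", "13714"]] * 2
    + [["AT1G01030", "AT1G01030.2", "AT1G01030.2", "1008", "1", "11649", "13714"]] * 3
    + [["AT1G01110", "AT1G01110.1", "AT1G01110.1", "1095", "1", "51953", "54737"]] * 4
    + [["AT1G01110", "AT1G01110.2", "AT1G01110.2", "1584", "1", "51953", "54737"]] * 5
)
sample_lines = ["\t".join(fields) for fields in [columns] + records]

header, kept = longest_cds_rows(sample_lines)
print("\t".join(header))
for row in kept:
    print("\t".join(row[name] for name in header))
```

For the real file, an open file handle could be passed in place of `sample_lines`, for example `open(path, newline="")` with a path of your own. Note that if two transcripts of the same gene tie for the maximum CDS length, this sketch keeps the rows of both.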