Question

Extract longest transcript from Ensembl gene table

0

Entering edit mode

3.9 years ago

Midge • 0

Hello all, I need help to select rows from gene table, selecting only longest transcript when multiple are available. For each unique gene ID (gene), I need to extract the row with the longest transcript (length). I would like to keep the whole row because I need the ID of the protein and coordinates for further analysis. See below for a sample of my table, with data extracted from Ensembl BioMart. For example for ENSGACG00000000027.1 i would keep only this row with biggest length (3024) (ENSGACG00000000027.1 ENSGACP00000000041.1 scaffold_89 278309 290469 3024).

The whole table has about 60K lines so I could really appreciate some help sorting these. I have been trying to find this out but similar questions focus on processing fasta files.

' gene  protein chr start   end length
ENSGACG00000000022.1    ENSGACP00000000030.1    scaffold_89 230109  238032  2415
ENSGACG00000000022.1    ENSGACP00000000031.1    scaffold_89 230109  238032  1284
ENSGACG00000000023.1    ENSGACP00000000032.1    scaffold_89 240752  246187  864
ENSGACG00000000024.1    ENSGACP00000000033.1    scaffold_89 263731  273624  3684
ENSGACG00000000025.1    ENSGACP00000000034.1    scaffold_89 275261  277377  780
ENSGACG00000000026.1    ENSGACP00000000035.1    scaffold_1045   508 2074    678
ENSGACG00000000026.1    ENSGACP00000000036.1    scaffold_1045   508 2074    810
ENSGACG00000000027.1    ENSGACP00000000037.1    scaffold_89 278309  290469  1289
ENSGACG00000000027.1    ENSGACP00000000038.1    scaffold_89 278309  290469  1305
ENSGACG00000000027.1    ENSGACP00000000041.1    scaffold_89 278309  290469  3024

bash gene ensembl R • 1.3k views

ADD COMMENT • link 3.9 years ago by Midge • 0

1

Entering edit mode

is this what you are looking for?

$ datamash -sHf -g 1 max 6 < table.txt 

gene    protein chr start   end length  max(length)
ENSGACG00000000022.1    ENSGACP00000000030.1    scaffold_89 230109  238032  2415    2415
ENSGACG00000000023.1    ENSGACP00000000032.1    scaffold_89 240752  246187  864 864
ENSGACG00000000024.1    ENSGACP00000000033.1    scaffold_89 263731  273624  3684    3684
ENSGACG00000000025.1    ENSGACP00000000034.1    scaffold_89 275261  277377  780 780
ENSGACG00000000026.1    ENSGACP00000000036.1    scaffold_1045   508 2074    810 810
ENSGACG00000000027.1    ENSGACP00000000041.1    scaffold_89 278309  290469  3024    3024

ADD REPLY • link 3.9 years ago by cpad0112 21k

0

Entering edit mode

Thanks that works well and very fast even for the big files. One question though. When I save the output in a file, like below, and open it in excel, the last column (new column after datamash), shows up as an extra row under each original row. This doesnt seem to be an issue when I grep rows or columns through the command line. But wondering if there's better way to save stdout. There was no info in the datamash help page about this

datamash -sHf -g 1 max 6 < input.txt > output.txt

ADD REPLY • link 3.9 years ago by Midge • 0

0

Entering edit mode

I checked the output on windows 10 with excel (office) 2019 home and student version and not able to reproduce the issue you are reporting. Open the file in an text editor and see if there are any issues. It would also help if you could post a screenshot.

ADD REPLY • link 3.9 years ago by cpad0112 21k

0

Entering edit mode

Are you open to all programming languages/methods to get this? Are you after the longest in terms of cDNA length, protein length or genomic length (including introns)?

ADD REPLY • link 3.9 years ago by Emily 24k

0

Entering edit mode

I m happy to try most programming solutions, preferable bash or R, but could also try python/perl if you have it. Would that work?

ADD REPLY • link 3.9 years ago by Midge • 0

0

Entering edit mode

The principle with these kinds of selections is always the same. You sort this file by decreasing length (here that would be column 6), so longer entries rank higher, and then starting from top to bottom you only keep the unique gene names. Since the file is sorted by length you will for every gene (or whatever you select) pick up the longest per unique name. Please try something, this is great learning experience.

ADD REPLY • link 3.9 years ago by ATpoint 86k

score 0 · Answer 1 · 2021-02-04

0

Entering edit mode

3.9 years ago

Emily 24k

This looks like a way to do it in R.

ADD COMMENT • link 3.9 years ago by Emily 24k