Hello all, I need help to select rows from gene table, selecting only longest transcript when multiple are available. For each unique gene ID (gene), I need to extract the row with the longest transcript (length). I would like to keep the whole row because I need the ID of the protein and coordinates for further analysis. See below for a sample of my table, with data extracted from Ensembl BioMart. For example for ENSGACG00000000027.1 i would keep only this row with biggest length (3024) (ENSGACG00000000027.1 ENSGACP00000000041.1 scaffold_89 278309 290469 3024).
The whole table has about 60K lines so I could really appreciate some help sorting these. I have been trying to find this out but similar questions focus on processing fasta files.
' gene protein chr start end length
ENSGACG00000000022.1 ENSGACP00000000030.1 scaffold_89 230109 238032 2415
ENSGACG00000000022.1 ENSGACP00000000031.1 scaffold_89 230109 238032 1284
ENSGACG00000000023.1 ENSGACP00000000032.1 scaffold_89 240752 246187 864
ENSGACG00000000024.1 ENSGACP00000000033.1 scaffold_89 263731 273624 3684
ENSGACG00000000025.1 ENSGACP00000000034.1 scaffold_89 275261 277377 780
ENSGACG00000000026.1 ENSGACP00000000035.1 scaffold_1045 508 2074 678
ENSGACG00000000026.1 ENSGACP00000000036.1 scaffold_1045 508 2074 810
ENSGACG00000000027.1 ENSGACP00000000037.1 scaffold_89 278309 290469 1289
ENSGACG00000000027.1 ENSGACP00000000038.1 scaffold_89 278309 290469 1305
ENSGACG00000000027.1 ENSGACP00000000041.1 scaffold_89 278309 290469 3024
is this what you are looking for?
Thanks that works well and very fast even for the big files. One question though. When I save the output in a file, like below, and open it in excel, the last column (new column after datamash), shows up as an extra row under each original row. This doesnt seem to be an issue when I grep rows or columns through the command line. But wondering if there's better way to save stdout. There was no info in the datamash help page about this
I checked the output on windows 10 with excel (office) 2019 home and student version and not able to reproduce the issue you are reporting. Open the file in an text editor and see if there are any issues. It would also help if you could post a screenshot.
Are you open to all programming languages/methods to get this? Are you after the longest in terms of cDNA length, protein length or genomic length (including introns)?
I m happy to try most programming solutions, preferable bash or R, but could also try python/perl if you have it. Would that work?
The principle with these kinds of selections is always the same. You sort this file by decreasing length (here that would be column 6), so longer entries rank higher, and then starting from top to bottom you only keep the unique gene names. Since the file is sorted by length you will for every gene (or whatever you select) pick up the longest per unique name. Please try something, this is great learning experience.