I have 5 columns of data in csv format, the data title includes Organism, Accession_number, GeneID, Chromosome,and Protein_length, if different accession_numbers have same geneID, the line or row with longest protein length will be retained. Although there are many items and great answers about selecting longest isoforms: How to extract the longest isoform from multi fasta file how to select longest isoform per gene in gtf How to select the best isoform for a differential expressed gene in Trinity? select longest isoform from RefSeq proteome fasta longest transcript NCBI NM_ids most of them deal with one species. Here I want to retain the longest isoforms of some homologous from different species. For example, all proteins with the same GeneID 1636 in homo sapiens, which I will retain the line or row with longest Protein length of 1306 amino acids from, so does the Mus musculus with GeneID 11421, the line with longest Protein length of 1312 will be preserved, the rest of the table including Rattus norvegicus and Drosophila melanogaster will be kept it as it is, because each of them has unique GeneID. This is the example data:
Organism Accession_number GeneID Chromosome Protein_length
Homo_sapiens XP_006721800 1636 17 752
Homo_sapiens NP_690043 1636 17 732
Homo_sapiens NP_001369629 1636 17 1117
Homo_sapiens NP_001369630 1636 17 1022
Homo_sapiens NP_000780 1636 17 1306
Homo_sapiens NP_001171528 1636 17 691
Homo_sapiens NP_001369631 1636 17 511
Mus_musculus NP_033728 11421 11 732
Mus_musculus NP_997507 11421 11 1312
Mus_musculus NP_001268748 11421 11 1249
Rattus_norvegicus NP_036676 24310 10 1313
Drosophila_melanogaster NP_001260258 34189 2L 630
Drosophila_melanogaster NP_001285915 34805 2L 615
Drosophila_melanogaster NP_788042 34806 2L 611
the final result I want is like:
Organism Accession_number GeneID Chromosome Protein_length
Homo_sapiens NP_000780 1636 17 1306
Mus_musculus NP_997507 11421 11 1312
Rattus_norvegicus NP_036676 24310 10 1313
Drosophila_melanogaster NP_001260258 34189 2L 630
Drosophila_melanogaster NP_001285915 34805 2L 615
Drosophila_melanogaster NP_788042 34806 2L 611
Anyone can help me? Thanks in advance.
Sort decreasingly by
Protein_length
and then deduplicate byGeneID
, so only take the first uniqueGeneID
, this can e.g. by done bysort -u
with the proper column in the-k
option. Try it out, great exercise to learn.Thanks for kind reply. These days I learned the command sort, and I used
as your suggested to decrease by Protein_length successfully, however, when I use
no changes happened, is there a problem?
use datamash, group by GeneID, print maximum value for 5th column.
Many thanks. I am learning the amazing command datamash, and run it successfully, however, when I use
it ignored other columns, only left the deduplicated column 3 and their conresponding max value columns, I want all the columns there. Any other detailed suggestions? Thanks.
try
-f
optionIt is awesome ! I use
datamash --sort --header-out groupby 3 max 5 -f < aa.txt > ab.txt
it will creat all columns data including deduplicated column 3 and their conresponding max value columns, the header will be retained but changed name tofield-1 field-2 field-3 field-4 field-5 max(field-5)
. Nevertheless, I have already obtained the desired result, thanks!