How to remove duplicate rows and keep highest values only for a gene list with scores
6.8 years ago
lessismore ★ 1.4k

Dear all,

I have a two-column table of 20k lines. The first column is a list of gene IDs (IDs can be duplicated); the second column is a value.
I want to rank my list so that only unique gene IDs remain. For duplicated gene IDs, I want to keep only the row with the highest score.

Here is an example. Thanks in advance.

TMCS09g1008699  6.4
TMCS09g1008671  6.4
TMCS09g1008672  6.5
TMCS09g1008673  6
TMCS09g1008674  5.4
TMCS09g1008675  5.4
TMCS09g1008676  4.9
TMCS09g1008677  4.6
TMCS09g1008677  4.4
TMCS09g1008679  4.3
TMCS09g1008680  3.9
TMCS09g1008681  3.8
TMCS09g1008682  3.6
TMCS09g1008683  3.5
TMCS09g1008684  3.5
TMCS09g1008685  3.4
TMCS09g1008686  3.4
TMCS09g1008687  3.4
TMCS09g1008688  3
TMCS09g1008689  2.6
TMCS09g1008690  2
TMCS09g1008699  5.9
bash R • 14k views
6.8 years ago
arta ▴ 670

Suppose df is your data frame, id is the first column, and value is the second:

df <- df[order(df$id, -abs(df$value)), ]  ### sort by id, then by descending absolute value
df <- df[!duplicated(df$id), ]  ### keep the first row per id, i.e. the highest
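
For readers without R at hand, the same sort-then-keep-first idea can be sketched in Python. This is only an illustration and not part of the answer above: the function name `keep_highest_sorted` and the in-memory list of (id, value) pairs are assumptions, and it sorts by the signed value rather than by abs().

```python
from itertools import groupby
from operator import itemgetter


def keep_highest_sorted(rows):
    """Sort by id, then by value descending, and keep the first row per id.

    `rows` is an iterable of (gene_id, value) pairs; the name and input
    shape are hypothetical, chosen only for this sketch.
    """
    # Sort first: ids ascending, and within one id the largest value first.
    ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
    # Keep highest: the first row of each id group is the one we want.
    return [next(group) for _, group in groupby(ordered, key=itemgetter(0))]


# A few rows taken from the example in the question.
rows = [
    ("TMCS09g1008699", 6.4),
    ("TMCS09g1008677", 4.6),
    ("TMCS09g1008677", 4.4),
    ("TMCS09g1008699", 5.9),
]
for gene_id, value in keep_highest_sorted(rows):
    print(gene_id, value, sep="\t")
```

Note that the result comes back ordered by gene ID, not by score, just as with the R snippet.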

There's a mistake here: it takes the lowest.


Can you try it again?


When I put it into a script this way:

test = commandArgs(trailingOnly=TRUE)

read.delim(test, header = FALSE, sep ="\t")

test[order(test$V1, -abs(test$V2) ), ] ### sort first
test[ !duplicated(test$V1), ]  ### Keep highest

I get partial output and then this error:

Error in test$V1 : $ operator is invalid for atomic vectors
Calls: order
Execution halted

Do you know what that means?


Can you use test["V1"] instead of test$V1? You get this error because test is a non-recursive (atomic) object, so the $ operator cannot be applied to it. You can find more info here.


I did, and now I get this:

Error in abs(test["V2"]) : non-numeric argument to mathematical function
Calls: order
Execution halted

Very useful code. Any idea how to make this work for a list of data frames?

https://pasteboard.co/JlDwYx4.png

6.8 years ago

This can definitely be done in bash/awk, but it would probably take me longer to figure out, and it's easy enough with Python.

import sys

in_file = sys.argv[1]
out_file = sys.argv[2]

genes = {}
with open(in_file) as f:
    for line in f:
        line = line.strip().split()
        gene_id = line[0]
        val = float(line[1])

        # Find and replace dups if necessary.
        if gene_id in genes:
            if val > genes[gene_id]:
                genes[gene_id] = val
        else:
            genes[gene_id] = val

# Write one tab-separated line per unique gene ID.
with open(out_file, "w") as out:
    for x in genes:
        # str.join takes a single iterable, so wrap the fields in a list.
        output = "\t".join([x, str(genes[x])])
        print(output, file=out)
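
The question also asks for a ranked list, while the script above writes genes in the order they first appear in the input. Here is a minimal sketch of one way to rank the result, assuming "rank" means sorting by score from highest to lowest. The helper name `rank_by_score` is hypothetical.

```python
def rank_by_score(genes):
    """Return (gene_id, value) pairs sorted by value, highest first.

    `genes` is a dict mapping gene ID to its best value, like the one
    built by the script above. Ties keep their original relative order
    because Python's sort is stable.
    """
    return sorted(genes.items(), key=lambda item: item[1], reverse=True)


# Small illustrative dict using IDs from the example in the question.
genes = {
    "TMCS09g1008699": 6.4,
    "TMCS09g1008672": 6.5,
    "TMCS09g1008677": 4.6,
}
for gene_id, value in rank_by_score(genes):
    print(gene_id, value, sep="\t")
```

In the script, the write loop could iterate over `rank_by_score(genes)` instead of over `genes` directly.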

Sorry, I'm not proficient in Python. Assuming I want to run your script in a bash for loop over a list of files, could you tell me how to complete your script?


If you want a single output file for each input file (rather than one for many input files), you can just run the script several times from the command line for your files - which is particularly easy if you segregate them and they have a common file extension:

for x in *.txt; do
    python my_python_script.py "$x" "$x".out
done

I've edited my answer slightly to set up the input and output files.
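
If a shell loop is not convenient, the same per-file processing could be driven from Python itself. This is a rough sketch under stated assumptions: it requires Python 3, and the function names `best_scores` and `process_all` are hypothetical. The `*.txt` pattern and the `.out` suffix simply mirror the loop above.

```python
from pathlib import Path


def best_scores(path):
    """Return a dict mapping each gene ID in `path` to its highest value."""
    genes = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or incomplete lines
            gene_id, val = fields[0], float(fields[1])
            if gene_id not in genes or val > genes[gene_id]:
                genes[gene_id] = val
    return genes


def process_all(directory, pattern="*.txt"):
    """Write `<name>.out` next to every file in `directory` matching `pattern`."""
    written = []
    for in_path in sorted(Path(directory).glob(pattern)):
        out_path = in_path.with_name(in_path.name + ".out")
        with open(out_path, "w") as out:
            for gene_id, val in best_scores(in_path).items():
                out.write(f"{gene_id}\t{val}\n")
        written.append(out_path)
    return written
```

Calling `process_all(".")` would then process every .txt file in the current directory and return the list of output paths.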


I get this error:

  File "filter_unique_best_score.py", line 25
    print(output, file = out)
                       ^
SyntaxError: invalid syntax

Are you using python 3? If you're using python 2, try adding this as the first line in the script: from __future__ import print_function.


Unfortunately, yes.

I did, and I get this:

Traceback (most recent call last):
  File "filter_unique_best_score.py", line 26, in <module>
    output = "\t".join(x, string(genes[x]))
NameError: name 'string' is not defined

Oh, that's my mistake. I mix languages too frequently: Python uses str for string conversions, not string. I've updated the code.
