Question

How to remove duplicate rows and keep highest values only for a gene list with scores

1

7.4 years ago

lessismore ★ 1.4k

Dear all,

I have a two-column table of 20k lines. 1st column: a list of gene IDs (there can be duplicated IDs)
2nd column: a value
What I want is to rank my list, keeping only unique gene IDs. For duplicated gene IDs, I want to keep only the row with the highest score.

Here is an example. Thanks in advance.

TMCS09g1008699  6.4
TMCS09g1008671  6.4
TMCS09g1008672  6.5
TMCS09g1008673  6
TMCS09g1008674  5.4
TMCS09g1008675  5.4
TMCS09g1008676  4.9
TMCS09g1008677  4.6
TMCS09g1008677  4.4
TMCS09g1008679  4.3
TMCS09g1008680  3.9
TMCS09g1008681  3.8
TMCS09g1008682  3.6
TMCS09g1008683  3.5
TMCS09g1008684  3.5
TMCS09g1008685  3.4
TMCS09g1008686  3.4
TMCS09g1008687  3.4
TMCS09g1008688  3
TMCS09g1008689  2.6
TMCS09g1008690  2
TMCS09g1008699  5.9

bash R • 14k views

ADD COMMENT • link updated 7.4 years ago by arta ▴ 670 • written 7.4 years ago by lessismore ★ 1.4k

score 3 · Answer 1 · 2018-03-16

3

7.4 years ago

arta ▴ 670

Suppose df is your data frame, id is the first column, and value is the second:

df <- df[order(df$id, -abs(df$value)), ]  ### sort by id, then by descending |value|; drop abs() if scores can be negative
df <- df[!duplicated(df$id), ]  ### keep the first (highest) row for each id
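
For readers who prefer Python, the same sort-then-deduplicate idea can be sketched with the standard library only. This is an illustration rather than part of the answer above: the function name keep_highest is hypothetical, and the sample rows are the duplicated IDs from the question's example. It sorts on the raw value rather than its absolute value, which is an assumption about what "highest score" means.

```python
from itertools import groupby

def keep_highest(rows):
    """Return one (gene_id, value) pair per gene ID, keeping the highest value."""
    # Sort by ID, then by descending value, so the best row comes first per ID.
    ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
    # groupby yields consecutive runs of equal IDs; take the first row of each run.
    return [next(group) for _, group in groupby(ordered, key=lambda r: r[0])]

# Duplicated IDs taken from the example in the question.
rows = [
    ("TMCS09g1008699", 6.4),
    ("TMCS09g1008677", 4.6),
    ("TMCS09g1008677", 4.4),
    ("TMCS09g1008699", 5.9),
]
print(keep_highest(rows))
```

Like the R version, this relies on the sort placing the highest value first within each ID, so the deduplication step only has to keep the first row it sees.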

ADD COMMENT • link 7.4 years ago by arta ▴ 670

0

There's a mistake here: it keeps the lowest.

ADD REPLY • link 7.4 years ago by lessismore ★ 1.4k

0

Can you try it again ?

ADD REPLY • link 7.4 years ago by arta ▴ 670

0

When I put it into a script like this:

test = commandArgs(trailingOnly=TRUE)

read.delim(test, header = FALSE, sep ="\t")

test[order(test$V1, -abs(test$V2) ), ] ### sort first
test[ !duplicated(test$V1), ]  ### Keep highest

I get partial output and then this error:

Error in test$V1 : $ operator is invalid for atomic vectors
Calls: order
Execution halted

Do you know what that means?

ADD REPLY • link 7.3 years ago by lessismore ★ 1.4k

0

Can you use test["V1"] instead of test$V1? You get this error because test is an atomic (non-recursive) object, and the $ operator cannot be used on atomic vectors. You can find more info here.

ADD REPLY • link 7.4 years ago by arta ▴ 670

0

I did, and now I get this:

Error in abs(test["V2"]) : non-numeric argument to mathematical function
Calls: order
Execution halted

ADD REPLY • link 7.3 years ago by lessismore ★ 1.4k

0

Very useful code. Any idea how to make this work for a list of data frames?

https://pasteboard.co/JlDwYx4.png

ADD REPLY • link 5.0 years ago by thanos_docp • 0

score 0 · Answer 2 · 2018-03-16

0

7.4 years ago

jared.andrews07 ★ 19k

This can definitely be done in bash/awk, but it would probably take me longer to figure out, and it's easy enough with Python.

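import sys

in_file = sys.argv[1]
out_file = sys.argv[2]

# Map each gene ID to the highest value seen so far.
genes = {}
with open(in_file) as f:
    for line in f:
        line = line.strip().split()
        if len(line) < 2:
            continue  # Skip blank or incomplete lines.
        gene_id = line[0]
        val = float(line[1])

        # Replace the stored value only if this duplicate scores higher.
        if gene_id in genes:
            if val > genes[gene_id]:
                genes[gene_id] = val
        else:
            genes[gene_id] = val

# Write one tab-separated line per unique gene ID.
with open(out_file, "w") as out:
    for x in genes:
        # str.join takes a single iterable, so pass the fields as a list.
        output = "\t".join([x, str(genes[x])])
        print(output, file=out)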
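
The question also asks for the list to be ranked, which the script above does not do: it writes genes in the order they were first seen. A minimal sketch of one way to rank them, assuming a genes dictionary shaped like the one built above (the helper name rank_by_score is hypothetical, and the sample values come from the question's example):

```python
def rank_by_score(genes):
    """Return (gene_id, value) pairs sorted from highest to lowest value."""
    # sorted() is stable, so genes with equal scores keep their dict order.
    return sorted(genes.items(), key=lambda item: item[1], reverse=True)

# A few deduplicated entries taken from the example in the question.
genes = {"TMCS09g1008699": 6.4, "TMCS09g1008672": 6.5, "TMCS09g1008677": 4.6}
for gene_id, val in rank_by_score(genes):
    print(gene_id, val, sep="\t")
```

In the script, the loop over genes could iterate over rank_by_score(genes) instead to write the output in ranked order.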

ADD COMMENT • link 7.3 years ago by jared.andrews07 ★ 19k

0

Sorry, I'm not proficient in Python. Assuming I want to put your script in a bash for loop over a list of files, could you tell me how to complete your script?

ADD REPLY • link 7.4 years ago by lessismore ★ 1.4k

0

If you want a single output file for each input file (rather than one for many input files), you can just run the script several times from the command line for your files - which is particularly easy if you segregate them and they have a common file extension:

for x in *.txt; do
    python my_python_script.py "$x" "$x".out
done

I've edited my answer slightly to set up the input and output files.
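
If a shell loop is not convenient, the same per-file processing could also be done inside Python. The following is only a sketch under assumptions: the function names highest_per_gene and process_directory are hypothetical, the *.txt pattern and the .out suffix simply mirror the loop above, and the filtering logic restates the answer's approach rather than importing it.

```python
from pathlib import Path

def highest_per_gene(in_path, out_path):
    """Write one line per gene ID from in_path, keeping its highest value."""
    genes = {}
    with open(in_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or incomplete lines
            gene_id, val = fields[0], float(fields[1])
            if gene_id not in genes or val > genes[gene_id]:
                genes[gene_id] = val
    with open(out_path, "w") as out:
        for gene_id, val in genes.items():
            print(gene_id, val, sep="\t", file=out)

def process_directory(directory, pattern="*.txt"):
    """Run highest_per_gene on every matching file, writing <name>.out beside it."""
    written = []
    for in_path in sorted(Path(directory).glob(pattern)):
        out_path = Path(str(in_path) + ".out")
        highest_per_gene(in_path, out_path)
        written.append(out_path)
    return written

# Example call (hypothetical): process_directory(".")
```

Because the output files end in .out, they do not match the *.txt pattern and are not picked up again on a second run.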

ADD REPLY • link 7.3 years ago by jared.andrews07 ★ 19k

0

I get this error:

  File "filter_unique_best_score.py", line 25
    print(output, file = out)
                       ^
SyntaxError: invalid syntax

ADD REPLY • link 7.3 years ago by lessismore ★ 1.4k

0

Are you using python 3? If you're using python 2, try adding this as the first line in the script: from __future__ import print_function.

ADD REPLY • link 7.3 years ago by jared.andrews07 ★ 19k

0

Unfortunately, yes.

I did, and I get this:

Traceback (most recent call last):
  File "filter_unique_best_score.py", line 26, in <module>
    output = "\t".join(x, string(genes[x]))
NameError: name 'string' is not defined

ADD REPLY • link 7.3 years ago by lessismore ★ 1.4k

0

Oh, that's my mistake. I mix languages too frequently: Python uses str for string conversions, not string. I've updated the code.

ADD REPLY • link 7.3 years ago by jared.andrews07 ★ 19k