How to remove exact duplicate value from table
2
0
Entering edit mode
7.0 years ago
raya.girish ▴ 40

hi all i have table i want to remove duplicates so for example DDC DDC and DDC_Average i only want DDC_Average in column

DDC 1644    205311_at   2.0444143854    1.7238224669

DDC 1644    214347_s_at 1.5843940083    1.155809

DDC_Average NA  NA  1.8144041969    1.4398157335

DDHD2   23259   212690_at   7.6609683554    8.024884

DDIT3   1649    209383_at   7.5272440019    7.260298

DDIT4   54541   202887_s_at 10.8441113022   11.3783353789

DDN 23109   214788_x_at 1.4528494554    1.198107

DDO 8528    207418_s_at 2.2643702324    2.213884

DDO 8528    222134_at   2.1829266013    2.455178

DDO_Average NA  NA  2.2236484169    2.334531

DDOST   1650    208674_x_at 11.5236078099   11.38367

DDOST   1650    208675_s_at 11.5512693309   11.48695

DDOST_Average   NA  NA  11.5374385704   11.43531
Column duplicatereomve GeneExpression • 1.9k views
ADD COMMENT
0
Entering edit mode
7.0 years ago

output: (If you want to remove all the duplicated rows. For eg for DDC, DDC, DDC_Average, keep only DDC_Average)

> test[ !(test$V1 %in% test$V1[duplicated(test$V1)]), ]
              V1    V2          V3        V4        V5
3    DDC_Average    NA        <NA>  1.814404  1.439816
4          DDHD2 23259   212690_at  7.660968  8.024884
5          DDIT3  1649   209383_at  7.527244  7.260298
6          DDIT4 54541 202887_s_at 10.844111 11.378335
7            DDN 23109 214788_x_at  1.452849  1.198107
10   DDO_Average    NA        <NA>  2.223648  2.334531
13 DDOST_Average    NA        <NA> 11.537439 11.435310

output: (If you want to remove all the duplicated rows, but keep only one record of duplicated rows. For eg for DDC, DDC, DDC_Average, keep only one DDC and DDC_Average)

> test[!duplicated(test[,1]),]
              V1    V2          V3        V4        V5
1            DDC  1644   205311_at  2.044414  1.723822
3    DDC_Average    NA        <NA>  1.814404  1.439816
4          DDHD2 23259   212690_at  7.660968  8.024884
5          DDIT3  1649   209383_at  7.527244  7.260298
6          DDIT4 54541 202887_s_at 10.844111 11.378335
7            DDN 23109 214788_x_at  1.452849  1.198107
8            DDO  8528 207418_s_at  2.264370  2.213884
10   DDO_Average    NA        <NA>  2.223648  2.334531
11         DDOST  1650 208674_x_at 11.523608 11.383670
13 DDOST_Average    NA        <NA> 11.537439 11.435310

input:

> test
              V1    V2          V3        V4        V5
1            DDC  1644   205311_at  2.044414  1.723822
2            DDC  1644 214347_s_at  1.584394  1.155809
3    DDC_Average    NA        <NA>  1.814404  1.439816
4          DDHD2 23259   212690_at  7.660968  8.024884
5          DDIT3  1649   209383_at  7.527244  7.260298
6          DDIT4 54541 202887_s_at 10.844111 11.378335
7            DDN 23109 214788_x_at  1.452849  1.198107
8            DDO  8528 207418_s_at  2.264370  2.213884
9            DDO  8528   222134_at  2.182927  2.455178
10   DDO_Average    NA        <NA>  2.223648  2.334531
11         DDOST  1650 208674_x_at 11.523608 11.383670
12         DDOST  1650 208675_s_at 11.551269 11.486950
13 DDOST_Average    NA        <NA> 11.537439 11.435310
ADD COMMENT
0
Entering edit mode

Hi cpad0112 i am getting result but this not what i want Example DDC_Average is the average of DDC and DDC So i no longer need this values(DDC and DDC ) i only want DDC_Average . Similary for other as well for example DDO and DDOST

ADD REPLY
0
Entering edit mode

updated the code and added few lines below: (first one should give you the output you wanted):

# Loads sqldf library
> library(sqldf)
# Outputs only those entries present exactly once based on column V1.
> sqldf('select * from test group by V1 having count(V1) == 1')
# Outputs only those entries present exactly once + one record of repeated/multiplicated rows
> sqldf('select * from test group by V1)
# Outputs only those records with text "avarage"
> sqldf('select * from test where V1 like "%aver%"')

In python3.6:

>>> import os
>>> import pandas as pd
>>> from pysqldf import SQLDF
>>> sqldf = SQLDF(globals())
>>> df=pd.read_csv("test.txt", sep="\t", header=None)
>>> sqldf.execute("SELECT * FROM df group by c0")
>>> sqldf("SELECT * FROM df group by c0 having count(c0) == 1")
>>> sqdf.execute('select * from df where c0 like "%Aver%"')
ADD REPLY
0
Entering edit mode
7.0 years ago
EagleEye 7.6k

I hope I understood your question,

grep "_Average" file.txt

If you have duplicate exact entries even with "*_Average",

cat file.txt | grep "_Average" | awk '!x[0]++'
ADD COMMENT

Login before adding your answer.

Traffic: 3097 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6