My file does not have duplicates, but it still shows "duplicate 'row.names' are not allowed"
6.4 years ago
mikysyc2016 ▴ 120

Hi all, I checked my file with which(duplicated(file)) and already removed the duplicates from it. But when I read the file into R, it still shows the error below:

 x <- read.delim("merged_6_rd.txt", row.names = 1, stringsAsFactors = FALSE)
Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  duplicate 'row.names' are not allowed

I do not know how to deal with it. Thanks,

rna-seq R

Assuming you are on *nix/macOS, run the following command and let us know the output:

~ $ cut -f1 merged_6_rd.txt | uniq -d

Please add sort, as mentioned in Pierre's post, if the entries in column 1 are not sorted. If they are already sorted, you don't need to sort.


uniq needs sorted input:

 cut -f1 merged_6_rd.txt | sort | uniq -d

Why not count the occurrences of each value in the first column?
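
For instance, a quick way to get those counts (a minimal sketch, assuming the tab-separated merged_6_rd.txt from the thread):

 # count occurrences of each ID in column 1, most frequent first
 cut -f1 merged_6_rd.txt | sort | uniq -c | sort -rn | head

Any ID with a count above 1 is a duplicate.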


When I use

cut -f1 merged_6_rd.txt | sort | uniq -d

I get:

ID
NM_001001130
NM_001001144
NM_001001152
NM_001001160
NM_001001176
NM_001001177
NM_001001178
NM_001001180
NM_001001181
NM_001001182
NM_001001183
NM_00100118
..........

Those are the duplicated entries in your data. Now run:

 grep -i -w 'NM_001001130' merged_6_rd.txt

You should get more than one row, and in the first column of those rows you should see the duplicate entries of NM_001001130.


You are right, I get two:

NM_001001130    22  16  14  12  25  18  2218
NM_001001130

How can I remove the second one? Thanks!


Well, you need to look at the other duplicate entries and see if they follow the same pattern. If so, one can write a script to remove the empty entries; otherwise, you need to come up with another way to handle them. Make a list of the duplicate entries in a separate file.

If it is the same pattern, see if the following works:

 $ awk '!a[$1]++' merged_6_rd.txt

Please validate the output against the previously identified duplicates. This keeps only the first line seen for each ID, on the assumption that the empty lines come second when there are duplicates. If that is not the case, try:

 $ awk '$2!=""' merged_6_rd.txt

This assumes that the duplicate lines to be removed have an empty 2nd column.
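
A minimal way to apply and sanity-check the first approach (the output filename merged_6_rd.dedup.txt is just an illustration):

 # keep only the first occurrence of each ID in column 1
 awk '!a[$1]++' merged_6_rd.txt > merged_6_rd.dedup.txt
 # should print nothing if no duplicate IDs remain
 cut -f1 merged_6_rd.dedup.txt | sort | uniq -d
 # compare line counts to see how many rows were dropped
 wc -l merged_6_rd.txt merged_6_rd.dedup.txt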


Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.


You have been given good pointers on how to remove rows with duplicate names, but I feel you should investigate why you have such rows in the first place: generally, analysis pipelines output results with unique identifiers. How was this file created?
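
One way to start that investigation (a sketch, assuming the file is tab-separated with a header line):

 # count rows whose field count differs from the header's
 awk -F'\t' 'NR==1{n=NF; next} NF!=n{c++} END{print c+0, "rows with unexpected field counts"}' merged_6_rd.txt

A large count would suggest the merging step produced truncated lines rather than genuinely duplicated IDs.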
