Question

How to solve duplicate rownames error on R


3.3 years ago

jbnrodriguez ▴ 30

I know this has been asked several times, but I've tried many of the solutions suggested before and they don't work. I keep getting the error below in R (no matter how I modify the csv file) when I run the following:

annotation_file <- "Best3_Abicinctus_FunctionalAnnotation.csv"
annotation_info <- read.csv(annotation_file, row.names=1, header=T)
Error in read.table(file=file,header=header,sep=sep,quote=quote, : duplicate 'row.names' are not allowed

I cannot set 'row.names=NULL' as this will screw up the data order for what I intend to do downstream. I even removed blanks/tabs from the end of every row using sed 's/[[:blank:]]*$//', but the error doesn't go away. I also tried replacing commas and spaces in all of the column entries, and yet the error persists. This is what the first few lines of the file look like:

"gene_id","name","product"
"maker-Contig673-pred_gff_AUGUSTUS-gene-1.6","stk10","Serine/threonine-protein kinase 10"
"maker-Contig204-pred_gff_AUGUSTUS-gene-3.1","ccnh","Cyclin-H"
"maker-Contig31958-pred_gff_AUGUSTUS-gene-0.7","fam136a","Protein FAM136A"
"maker-Contig31340-pred_gff_AUGUSTUS-gene-0.8","h2b","Histone H2B"

The file is available here (Dropbox link) in case you would like to take a look. I'm on a deadline and I'm helplessly stuck at this step. Any help would be highly appreciated.

R • 4.8k views

ADD COMMENT • link updated 3.3 years ago by Lisa Ha ▴ 120 • written 3.3 years ago by jbnrodriguez ▴ 30


gene_id has 216 duplicate values. Get rid of duplicate rows.
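
One way to check this from within R is to read the file without row.names and inspect the first column directly. The sketch below is only an illustration: it uses a small made-up table in place of the real file, so the gene IDs and annotations are placeholders, not values from the Dropbox file.

```r
# Toy stand-in for the annotation file; the rows are invented for illustration.
csv_text <- '"gene_id","name","product"
"gene-1","stk10","Serine/threonine-protein kinase 10"
"gene-2","ccnh","Cyclin-H"
"gene-1","unknown","Hypothetical protein"
"gene-3","h2b","Histone H2B"'

# Read without row.names so that duplicated IDs do not trigger the error.
annotation_info <- read.csv(text = csv_text, header = TRUE,
                            stringsAsFactors = FALSE)

# Number of rows whose gene_id already appeared earlier in the table.
n_dup <- sum(duplicated(annotation_info$gene_id))

# The IDs that occur more than once.
dup_ids <- unique(annotation_info$gene_id[duplicated(annotation_info$gene_id)])

print(n_dup)
print(dup_ids)
```

On the real file, replacing `text = csv_text` with the file name should give the count and the list of offending IDs.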

ADD REPLY • link 3.3 years ago by cpad0112 21k


I also tested removing all duplicates, but the error doesn't go away. When I run awk 'x[$1]++ ==1 {print $1 " is duplicated"}' rmdup_Best3_Abicinctus_FunctionalAnnotation.csv I don't get any output for this new file, which I believe confirms there are no more duplicates in the gene_id column.

ADD REPLY • link 3.3 years ago by jbnrodriguez ▴ 30


How about sort -k1 test.txt | uniq?

ADD REPLY • link 3.3 years ago by cpad0112 21k


What are you doing downstream that requires you to have rownames?

ADD REPLY • link 3.3 years ago by rpolicastro 13k


Cross-posted at SO:

How to solve duplicate rownames error on R

ADD REPLY • link 3.3 years ago by zx8754 12k

score 1 · Answer 1 · 2021-09-01

The awk code only recognizes fully duplicated rows, not every row in which the gene_id is repeated. The file therefore still contains duplicate IDs, which is why you still get the error message. If you want to remove all gene_id duplicates, the following code works, but you will lose information whenever a gene_id has more than one name/product (e.g. maker-Contig29174-pred_gff_AUGUSTUS-gene-0.4 has two name entries, unknown and dok2).

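library(tidyverse)
# Read without row.names so duplicated gene_id values do not raise an error
annotation_info <- read.csv(annotation_file, header=TRUE)
# Keep only the first row for each gene_id
uniqAnnoInfo <- annotation_info %>% distinct(gene_id, .keep_all = TRUE)
rownames(uniqAnnoInfo) <- uniqAnnoInfo$gene_id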
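
To illustrate the difference between fully duplicated rows and duplicated gene_id values, and one possible way to avoid losing the extra annotations, here is a minimal base R sketch. The table is toy data invented for illustration (including the product strings), and collapsing names with "; " is just one option, not something the original file or workflow requires.

```r
# Toy data, invented for illustration: "gene-1" has two different annotations,
# and the "gene-2" row is repeated verbatim.
df <- data.frame(
  gene_id = c("gene-1", "gene-2", "gene-2", "gene-1"),
  name    = c("unknown", "ccnh", "ccnh", "dok2"),
  product = c("Hypothetical protein", "Cyclin-H", "Cyclin-H",
              "Docking protein 2"),
  stringsAsFactors = FALSE
)

# Dropping rows that are identical in every column...
rmdup <- unique(df)

# ...still leaves gene_id values that occur more than once.
id_still_duplicated <- anyDuplicated(rmdup$gene_id) > 0

# Alternative to keeping only the first row per gene_id: collapse the distinct
# names/products of each gene_id into one "; "-separated string.
collapsed <- aggregate(
  rmdup[c("name", "product")],
  by  = list(gene_id = rmdup$gene_id),
  FUN = function(x) paste(unique(x), collapse = "; ")
)

# gene_id is now unique, so it can safely be used as row names.
rownames(collapsed) <- collapsed$gene_id

print(id_still_duplicated)
print(collapsed)
```

With this approach every gene_id appears once, and genes with conflicting annotations carry all of them in a single cell.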

score 0 · Answer 2 · 2021-09-01


3.3 years ago

benformatics 4.1k

Use make.unique, which appends .1, .2, etc. to duplicated names.

annotation_info <- read.table('./test.csv',row.names=1,sep=',',header=T)
Error in read.table("./test.csv", row.names = 1, sep = ",") : 
  duplicate 'row.names' are not allowed

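# Read without row.names, then build unique row names from the first column
# (as.character guards against the column being read as a factor in R < 4.0)
annotation_info <- read.table('./test.csv',sep=',',header=T)
row.names(annotation_info) <- make.unique(as.character(annotation_info[,1]))
annotation_info[,1] <- NULL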
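
For reference, this is what make.unique does to a vector with repeats. The IDs below are placeholders made up for the example.

```r
# Placeholder IDs, invented for illustration.
ids <- c("gene-1", "gene-2", "gene-1", "gene-1")

# The first occurrence keeps its name; later repeats get .1, .2, ... appended.
unique_ids <- make.unique(ids)

print(unique_ids)
# "gene-1" "gene-2" "gene-1.1" "gene-1.2"
```

Note that only the first occurrence keeps the original ID, so a later exact lookup by "gene-1" returns just that one row.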

ADD COMMENT • link 3.3 years ago by benformatics 4.1k


Although I have no idea if you need the row.names to be exact matches at some point - so this could break downstream...

ADD REPLY • link 3.3 years ago by benformatics 4.1k


Thank you @benformatics, but this won't work: I intend to do the following downstream, so I need exact matches to the gene_id info in sig_de_results (my list of significantly expressed genes from DESeq2, which contains the gene_id info in the first column).

sig_de_annotations <- annotation_info[rownames(sig_de_results),]
sig_de_results <- cbind(sig_de_annotations, as.data.frame(sig_de_results))
write.csv(sig_de_results, row.names=T, file="DEGlist_Deformed_vs_Healthy.csv")
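
If the row names are only needed for this lookup, one possible workaround is to keep gene_id as an ordinary column and use match() instead. The sketch below is an assumption-laden illustration, not a drop-in replacement: both tables are toy stand-ins with invented values, and sig_de_results is assumed to carry the gene IDs as row names. match() returns the position of the first annotation row for each ID, so duplicated IDs in the annotation table raise no error, but only the first annotation per gene is kept.

```r
# Toy stand-ins; all values are invented for illustration.
annotation_info <- data.frame(
  gene_id = c("gene-1", "gene-2", "gene-1", "gene-3"),
  name    = c("unknown", "ccnh", "dok2", "h2b"),
  stringsAsFactors = FALSE
)
sig_de_results <- data.frame(
  log2FoldChange = c(2.1, -1.4),
  padj           = c(0.001, 0.03),
  row.names      = c("gene-3", "gene-1")
)

# Position of the first annotation row for each significant gene
# (NA if the gene has no annotation).
idx <- match(rownames(sig_de_results), annotation_info$gene_id)
sig_de_annotations <- annotation_info[idx, , drop = FALSE]

# Combine annotations with the DE statistics, keeping the DE row order.
annotated <- cbind(sig_de_annotations, sig_de_results)
rownames(annotated) <- rownames(sig_de_results)

print(annotated)
```

The exact gene_id values are preserved, and the result can be passed to write.csv as before.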

ADD REPLY • link 3.3 years ago by jbnrodriguez ▴ 30