Question

How to handle duplicate row names in R

4.7 years ago

koushikayaluri ▴ 70

Hi all,

I have some RNA-Seq data and plan to run a DESeq analysis on it. When I assign the gene names as row names, I get the error "duplicate row names". I don't want to remove any genes; is there a way to work around this? Below is an example of my data (countdata):

gene          sample1    sample2    sample3
CCDC7           419        326         360
CNNM1           60         48          22
PAK6            208        200         176
RPP14           50         42          91
IDS              8         11          18
PAK6            702        802         612
CFTR            58         48          40
CNN3            1200       1224        1605
CNNM1           906        989         823

I have tried the approach from "How To Deal With Duplicate Row Names Error In R". I took only the gene names into a separate data frame and tried to set them as the row names of this data frame:

rownames(countdata2) = make.names(countdata, unique = TRUE)

But I get an error saying "Invalid row.names length". Can anyone please guide me through this? Thank you very much in advance.

RNA-Seq alignment R software error • 11k views

updated 4.7 years ago by zx8754 12k • written 4.7 years ago by koushikayaluri ▴ 70

You should probably figure out why there are duplicate gene names first. Can you post the code you used to generate the count table?
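As a starting point for that check, here is a minimal sketch that lists which gene names are repeated and shows their rows side by side. It uses the example table from the question; the names `dup_genes` and `dup_rows` are illustrative and not from the original post.

```r
# Example data copied from the question
countdata <- read.table(text = "gene sample1 sample2 sample3
CCDC7 419 326 360
CNNM1 60 48 22
PAK6 208 200 176
RPP14 50 42 91
IDS 8 11 18
PAK6 702 802 612
CFTR 58 48 40
CNN3 1200 1224 1605
CNNM1 906 989 823", header = TRUE)

# Gene names that occur more than once
dup_genes <- unique(countdata$gene[duplicated(countdata$gene)])

# All rows belonging to those genes, ordered so the copies sit next to each other
dup_rows <- countdata[countdata$gene %in% dup_genes, ]
dup_rows <- dup_rows[order(dup_rows$gene), ]
dup_rows
```

If the copies of a gene have very different counts, as they do here, they are probably distinct features (for example transcripts) rather than accidental repeats of the same row.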

4.7 years ago by rpolicastro 13k

This is probably a transcript-level file, which has multiple rows for a gene with multiple transcripts. Keep the IDs from your original file.
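If the repeated names do turn out to be several transcripts of one gene (an assumption about the data that the thread does not confirm), one possible option is to collapse them to a single row per gene by summing the counts with base R's `rowsum()`. A minimal sketch on the example table; `gene_counts` is an illustrative name:

```r
# Example data copied from the question
countdata <- read.table(text = "gene sample1 sample2 sample3
CCDC7 419 326 360
CNNM1 60 48 22
PAK6 208 200 176
RPP14 50 42 91
IDS 8 11 18
PAK6 702 802 612
CFTR 58 48 40
CNN3 1200 1224 1605
CNNM1 906 989 823", header = TRUE)

# Sum the count columns within each gene; the result has one row per
# unique gene name, and those names become the (now unique) row names
gene_counts <- rowsum(countdata[, -1], group = countdata$gene)
gene_counts
```

No counts are discarded this way, but rows are merged, so it only makes sense when the duplicated names really refer to the same gene.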

4.7 years ago by karl.stamm 4.1k

Go back and do things right with Ensembl IDs. Those are always unique.

4.7 years ago by swbarnes2 15k

row.names(countdata2) <- paste0(countdata$gene, "_", seq_along(countdata$gene)) would give you unique row names, but they would no longer be the gene names themselves (unless you also set countdata$gene <- paste0(countdata$gene, "_", seq_along(countdata$gene))).

4.7 years ago by Dunois ★ 2.9k

score 2 · Answer 1 · 2021-01-15

Try this:

# example data with duplicated rownames
countdata <- read.table(text = "gene          sample1    sample2    sample3
CCDC7           419        326         360
CNNM1           60         48          22
PAK6            208        200         176
RPP14           50         42          91
IDS              8         11          18
PAK6            702        802         612
CFTR            58         48          40
CNN3            1200       1224        1605
CNNM1           906        989         823", header = TRUE)

# exclude gene column, as it will become rowname
countdata2 <- countdata[, -1]

# check if we have duplicates
table(duplicated(countdata$gene))
# FALSE  TRUE 
#     7     2 

# this will throw error as expected
rownames(countdata2) <- countdata$gene
# Error in `.rowNamesDF<-`(x, value = value) : 
#   duplicate 'row.names' are not allowed
# In addition: Warning message:
#   non-unique values when setting 'row.names': ‘CNNM1’, ‘PAK6’ 


# This works fine
rownames(countdata2) <- make.names(countdata$gene, unique = TRUE)

countdata2
#         sample1 sample2 sample3
# CCDC7       419     326     360
# CNNM1        60      48      22
# PAK6        208     200     176
# RPP14        50      42      91
# IDS           8      11      18
# PAK6.1      702     802     612
# CFTR         58      48      40
# CNN3       1200    1224    1605
# CNNM1.1     906     989     823
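
For reference, the call in the question most likely failed because the whole data frame, rather than the gene column, was passed to `make.names()`. A data frame is coerced to one string per column, so the result has as many names as there are columns, not rows. A small check on the same example data (the names `n_from_df` and `n_from_gene` are illustrative):

```r
# Example data copied from the question
countdata <- read.table(text = "gene sample1 sample2 sample3
CCDC7 419 326 360
CNNM1 60 48 22
PAK6 208 200 176
RPP14 50 42 91
IDS 8 11 18
PAK6 702 802 612
CFTR 58 48 40
CNN3 1200 1224 1605
CNNM1 906 989 823", header = TRUE)
countdata2 <- countdata[, -1]

# Whole data frame passed in: one name per column, not one per row
n_from_df <- length(make.names(countdata, unique = TRUE))

# Gene column passed in: one name per row, which is what rownames<- needs
n_from_gene <- length(make.names(countdata$gene, unique = TRUE))

c(from_data_frame = n_from_df, from_gene_column = n_from_gene,
  rows_needed = nrow(countdata2))
```

Only the second length matches the number of rows in `countdata2`, which is why the first form is rejected with a row-names length error.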