Question

Finding number of duplicates in R

0

8.3 years ago

nkabo ▴ 80

Hello, I have a list of gene names and several features; each row represents a gene and its properties. There are approximately 15000 rows and 11 columns. Some genes appear more than once (for example, there are 4 TP53 rows), and I want to find out how many times each gene name is duplicated and use that value. Rows with the same gene name are adjacent. As an example:

Gene name   rs_id   aa change
CASP7       xx      yy
TP53        zz      hh
TP53        ff      cc
TP53        bb      gg
WNT         aa      dd
WNT         qq      kk

I want to find the number of duplicates for each gene (4 for TP53 and 2 for WNT) and I also want to check the aa change for each duplicate. Is there a way to do it in R? Thanks in advance.

R • 51k views

ADD COMMENT • link updated 8.3 years ago by keith.hughitt ▴ 280 • written 8.3 years ago by nkabo ▴ 80

1

You can try the plyr library; see my post on the Bioconductor support site:

https://support.bioconductor.org/p/71837/#71839

ADD REPLY • link 8.3 years ago by Benn 8.3k

0

Thank you for your answer, I used the code below:

library(dplyr)
newdf <- df %>% group_by(ID) %>% mutate(replicate=seq(n()))

However, I want a single number per gene (for example, if a gene is repeated 6 times, the column should read 6,6,6,6,6,6, not 1,2,3,4,5,6). Could you suggest a way to do it?

ADD REPLY • link 8.3 years ago by nkabo ▴ 80

1

Try the count function from plyr.

?count

ADD REPLY • link 8.3 years ago by Benn 8.3k

score 8 · Answer 1 · 2016-08-16

You can use the table function in R to get the count of each duplicated gene.

For example, if the gene IDs are stored in a column gene_id, you could do:

> dat <- data.frame(gene_id=sample(1:3, 20, replace=TRUE), other_col='foo')
> table(dat$gene_id)

1 2 3 
5 6 9 
> as.data.frame(table(dat$gene_id))
  Var1 Freq
1    1    5
2    2    6
3    3    9

This gives you a data.frame with the number of times each ID occurs; IDs with Freq greater than 1 are the duplicated ones.
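
If you want the count repeated on every row (6,6,6,6,6,6 rather than 1,2,3,4,5,6, as asked in the comments), one option is to look each row's gene up in the table. This is only a sketch: the toy data frame, its column names (gene_name, rs_id, aa_change) and the new column n_occurrences are made up to mirror the example in the question. With dplyr, replacing seq(n()) by n() inside mutate() should give the same per-row count.

```r
# Toy data mirroring the example in the question (values are placeholders).
dat <- data.frame(
  gene_name = c("CASP7", "TP53", "TP53", "TP53", "WNT", "WNT"),
  rs_id     = c("xx", "zz", "ff", "bb", "aa", "qq"),
  aa_change = c("yy", "hh", "cc", "gg", "dd", "kk"),
  stringsAsFactors = FALSE
)

# Count how many rows each gene name has.
counts <- table(dat$gene_name)

# Index the table by gene name so every row carries its group size.
dat$n_occurrences <- as.integer(counts[as.character(dat$gene_name)])

# Keep only the rows whose gene appears more than once.
dups <- dat[dat$n_occurrences > 1, ]

print(dat)
```

In this toy data, every TP53 row gets 3 and every WNT row gets 2, so the value can be used directly for filtering or further calculations.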

I'm not sure what you mean by "check the aa change for each duplicate", but presumably you could get the list of unique gene IDs and loop over them, selecting the rows for each ID and performing whatever operation you need on each group of duplicates.
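
As a rough illustration of that loop, here is a base R sketch that groups the aa changes by gene with split() and prints them for every gene that occurs more than once. The data frame and its column names are hypothetical, chosen to match the example in the question, and "checking" is assumed to mean simply listing the values side by side.

```r
# Toy data mirroring the example in the question (values are placeholders).
dat <- data.frame(
  gene_name = c("CASP7", "TP53", "TP53", "TP53", "WNT", "WNT"),
  rs_id     = c("xx", "zz", "ff", "bb", "aa", "qq"),
  aa_change = c("yy", "hh", "cc", "gg", "dd", "kk"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one vector of aa changes per gene.
aa_by_gene <- split(dat$aa_change, dat$gene_name)

# Loop over the genes and report only those with more than one row.
for (gene in names(aa_by_gene)) {
  changes <- aa_by_gene[[gene]]
  if (length(changes) > 1) {
    cat(gene, ": ", length(changes), " rows, aa changes: ",
        paste(changes, collapse = ", "), "\n", sep = "")
  }
}

# Number of distinct aa changes per gene, in case duplicates should be compared.
n_distinct_aa <- sapply(aa_by_gene, function(x) length(unique(x)))
```

The body of the loop is where any real check would go, for example comparing the changes within a gene or flagging genes whose rows all share the same aa change.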