Question

How to get rid of duplicated genes when we also have duplicated Ensembl IDs in the expression profile?

0

4.5 years ago

Raheleh ▴ 260

Hi all,

I have a mouse expression profile that is annotated with gene symbols, and many of them are duplicated. I usually use the collapseRows function with the MaxMean method from the WGCNA package to get rid of duplicated genes. However, this time I realized that some Ensembl IDs are duplicated as well. Can anyone help me with how to deal with this situation? Should I simply remove the duplicated Ensembl IDs and then use collapseRows for the duplicated gene symbols? This is part of my data:

ENSMUSG00000019864  Rtn4ip1 3.33471 2.18619 3.52304 4.13997 2.91682 3.17805
ENSMUSG00000019864  Rtn4ip1 0.141481    0   0.126809    0.140919    0   0.159667
ENSMUSG00000019865  Nmbr    0.0325972   0   0.056908    0.0324288   0.305734    0
ENSMUSG00000019866  Crybg1  8.79001 6.82754 13.9235 15.1803 9.54965 11.3725
ENSMUSG00000019867  Gje1    0   0   0   0   0   0

As you can see, for example, the ID ENSMUSG00000019864 is duplicated with different expression values.

I really appreciate any help or suggestion!

RNA-Seq duplicated ENSEMBLE ID collapseRows • 2.3k views

ADD COMMENT • link updated 3.3 years ago by fana ▴ 40 • written 4.5 years ago by Raheleh ▴ 260

2

It looks like you have transcript-level expression reported. I would add the values for the same condition within the same gene, so Rtn4ip1 should be:

ENSMUSG00000019864  Rtn4ip1 3.33471+0.141481 2.18619+0 3.52304+0.126809  4.13997+0.140919  2.91682+0 3.17805+0.159667
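As a minimal sketch in base R, the same per-gene sum could be computed with rowsum(), shown here on the five rows posted in the question. The column names (ensembl_id, symbol, s1 to s6) are placeholders, since the original headers were not shown.

```r
# Rows copied from the question; all column names are hypothetical.
expr <- data.frame(
  ensembl_id = c("ENSMUSG00000019864", "ENSMUSG00000019864",
                 "ENSMUSG00000019865", "ENSMUSG00000019866",
                 "ENSMUSG00000019867"),
  symbol = c("Rtn4ip1", "Rtn4ip1", "Nmbr", "Crybg1", "Gje1"),
  s1 = c(3.33471, 0.141481, 0.0325972, 8.79001, 0),
  s2 = c(2.18619, 0, 0, 6.82754, 0),
  s3 = c(3.52304, 0.126809, 0.056908, 13.9235, 0),
  s4 = c(4.13997, 0.140919, 0.0324288, 15.1803, 0),
  s5 = c(2.91682, 0, 0.305734, 9.54965, 0),
  s6 = c(3.17805, 0.159667, 0, 11.3725, 0)
)

# Sum the sample columns over all rows that share an Ensembl ID.
sample_cols <- c("s1", "s2", "s3", "s4", "s5", "s6")
gene_sums <- rowsum(expr[, sample_cols], group = expr$ensembl_id)

# One row per Ensembl ID, with the IDs as row names.
print(gene_sums)
```

Whether summing is appropriate depends on the duplicated rows really being transcripts of the same gene, which this sketch simply assumes.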

ADD REPLY • link 4.5 years ago by JC 13k

0

What about taking the average instead? Is there an R package that can do this?

ADD REPLY • link 4.5 years ago by Raheleh ▴ 260

1

If they are transcripts it would make more sense to add them to get gene expression values, since all of those sequencing reads aligned to a transcript from the same gene.

ADD REPLY • link 4.5 years ago by rpolicastro 13k

0

Thanks rpolicastro! Oh yes, that makes more sense. Is there a package for doing this in R?

ADD REPLY • link 4.5 years ago by Raheleh ▴ 260

1

You can use dplyr. Make sure you have dplyr v1.0.0 or higher.

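library("dplyr")

# Columns 1 and 2 hold the Ensembl ID and gene symbol; all other columns are samples.
df <- df %>%
  group_by(across(c(1, 2))) %>%
  # Sum the rows that share an ID/symbol pair, then drop the grouping.
  summarize(across(everything(), sum), .groups = "drop")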

ADD REPLY • link 4.5 years ago by rpolicastro 13k

0

Where did you get the expression profile from, and/or how was it generated? It would be good to first figure out how it ended up with duplicated values.
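One way to start looking into this is to pull out every row whose ID occurs more than once and compare them side by side. This is only a sketch in base R; the column names are placeholders, and the toy table reuses three rows (first sample only) from the question.

```r
# Toy table with hypothetical column names; values are taken from the question.
expr <- data.frame(
  ensembl_id = c("ENSMUSG00000019864", "ENSMUSG00000019864",
                 "ENSMUSG00000019865"),
  symbol = c("Rtn4ip1", "Rtn4ip1", "Nmbr"),
  s1 = c(3.33471, 0.141481, 0.0325972)
)

# IDs that occur more than once.
dup_ids <- unique(expr$ensembl_id[duplicated(expr$ensembl_id)])

# All rows belonging to those IDs, for side-by-side inspection.
dup_rows <- expr[expr$ensembl_id %in% dup_ids, ]
print(dup_rows)

# Number of rows per duplicated ID.
print(table(dup_rows$ensembl_id))
```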

ADD REPLY • link 4.5 years ago by rpolicastro 13k

0

I got it from someone else; she said this is FPKM data from the Cufflinks pipeline.

ADD REPLY • link 4.5 years ago by Raheleh ▴ 260

0

Hi, I've got the same problem. Did you figure out how it ended up with duplicated values? I am not sure if this is transcript expression.

ADD REPLY • link 3.3 years ago by fana ▴ 40