Hi all,
I have a mouse expression profile that is annotated with gene symbols, and many of them are duplicated. I usually use the collapseRows function with the MaxMean method from the WGCNA package to get rid of duplicated genes. However, this time I realized that there are also some duplicated Ensembl IDs. Can anyone help me with how I should deal with this situation? Should I simply remove the duplicated Ensembl IDs and then use collapseRows for the duplicated gene symbols? This is part of my data:
ENSMUSG00000019864 Rtn4ip1 3.33471 2.18619 3.52304 4.13997 2.91682 3.17805
ENSMUSG00000019864 Rtn4ip1 0.141481 0 0.126809 0.140919 0 0.159667
ENSMUSG00000019865 Nmbr 0.0325972 0 0.056908 0.0324288 0.305734 0
ENSMUSG00000019866 Crybg1 8.79001 6.82754 13.9235 15.1803 9.54965 11.3725
ENSMUSG00000019867 Gje1 0 0 0 0 0 0
As you can see, for example, the ID ENSMUSG00000019864 is duplicated with different expression values.
I would really appreciate any help or suggestions!
It looks like transcript-level expression is being reported. I would add the values for the same condition within the same gene, so the two Rtn4ip1 rows would be summed column by column into a single row.
What about taking the average instead? Is there any R package that can do this?
If they are transcripts it would make more sense to add them to get gene expression values, since all of those sequencing reads aligned to a transcript from the same gene.
Thanks rpolicastro! Oh yes, that makes more sense. Is there any package for doing this in R?
You can use dplyr. Make sure you have dplyr v1.0.0 or higher.
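Something along these lines should work as a minimal sketch. The data frame is just the rows posted in the question; the column names `ensembl_id`, `symbol`, and `s1` to `s6` are made up for illustration, since the real header was not shown. The version requirement comes from `across()`, which was introduced in dplyr 1.0.0.

```r
library(dplyr)  # across() and .groups need dplyr >= 1.0.0

# Rows copied from the question; column names are assumed, not from the real file.
expr <- data.frame(
  ensembl_id = c("ENSMUSG00000019864", "ENSMUSG00000019864",
                 "ENSMUSG00000019865", "ENSMUSG00000019866",
                 "ENSMUSG00000019867"),
  symbol = c("Rtn4ip1", "Rtn4ip1", "Nmbr", "Crybg1", "Gje1"),
  s1 = c(3.33471, 0.141481, 0.0325972, 8.79001, 0),
  s2 = c(2.18619, 0, 0, 6.82754, 0),
  s3 = c(3.52304, 0.126809, 0.056908, 13.9235, 0),
  s4 = c(4.13997, 0.140919, 0.0324288, 15.1803, 0),
  s5 = c(2.91682, 0, 0.305734, 9.54965, 0),
  s6 = c(3.17805, 0.159667, 0, 11.3725, 0),
  stringsAsFactors = FALSE
)

# Sum every numeric column within each Ensembl ID / symbol pair,
# so each gene ends up with exactly one row.
gene_level <- expr %>%
  group_by(ensembl_id, symbol) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")

print(gene_level)
```

If you would rather average than add, swapping `sum` for `mean` in the `across()` call gives the per-gene mean instead.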
Where did you get the expression profile from, and/or how was it generated? It would be good to first figure out how it ended up with duplicated values.
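As a starting point for that check, here is a rough base R sketch that pulls out every row whose ID occurs more than once, so the duplicates can be compared side by side. The column names are assumptions, and the toy data is only the first three rows from the question.

```r
# Toy subset of the posted rows; column names are assumed for illustration.
expr <- data.frame(
  ensembl_id = c("ENSMUSG00000019864", "ENSMUSG00000019864",
                 "ENSMUSG00000019865"),
  symbol = c("Rtn4ip1", "Rtn4ip1", "Nmbr"),
  s1 = c(3.33471, 0.141481, 0.0325972),
  s2 = c(2.18619, 0, 0),
  stringsAsFactors = FALSE
)

# IDs that appear on more than one row
dup_ids <- unique(expr$ensembl_id[duplicated(expr$ensembl_id)])

# Every row belonging to one of those IDs, including the first occurrence
dup_rows <- expr[expr$ensembl_id %in% dup_ids, ]

print(dup_rows)
```

If the original table has extra columns such as a transcript or tracking ID, looking at them for these rows may show why the same gene is listed twice.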
I got it from someone else; she said it is FPKM data from the Cufflinks pipeline.
Hi, I've got the same problem. Did you figure out how it ended up with duplicated values? I am not sure if this is transcript expression.