duplicates issues when trying to convert long to wide in R
1
0
Entering edit mode
13 months ago
library(dplyr)
library(tibble)
library(tidyr)

df <- test %>%
  mutate(row_id = model_name) %>%
  pivot_wider(names_from = gene_symbol, values_from = fpkm) ### 
Warning message:
Values from `fpkm` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(model_name, row_id, gene_symbol) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L)

The above works, but df$gene is a list ; not as the normal data frame;

df$geneA

[[911]]
[1] 0.32

[[912]]
[1] 0.3

[[913]]
[1] 0.14

[[914]]
[1] 0.22

[[915]]
[1] 0.31

[[916]]
[1] 0

[[917]]
[1] 0.04

enter image description here

Thank you very much for all your guidance!

input data download the board cell line RNAseq data https://cellmodelpassports.sanger.ac.uk/downloads

R • 1.2k views
ADD COMMENT
1
Entering edit mode

There are likely duplicate values of gene_symbol. Run the code provided in the warning message to check if this is the case.

df %>%
  dplyr::group_by(model_name, row_id, gene_symbol) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L)

If there are duplicates you need to decide whether duplicates should be removed or made unique (by for example adding a suffix to each one).

ADD REPLY
0
Entering edit mode

Thank you very much for all your guidance!

It is indeed likely to be the gene_symbol "duplicates" issues. Say model_name1 have 20,000 genes detected, model_name2 have 22,000 genes detected, when transform the data frame, it may not automatically use the union of 20,000 genes + 22,000 genes...

All the best!

ADD REPLY
1
Entering edit mode

Gene names are a mess. My advise is to use geneID_geneName so Ensembl gene ID then then an underscore followed by the gene name. That is guaranteed to be unique, and if for plotting you need gene name you just do a quick regex to take the part trailing the underscore.

ADD REPLY
0
Entering edit mode

Thank you very much! Very helpful guidance!Appreciate!

ADD REPLY
1
Entering edit mode

How many models do you have in this data? Creating wider table for each model separately could be another option to not to deal with the duplicate gene IDs.

ADD REPLY
0
Entering edit mode

Thanks a lot!

It is 925 model_name, each with near 32,000 genes. I used the unique() to get the unique set, even there are warnings about the duplicates for fpkm values, I convert the list to normal values use unname(unlist(x)) function to df$genes, and it seems works...

Thank you very much! Appreciate!

ADD REPLY
1
Entering edit mode
13 months ago
bk11 ★ 3.0k

You can use reshape package to do it easily.

library(reshape2)

test_data=read.table("test_data.txt", h=T, sep="\t")
head(test_data)
  model_name gene_symbol fpkm
1      22RV1        A1BG 0.12
2      22RV1    A1BG-AS1 1.40
3      22RV1        A1CF 5.28
4      22RV1         A2M 1.57
5      22RV1     A2M-AS1 1.29
6      22RV1       A2ML1 0.11

reshape(test_data, idvar = "model_name", timevar = "gene_symbol", direction = "wide")

model_name fpkm.A1BG fpkm.A1BG-AS1 fpkm.A1CF fpkm.A2M fpkm.A2M-AS1 fpkm.A2ML1
1      22RV1      0.12           1.4      5.28     1.57         1.29       0.11
ADD COMMENT
1
Entering edit mode

This doesn't address the original question about duplicate values, and the reshape2 library is generally deprecated in favor of pivot_wider and pivot_longer from tidyr.

ADD REPLY

Login before adding your answer.

Traffic: 2912 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6