12 months ago
happyday1661
library(dplyr)
library(tibble)
library(tidyr)
df <- test %>%
  mutate(row_id = model_name) %>%
  pivot_wider(names_from = gene_symbol, values_from = fpkm)
Warning message:
Values from `fpkm` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
{data} %>%
dplyr::group_by(model_name, row_id, gene_symbol) %>%
dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
dplyr::filter(n > 1L)
The above works, but each gene column (for example df$geneA) is a list rather than an ordinary numeric column of a data frame:
df$geneA
[[911]]
[1] 0.32
[[912]]
[1] 0.3
[[913]]
[1] 0.14
[[914]]
[1] 0.22
[[915]]
[1] 0.31
[[916]]
[1] 0
[[917]]
[1] 0.04
Thank you very much for all your guidance!
Input data: the Broad cell line RNA-seq data, downloaded from https://cellmodelpassports.sanger.ac.uk/downloads
There are likely duplicate values of gene_symbol. Run the code provided in the warning message to check if this is the case.
If there are duplicates you need to decide whether duplicates should be removed or made unique (by for example adding a suffix to each one).
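As a minimal sketch of both options, assuming a long table with the columns model_name, gene_symbol and fpkm as in the question (the toy values below are invented for illustration), one could first count the duplicates and then either collapse them or make the symbols unique before pivoting:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for `test`; the values are invented for illustration only.
test <- tibble::tibble(
  model_name  = c("m1", "m1", "m1", "m2", "m2"),
  gene_symbol = c("geneA", "geneA", "geneB", "geneA", "geneB"),
  fpkm        = c(0.32, 0.30, 0.14, 0.22, 0.31)
)

# 1. Find model/gene pairs that occur more than once.
dups <- test %>%
  count(model_name, gene_symbol) %>%
  filter(n > 1L)

# 2a. Collapse duplicates while pivoting (the mean is an arbitrary choice).
wide_mean <- test %>%
  pivot_wider(names_from = gene_symbol, values_from = fpkm, values_fn = mean)

# 2b. Or keep every value by suffixing repeated symbols within each model.
wide_unique <- test %>%
  group_by(model_name) %>%
  mutate(gene_symbol = make.unique(gene_symbol, sep = "_")) %>%
  ungroup() %>%
  pivot_wider(names_from = gene_symbol, values_from = fpkm)
```

With either variant the gene columns come back as plain numeric vectors instead of list-columns. Whether averaging or suffixing is appropriate depends on why the symbols repeat in the source data.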
Thank you very much for all your guidance!
It is indeed likely to be a gene_symbol "duplicates" issue. Say model_name1 has 20,000 genes detected and model_name2 has 22,000 genes detected; when transforming the data frame, it may not automatically use the union of the 20,000 + 22,000 genes...
All the best!
Gene names are a mess. My advice is to use
geneID_geneName
that is, the Ensembl gene ID, then an underscore, followed by the gene name. That is guaranteed to be unique, and if you need the gene name for plotting you can use a quick regex to take the part trailing the underscore.
Thank you very much! Very helpful guidance! Much appreciated!
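The geneID_geneName suggestion could look roughly like the sketch below. The column name gene_id and the IDs themselves are hypothetical placeholders, not taken from the actual download; the regex assumes the ID part contains no underscore.

```r
library(dplyr)

# Toy annotation; `gene_id` is a hypothetical column name and the IDs are made up.
genes <- tibble::tibble(
  gene_id     = c("ENSG000001", "ENSG000002", "ENSG000003"),
  gene_symbol = c("geneA", "geneA", "geneB")
)

# Build a unique key of the form geneID_geneName.
genes <- genes %>%
  mutate(gene_key = paste(gene_id, gene_symbol, sep = "_"))

# Recover the gene name for plotting: drop everything up to and
# including the first underscore.
genes <- genes %>%
  mutate(label = sub("^[^_]*_", "", gene_key))
```

Using gene_key in names_from keeps two genes that share a symbol in separate columns, and label restores the readable name afterwards.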
How many models do you have in this data? Creating a wider table for each model separately could be another option to avoid dealing with the duplicate gene IDs.
Thanks a lot!
There are 925 model_name values, each with nearly 32,000 genes. I used unique() to get the unique set; although there are still warnings about duplicate fpkm values, I converted the list to ordinary values by applying
unname(unlist(x))
to df$genes, and it seems to work... Thank you very much! Much appreciated!
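One caveat with unname(unlist(x)): it only lines up with the rows if every cell of the list-column holds exactly one value. A small sketch of how one might check this, using an invented list x in place of a real column such as df$geneA:

```r
# Toy list-column shaped like pivot_wider() output; values are invented.
x <- list(0.32, c(0.30, 0.14), 0.22)

# Number of values stored in each cell; anything other than 1 is a problem.
n_per_cell <- lengths(x)

# unlist() changes the length when a cell holds 0 or 2+ values,
# so the result no longer matches the rows of the data frame.
flat <- unname(unlist(x))
same_length <- length(flat) == length(x)

# A safer conversion: exactly one value per cell, NA for empty cells,
# and the mean of duplicates (the mean is an arbitrary choice).
safe <- vapply(
  x,
  function(v) if (length(v) == 0L) NA_real_ else mean(v),
  numeric(1)
)
```

If all(n_per_cell == 1) is TRUE for a column, the unlist() approach gives the same result as the safer version.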