Hi community,
I have a problem creating a matrix from downloaded expression data. My data (xx) looks like this:
sample_id raw_read_count gene_id normalized_read_count
CPCG0402-F1 2 "DDX11L1" 0.00680125380953093
CPCG0402-F1 157 "WASH7P" 1.50386916339037
CPCG0402-F1 0 "RP11-34P13.3" 0
CPCG0402-F1 0 "FAM138A" 0
CPCG0402-F1 0 "OR4G4P" 0
CPCG0402-F1 0 "OR4G11P" 0
Someone suggested converting my table into a matrix using this code:
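library(tidyverse)  # dplyr, tidyr and tibble provide the verbs used below

mat <- xx %>%
  select(!normalized_read_count) %>%                                    # keep raw counts only
  pivot_wider(names_from = sample_id, values_from = raw_read_count) %>% # one column per sample
  column_to_rownames("gene_id") %>%                                     # genes become row names
  as.matrix()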
This works perfectly for other data sets, but when I run it on this data I get the following warning message:
Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates
and, of course, the output contains list-cols.
I've tried unique() and distinct(), but neither works.
I also tried converting hgnc_symbol to ensembl_gene_id using biomaRt, but it doesn't make any difference. Any suggestions? Thanks!!
Test the dplyr pipeline step by step. Where do the duplicates lie?
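One way to locate the duplicates before pivoting is to count rows per gene/sample pair. The following is only a sketch on a toy stand-in for xx (the real table is not available here); the duplicated WASH7P row uses the two counts reported for that gene in this thread. Note that distinct() only drops rows that are identical in every column, so it cannot remove two rows that share gene_id and sample_id but differ in raw_read_count.

```r
library(dplyr)

# Toy stand-in for xx (assumption: same column names as the real table).
xx <- tibble::tribble(
  ~sample_id,    ~raw_read_count, ~gene_id,
  "CPCG0402-F1",   2,             "DDX11L1",
  "CPCG0402-F1", 157,             "WASH7P",
  "CPCG0402-F1", 207,             "WASH7P",
  "CPCG0402-F1",   0,             "FAM138A"
)

# Count rows per (gene_id, sample_id) pair; any pair with n > 1
# is what pivot_wider() turns into a list-col.
dups <- xx %>%
  count(gene_id, sample_id, name = "n") %>%
  filter(n > 1)
print(dups)

# Pull the full offending rows to see how they differ.
dup_rows <- xx %>%
  semi_join(dups, by = c("gene_id", "sample_id"))
print(dup_rows)
```

If dups comes back empty on the real data, the problem lies elsewhere (for example in the sample names).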
In the pivot_wider step.
Try adding id_cols=gene_id to the function.
It does not work, but thank you for the suggestion.
Try adding names_repair="unique" to pivot_wider and then compare xx$sample_id with colnames(mat) to see what's going on. Maybe several different delimiters are being cleaned to "." and so get treated as duplicates.
I found the problem, but I don't know how to solve it. Some gene_id values appear more than once for the same patient. For example, WASH7P takes the values 157 and 207 for patient CPCG0402-F1. How can I solve this? Any clue?
You'll need to go back to where this data came from, because this sounds like an identifier mapping problem. Maybe these entries had different ENSG identifiers, one of which is on a canonical chromosome and the other(s) on patches/alt contigs.
Yes, I know what the problem is, but there is nothing I can do because this is the only data for the study.
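If the source really cannot be fixed, one possible workaround is to collapse the duplicated gene/sample pairs while pivoting, using the values_fn argument that the warning itself points to. The sketch below sums the duplicates on a toy stand-in for xx. Summing is purely an assumption here: whether sum, max, or keeping only one of the entries is appropriate depends on why the symbols are duplicated (for instance the identifier mapping issue mentioned above), so the choice should be checked against how the counts were generated.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for xx (assumption: same column names as the real table,
# with the normalized column already dropped).
xx <- tribble(
  ~sample_id,    ~raw_read_count, ~gene_id,
  "CPCG0402-F1",   2,             "DDX11L1",
  "CPCG0402-F1", 157,             "WASH7P",
  "CPCG0402-F1", 207,             "WASH7P",
  "CPCG0402-F1",   0,             "FAM138A"
)

mat <- xx %>%
  pivot_wider(
    id_cols     = gene_id,
    names_from  = sample_id,
    values_from = raw_read_count,
    values_fn   = sum   # assumption: duplicated pairs are added together
  ) %>%
  column_to_rownames("gene_id") %>%
  as.matrix()

print(mat)
```

With values_fn supplied, each cell holds a single number, so the warning goes away and as.matrix() returns a plain numeric matrix instead of one containing list-cols.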