Question

Issues with Mixture file when using CIBERSORTx

15 months ago

mateomejias • 0

Hi,

I am trying to run a deconvolution analysis of bulk RNA-seq samples using the provided LM22 signature matrix. I converted all Ensembl IDs to gene symbols and removed NA and duplicated entries.

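library(org.Hs.eg.db)  # also attaches AnnotationDbi, which provides mapIds()
library(dplyr)
library(tibble)

# Gene-level counts, with Ensembl IDs as row names
counts_salmon <- as.data.frame(txi$counts)

# Map each Ensembl ID to a gene symbol (NA where no mapping exists)
counts_salmon$symbol <- mapIds(org.Hs.eg.db,
                            keys = rownames(counts_salmon),
                            column = "SYMBOL",
                            keytype = "ENSEMBL")

# Keep one row per symbol, drop unmapped genes, and use symbols as row names
counts_salmon <- counts_salmon |>
  distinct(symbol, .keep_all = TRUE) |>
  rownames_to_column(var = "ensbl") |>
  dplyr::select(-ensbl) |>
  filter(!is.na(symbol)) |>
  column_to_rownames(var = "symbol")

counts_salmon <- na.omit(counts_salmon)

write.table(counts_salmon, file = 'output/counts_salmon.tsv', append = FALSE, sep = "\t",
            row.names = TRUE, col.names = TRUE, quote = FALSE)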

The output is a .tsv file without double quotes:

Genes   rna_11  RNA_26  RNA_8   RNA_16  RNA_19  rna_47  RNA_3   RNA_24
TSPAN6  0   0   8   0   249.567 76.756  26.741  308.308
TNMD    0   0   0   0   0   0   38  0
DPM1    58.092  31.013  0   67  303.226 570.16  48.289  1078.792
SCYL3   39.036  42.86   0   0   27.801  146.749 7   414.861
C1orf112    14  1   0   0   38.234  91.923  87.89   165.261
FGR 0   47  0   1   25  69  0   158
...

I'm using this file as the mixture file input in CIBERSORTx, with the following parameters:

[Options] perm: 1
[Options] verbose: TRUE
[Options] rmbatchBmode: TRUE
[Options] QN: FALSE
[Options] outdir: files/mam9823@med.cornell.edu/results/
[Options] label: Job11
=============CIBERSORTx Settings===============
Mixture file: files/mam9823@med.cornell.edu/counts_salmon.tsv 
Signature matrix file: files/common/LM22.update-gene-symbols.txt 
Number of permutations set to: 1 
Enable verbose output
Do B-mode batch correction
==================CIBERSORTx===================
All done.

However, I keep getting this error:

Error: $ operator is invalid for atomic vectors
In addition: Warning messages:
1: In CIBERSORTxFractions(sigmatrix = sigmatrix, mixture = mixture,  :
  22292 duplicated gene symbol(s) found in mixture file!
2: In mclapply(1:svn_itor, res, mc.cores = svn_itor) :
  all scheduled cores encountered errors in user code
Execution halted

Thanks a lot in advance for any help!

Deconvolution CIBERSORTx • 1.6k views

ADD COMMENT • link updated 7 months ago by vjanve • 0 • written 15 months ago by mateomejias • 0

I got exactly the same error message. So I am curious to see how other people solved this... Did you already contact the authors about this?

ADD REPLY • link 14 months ago by pmonsieu • 0

Hello! I hit the same error and managed to work out how it happens. The "duplicated gene symbol(s)" in the warning actually refers to the first column of expression data (NOT the row names) in your mixture file, which means CIBERSORTx mistakenly read your first column of expression values as the gene symbols.

This is the probable cause: when you run write.table in R with row.names = TRUE and col.names = TRUE, the row names are written as the first field of every data row, but the header line gets no name for that column. The header therefore has one field fewer than the data rows, so the REAL first column (the gene symbols) has no column name and is not recognized, and the error occurs.

Here's my solution (it works):

mixture_file <- cbind(rownames(mixture_file), mixture_file)
write.table(mixture_file, file = "mixture_file.txt", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
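If it helps to see the header mismatch directly, here is a minimal sketch on a made-up three-gene matrix. The object names, the `Gene` column header, and the `fields_per_line()` helper are my own illustrative choices, not anything CIBERSORTx requires; it only counts tab-separated fields per line for the two ways of writing the file.

```r
# Toy mixture matrix: gene symbols as row names, samples as columns (made-up values).
mixture <- data.frame(
  RNA_1 = c(0, 58.092, 39.036),
  RNA_2 = c(8, 31.013, 42.86),
  row.names = c("TSPAN6", "DPM1", "SCYL3")
)

# Hypothetical helper: number of tab-separated fields on each line of a file.
fields_per_line <- function(path) {
  lengths(strsplit(readLines(path), "\t", fixed = TRUE))
}

# 1) row.names = TRUE: every data row starts with the gene symbol,
#    but the header line has no name for that column.
bad <- tempfile(fileext = ".tsv")
write.table(mixture, file = bad, sep = "\t",
            row.names = TRUE, col.names = TRUE, quote = FALSE)
bad_counts <- fields_per_line(bad)

# 2) Gene symbols as an explicit, named first column; row names are not written.
good <- tempfile(fileext = ".tsv")
mixture_out <- cbind(Gene = rownames(mixture), mixture)
write.table(mixture_out, file = good, sep = "\t",
            row.names = FALSE, col.names = TRUE, quote = FALSE)
good_counts <- fields_per_line(good)

print(bad_counts)   # header is one field shorter than the data rows
print(good_counts)  # header and data rows have the same number of fields
```

In the first file the header has two fields while each data row has three; in the second, every line has three. Running the same field count on your own mixture file before uploading is a quick way to catch this.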

ADD REPLY • link 13 months ago by Hongjin ▴ 20

score 0 · Answer 1 · 2024-05-04

I ran into the same "duplicated gene symbol(s)" issue:

1: In CIBERSORTxFractions(sigmatrix = sigmatrix, mixture = mixture,  :
  22292 duplicated gene symbol(s) found in mixture file!

To address this, I checked for repeated entries in the gene symbol column and found:

    gene_id          gene_symbol
 1  ENSG00000228037
 2  ENSG00000080947  CROCCP3
 3  ENSG00000215908  CROCCP2
 4  ENSG00000240356  RPL23AP7
 5  ENSG00000161912  ADCY10P1
 6  ENSG00000168255  POLR2J3
 7  ENSG00000182487  NCF1B
 8  ENSG00000258086  GPR84-AS1
 9  ENSG00000206149  HERC2P9
10  ENSG00000183604  SMG1P5
11  ENSG00000261556  SMG1P7

Note that a missing gene_symbol (a blank " " entry) also counts as a duplicate entry. Once these were removed, the program no longer gave this error and ran fine. Hope this helps.

Here are the duplicate counts for these gene_ids:

gene_id          occurrence count
ENSG00000228037  3442
ENSG00000265961  33
ENSG00000277141  15
ENSG00000080947  2
ENSG00000161912  2
ENSG00000168255  2
ENSG00000182487  2
ENSG00000183604  2
ENSG00000206149  2
ENSG00000215908  2
ENSG00000240356  2
ENSG00000258086  2
ENSG00000261556  2
ENSG00000278757  2
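For anyone who wants to run the same check, here is a rough base R sketch of how blank and repeated symbols can be tallied and dropped. The `annot` table, its placeholder IDs and symbols, and the choice to keep the first occurrence of each repeated symbol are all assumptions for illustration; summing counts or keeping the most highly expressed row are equally reasonable alternatives.

```r
# Toy annotation table with placeholder IDs (not real Ensembl genes).
annot <- data.frame(
  gene_id     = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D", "ENSG_E"),
  gene_symbol = c("", "GENE1", "GENE1", "GENE2", NA),
  stringsAsFactors = FALSE
)

# Treat NA and blank/whitespace-only symbols as missing.
sym <- trimws(annot$gene_symbol)
missing_symbol <- is.na(sym) | sym == ""

# Tally how often each non-missing symbol occurs and list the repeated ones.
symbol_counts <- table(sym[!missing_symbol])
dup_symbols <- names(symbol_counts)[symbol_counts > 1]

# Assumption: keep only the first row for each repeated symbol,
# and drop every row whose symbol is missing.
keep <- !missing_symbol & !duplicated(sym)
clean <- annot[keep, ]

print(dup_symbols)
print(clean)
```

After this step every remaining gene symbol is unique and non-empty, which is what the duplicated-symbol warning is checking for.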