Hello everyone,
When I run the bitr command, I get warning message saying that 58.9% of input gene IDs are fail to map..., like the following:
library(ggplot2)
library(clusterProfiler)
> gs.up = bitr(up, fromType="SYMBOL", toType="ENTREZID", OrgDb="org.Hs.eg.db")
'select()' returned 1:1 mapping between keys and columns
Warning message:
In bitr(up, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db") :
58.9% of input gene IDs are fail to map...
The input genes are:
> up
[1] "KRT14" "S100A8" "AC114760.2" "S100A9" "KRT6A"
[6] "KRT16" "RPL13A" "TRAC" "MALAT1" "RPL27A"
[11] "RPL7" "ACTB" "RPS2" "S100A2" "RPL21"
[16] "TMSB4X" "RPL31" "PSMA2" "CD52" "S100A4"
[21] "RPS17" "CD3D" "PFN1" "RPS20" "MYL12A"
[26] "CTSD" "GZMA" "B2M" "RPS10" "ARPC4"
[31] "CORO1A" "CCL3" "CKLF" "NDUFB8" "TMBIM4"
[36] "SH3BGRL3" "ID2" "IL32" "RPL23" "ARF6"
[41] "ARPC1B" "CCL4" "RGS1" "TMSB10" "CFL1"
[46] "IL2RG" "RPS11" "TRGC2" "IFNG" "EVL"
[51] "GMFG" "ACTG1" "RPS18" "ATP5PO" "RPLP2"
[56] "VAMP2" "GZMB" "RPL15" "S100A6" "CD63"
[61] "CD2" "CYBA" "RPL34" "NDUFA11" "CD69"
[66] "CAPG" "LGALS1" "RPL3" "MZT2B" "NEAT1"
[71] "HCST" "LSP1" "CHMP4A" "APOBEC3C" "CCL5"
Do you have any suggestions on how to address these problems?
With many thanks!
What is the source of the gene symbols you're trying to convert, i.e. what genome annotation and steps were taken to generate this?
The gene symbols are from a Seurat object.
You need to provide more information than this otherwise we won't be able to answer the question. There are multiple genome annotation sources, multiple annotation versions per species, and software used to generate the count matrices can report the features in different ways.
The dataset is in rds format. It was download from DISCO, a single-cell database (Publication: https://doi.org/10.1093/nar/gkab1020). Accordint to this DISCO paper, the datasets deposited in this database were retrived as raw counts (fastq, SRA, or bam files) from datases (GEO, ArrayExpress, GSA, and other public resources) and preprocessed. The SRA and bam files were converted to fastq file. The data were demultiplexed, and then reads tagged with a cell barcode and UMI were mapped to the human reference genome assembly hg38 using STAR (version 020201) , and then assigned to the corresponding gene (Ensembl 93) with featureCounts (version 2.0.2).
Great, thanks for the additional info 👍
One last question, are there any genes in ENSEMBL gene ID format (e.g. ENSG0000001)?