Question

Microarray - multiple probe-ids matching to the same gene symbol but different ensembl_gene_id

21 months ago

manaswwm ▴ 560

Hello all,

Newbie in microarray analysis here - I am currently trying to run a differential expression analysis on some Affymetrix microarray data. I know that the array used in the experiment was the HG U95A. I am trying to identify the corresponding ensembl_gene_id for every probe ID using this biomaRt code:

library(biomaRt)

#declaring hsap mart
hsap_mart = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")

#extracting the Ensembl gene IDs for a given Affymetrix probe ID
affy_probe_genenames = getBM(attributes = c("ensembl_gene_id", "affy_hg_u95a"),
                             filters = "affy_hg_u95a", values = "1007_s_at",
                             mart = hsap_mart, useCache = FALSE)

I notice that for probe 1007_s_at I get the following 5 ensembl_gene_ids: "ENSG00000234078", "ENSG00000137332", "ENSG00000230456", "ENSG00000215522" and "ENSG00000204580".

Since there is only one expression value for 1007_s_at in the dataset, I was wondering how one usually chooses a single ensembl_gene_id when, as in this case, a probe ID maps to multiple gene IDs.

All 5 Ensembl gene IDs have the same gene symbol (DDR1).

Thanks in advance!

microarray affymetrix • 1.7k views


score 1 · Answer 1 · 2023-08-04

Hello,

It appears to be related to the fact that, at this locus, there are alternate haplotype sequences, which each have their own ENSG ID for this gene. One can take a look at the locus targeted by this probe at the UCSC Genome Browser: https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&las...

The alternative sequences in question are labelled 'chr6_GL000251v2_alt'.

The 'true' ID of the gene, i.e. the one on the primary assembly, seems to be ENSG00000204580, based on the Ensembl and UCSC records.

I am unsure how to automate the correct selection of the ENSG ID in cases like this. Perhaps if you also pull the contig / chromosome name via biomaRt, the difference will show up there. Alternatively, you could also pull entrezgene_id and, hopefully, it will be blank for those ENSG IDs that are on the alternate sequences.
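A minimal sketch of the first suggestion, assuming that genes on the primary assembly are reported with a chromosome_name of 1 to 22, X, Y or MT, while alternate haplotypes carry contig-style names. The helper keep_primary_assembly() is hypothetical (it is not part of biomaRt), and the two columns are copied from the getBM() output for 1007_s_at posted in this thread so that the example runs offline:

```r
# Ensembl gene IDs and chromosome names for probe 1007_s_at,
# copied from the getBM() output posted in this thread.
hits <- data.frame(
  ensembl_gene_id = c("ENSG00000234078", "ENSG00000137332", "ENSG00000230456",
                      "ENSG00000215522", "ENSG00000204580"),
  chromosome_name = c("HSCHR6_MHC_MANN_CTG1", "HSCHR6_MHC_COX_CTG1",
                      "HSCHR6_MHC_DBB_CTG1", "HSCHR6_MHC_QBL_CTG1", "6"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows located on a primary-assembly chromosome.
# Assumption: primary chromosomes are named 1..22, X, Y and MT.
keep_primary_assembly <- function(df,
                                  primary = c(as.character(1:22), "X", "Y", "MT")) {
  df[df$chromosome_name %in% primary, , drop = FALSE]
}

primary_hits <- keep_primary_assembly(hits)
print(primary_hits$ensembl_gene_id)
```

The same filter could be applied to the full probe-to-gene table before merging it with the expression matrix. Probes that still map to more than one gene afterwards would need a separate rule.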

Kind regards,

Kevin

score 1 · Answer 2 · 2023-08-04

You could choose the longest gene that also has a cytogenetic band. In your case, ENSG00000204580 is the longest and is the only one with band information.

library(biomaRt)

hsap_mart = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")

#pulling symbol, location, band and Entrez ID alongside the Ensembl gene ID
affy_probe_genenames = getBM(attributes = c("ensembl_gene_id", "affy_hg_u95a", "hgnc_symbol",
                                            "chromosome_name", "band", "entrezgene_id",
                                            "start_position", "end_position"),
                             filters = "affy_hg_u95a", values = "1007_s_at",
                             mart = hsap_mart, useCache = FALSE)
#gene size as the span between the start and end coordinates
affy_probe_genenames$size = affy_probe_genenames$end_position - affy_probe_genenames$start_position
affy_probe_genenames

  ensembl_gene_id affy_hg_u95a hgnc_symbol      chromosome_name   band entrezgene_id start_position end_position  size
1 ENSG00000234078    1007_s_at        DDR1 HSCHR6_MHC_MANN_CTG1                  780        2191217      2210394 19177
2 ENSG00000137332    1007_s_at        DDR1  HSCHR6_MHC_COX_CTG1                  780        2360744      2379927 19183
3 ENSG00000230456    1007_s_at        DDR1  HSCHR6_MHC_DBB_CTG1                  780        2137270      2156466 19196
4 ENSG00000215522    1007_s_at        DDR1  HSCHR6_MHC_QBL_CTG1                  780        2136143      2155326 19183
5 ENSG00000204580    1007_s_at        DDR1                    6 p21.33           780       30876421     30900156 23735
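The rule described above (prefer a gene with a cytogenetic band, then take the longest) can be sketched for a whole probe table as follows. This is only an illustration: pick_one_gene_per_probe() is a hypothetical helper, not a biomaRt function, and it assumes that a missing band comes back as an empty string or NA. The data frame reuses the values from the output above so that the example runs without querying Ensembl:

```r
# Values copied from the getBM() output above for probe 1007_s_at.
hits <- data.frame(
  ensembl_gene_id = c("ENSG00000234078", "ENSG00000137332", "ENSG00000230456",
                      "ENSG00000215522", "ENSG00000204580"),
  affy_hg_u95a    = "1007_s_at",
  band            = c("", "", "", "", "p21.33"),
  start_position  = c(2191217, 2360744, 2137270, 2136143, 30876421),
  end_position    = c(2210394, 2379927, 2156466, 2155326, 30900156),
  stringsAsFactors = FALSE
)
hits$size <- hits$end_position - hits$start_position

# Hypothetical helper: keep one row per probe, preferring rows that have a
# band and, among those, the largest size.
pick_one_gene_per_probe <- function(df, probe_col = "affy_hg_u95a") {
  has_band <- !is.na(df$band) & df$band != ""
  # Sort by probe, then rows with a band first, then by decreasing size.
  ord <- order(df[[probe_col]], !has_band, -df$size)
  df <- df[ord, , drop = FALSE]
  # The first row of each probe is now the preferred one.
  df[!duplicated(df[[probe_col]]), , drop = FALSE]
}

best <- pick_one_gene_per_probe(hits)
print(best[, c("affy_hg_u95a", "ensembl_gene_id", "band", "size")])
```

Because the sort is done per probe, the same call should work on a table covering every probe on the array, returning one row for each probe ID.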