Question

What is the difference between agilent_wholegenome and agilent_wholegenome_4x44k_v1 in biomart?

20 months ago

solarchan7 • 0

I am trying to convert some Agilent IDs into Ensembl gene IDs with biomaRt in R, but I see there are three attributes for Agilent IDs: agilent_wholegenome, agilent_wholegenome_4x44k_v1, and agilent_wholegenome_4x44k_v2.

The Agilent website only explains what 4x44K is, not the differences between these three: https://www.agilent.com/en/product/cgh-cgh-snp-microarray-platform/cgh-cgh-snp-microarrays/human-microarrays/human-genome-cgh-microarray-kit-4x44k-228410

Which one is more appropriate?

Thank you

genetics biomaRt gene • 765 views

Ram · Accepted Answer · 2023-03-20

We can take a look at the content of those three attributes to try and figure this out. Here's some code to extract each set of Agilent IDs:

library(biomaRt)
human.mart <- useEnsembl(biomart = "genes", dataset="hsapiens_gene_ensembl")

## get the three sets of Agilent IDs
wg <- getBM(attributes = "agilent_wholegenome", mart = human.mart)
wg44v1 <- getBM(attributes = "agilent_wholegenome_4x44k_v1", mart = human.mart)
wg44v2 <- getBM(attributes = "agilent_wholegenome_4x44k_v2", mart = human.mart)

Now we can compare the IDs to see if there's any overlap:

table(wg$agilent_wholegenome %in% wg44v1$agilent_wholegenome_4x44k_v1)
#> 
#>  TRUE 
#> 32263
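
A one-directional %in% check only shows that every ID in the first vector occurs in the second. As a minimal sketch, with a few toy IDs standing in for the real query results (the vector names ids_a and ids_b are hypothetical), the match can be confirmed in both directions like this:

```r
# Toy stand-ins for wg$agilent_wholegenome and
# wg44v1$agilent_wholegenome_4x44k_v1 (illustrative subset only;
# the real vectors come from the getBM() calls).
ids_a <- c("A_24_P179339", "A_24_P42453", "A_24_P179336")
ids_b <- c("A_24_P42453", "A_24_P179336", "A_24_P179339")

all_a_in_b <- all(ids_a %in% ids_b)   # every ID of a is found in b
all_b_in_a <- all(ids_b %in% ids_a)   # every ID of b is found in a
same_set   <- setequal(ids_a, ids_b)  # TRUE only if both directions hold
```

If same_set is TRUE on the full vectors, the two attributes contain the same IDs regardless of order.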

## compare against the V2 column of wg44v2
table(wg$agilent_wholegenome %in% wg44v2$agilent_wholegenome_4x44k_v2)

The first result shows that every agilent_wholegenome ID is also present in agilent_wholegenome_4x44k_v1, so the two attributes appear to hold the same probe set and you can probably use either one if you have V1 arrays. I'm not sure why this is duplicated in BioMart. V2, on the other hand, is a different probe set, although not a disjoint one: A_24_P182122 and A_23_P431853 appear in both of the previews below.

Ideally the array data you're using would be accompanied by metadata that tells you whether it was V1 or V2. If you can't find that, looking at the IDs you do have may still be enough to identify the version, as long as you look for IDs that occur in only one of the two sets. Here are the first six IDs from each platform:

head(wg44v1$agilent_wholegenome_4x44k_v1)
#> [1] "A_24_P179339" "A_24_P42453"  "A_24_P179336" "A_23_P331028" "A_24_P182122"
#> [6] "A_23_P431853"
head(wg44v2$agilent_wholegenome_4x44k_v2)
#> [1] "A_24_P182122"  "A_23_P431853"  "A_23_P402751"  "A_33_P3410700"
#> [5] "A_23_P301925"  "A_23_P337726"
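
The check described above can be sketched as a small helper that counts how many of your own IDs fall in only one version. This is an illustration under assumptions: count_version_hits() and my_ids are hypothetical names, and the two preview vectors are just the six IDs printed above, so an ID that looks version-specific here may not be in the full sets. In practice you would pass the full wg44v1 and wg44v2 columns.

```r
# Six-ID previews copied from the head() output above (toy data only).
v1_preview <- c("A_24_P179339", "A_24_P42453", "A_24_P179336",
                "A_23_P331028", "A_24_P182122", "A_23_P431853")
v2_preview <- c("A_24_P182122", "A_23_P431853", "A_23_P402751",
                "A_33_P3410700", "A_23_P301925", "A_23_P337726")

# Hypothetical helper: count IDs found only in V1, only in V2, or in both.
count_version_hits <- function(my_ids, v1_ids, v2_ids) {
  # A mistyped column name on a data frame returns NULL; stop early if so.
  stopifnot(is.character(v1_ids), is.character(v2_ids))
  my_ids <- unique(my_ids)
  c(
    v1_only = sum(my_ids %in% setdiff(v1_ids, v2_ids)),
    v2_only = sum(my_ids %in% setdiff(v2_ids, v1_ids)),
    shared  = sum(my_ids %in% intersect(v1_ids, v2_ids))
  )
}

# Hypothetical IDs from your own array data.
my_ids <- c("A_23_P431853", "A_33_P3410700", "A_23_P301925")
hits <- count_version_hits(my_ids, v1_preview, v2_preview)
```

IDs counted under shared cannot distinguish the versions, so only the v1_only and v2_only counts are informative.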