Question

Unmatched HG-U219 microarray probe sets

7.8 years ago

arronar ▴ 290

Hello.

I'm currently trying to convert the probe sets of an HG-U219 microarray into gene symbols and Entrez IDs using the following commands:

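library(hgu219.db)  # Bioconductor annotation package that provides hgu219SYMBOL and hgu219ENTREZID
# probes: character vector of HG-U219 probe set IDs
Symbols = unlist(mget(probes, hgu219SYMBOL, ifnotfound=NA))
Entrez_IDs = unlist(mget(probes, hgu219ENTREZID, ifnotfound=NA))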

As you know, some of the probe sets cannot be matched to a specific symbol and get an NA value.

For the HG-U133 Plus 2 microarray, I gather those NA probe sets and try to match them using the gconvert tool.
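
A minimal sketch of how the NA probe sets might be gathered before passing them to another tool. The `Symbols` vector here is a hand-made stand-in for the output of the `mget()` call above, and the second probe set ID is made up for illustration.

```r
# Stand-in for the named vector returned by unlist(mget(...)):
# names are probe set IDs, values are gene symbols (NA where nothing matched).
Symbols <- c("11715100_at" = "HIST1H3G",
             "hypothetical_probe_at" = NA,   # made-up ID, illustration only
             "11715104_s_at" = "OTOP2")

# Probe set IDs that did not map to any symbol
na_probes <- names(Symbols)[is.na(Symbols)]

# Print them one per line, e.g. to paste into a web tool
writeLines(na_probes)
```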

The problem now is that neither gconvert nor biomart has an option for the HG-U219, so I don't know what to do with the unmatched probe sets.

What I have tried so far is to choose AFFY_HG_U133_PLUS_2 as the target database in gconvert, which returns gene symbols for some of the HG-U219 probe sets that came back as NA.

Is it reasonable to use the symbols obtained with AFFY_HG_U133_PLUS_2 as the target database, or is it better not to take them into account at all?

Thank you.

microarray probesets annotation • 2.9k views

updated 7.8 years ago by Kevin Blighe 89k • written 7.8 years ago by arronar ▴ 290

score 4 · Accepted Answer · 2018-01-05

My honest recommendation, as someone who has analysed dozens of different types of microarrays, is to use the annotation provided by the manufacturer (in this case, Affymetrix) and to do the annotation conversion manually. The programs that attempt automated conversion between annotations are frequently out of date and cumbersome to use, as you have found. In my experience, the manufacturer's annotation is the most comprehensive and the most up to date.

The exact file that you need for the HG-U219 is here: http://www.affymetrix.com/support/technical/byproduct.affx?product=HG-U219

Look for 'HG-U219 Annotations, CSV format, Release 36 (38 MB, 4/13/16)' - you may need to register in order to download it.

----------------------

The CSV file extracted from the ZIP is large, but it loads into R, where you can easily do the mappings using match() or which(). These annotation files begin with header lines that start with a hash (#), like this:

##For information about the Annotation file content
#%create_date=2016-03-30 GMT-08:00 16:43:06
#%chip_type=HG-U219
#%genome-species=Homo sapiens
#%genome-version=hg19
#%genome-version-ucsc=hg19
#%genome-version-ncbi=GRCh37
#%genome-version-create_date=2009-02-00
#%ensembl-date=2015-11-11
#%ensembl-version=82
...

The remainder is then 'shockingly' comprehensive. Here is just a snapshot:

Probe Set ID    UniGene ID  Gene Title              Gene Symbol Location    Ensembl
11715100_at     Hs.247813   histone cluster 1, H3g  HIST1H3G    chr6p22.2   ENSG00000273983
11715101_s_at   Hs.247813   histone cluster 1, H3g  HIST1H3G    chr6p22.2   ENSG00000273983
11715102_x_at   Hs.247813   histone cluster 1, H3g  HIST1H3G    chr6p22.2   ENSG00000273983
11715103_x_at   Hs.465643   tumor necrosis factor   TNFAIP8L1   chr19p13.3  ENSG00000185361
11715104_s_at   Hs.352515   otopetrin 2             OTOP2       chr17q25.1  ENSG00000183034
11715105_at     Hs.439154   chr17 ORF 78            C17orf78    chr17q12    ENSG00000278145
11715106_x_at   Hs.450233   CTAGE family, member 15 CTAGE15     chr7q35     ENSG00000271079
11715107_s_at   Hs.533543   coag. factor VIII       F8A1        chrXq28     ENSG00000274791
11715108_x_at   Hs.722466   linc RNA 1098           LINC01098   chr4q34.3   ENSG00000231171
11715109_at     Hs.439922   sterile ... cont. 7     SAMD7       chr3q26.2   ENSG00000187033
11715110_at     Hs.574574   arrestin domain cont. 5 ARRDC5      chr19p13.3  ENSG00000205784
11715111_s_at   Hs.172944   chorionic gonado., beta CGB         chr19q13.32 ENSG00000104818
11715112_at     Hs.531182   glutamate rich 3        ERICH3      chr1p31.1   ENSG00000178965
11715113_x_at   Hs.567527   fam 86, member C1       FAM86C1     chr11q13.4  ENSG00000158483
11715114_x_at   Hs.567527   fam 86, member C1       FAM86C1     chr11q13.4  ENSG00000158483

...
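
The loading and mapping step could look roughly like the sketch below. To keep it runnable without the real download, it writes a tiny stand-in file (a few header lines and three rows copied from the snapshot above) to a temporary path; with the real annotation you would point `read.csv()` at the extracted CSV instead. It assumes the file is comma-separated with quoted fields and that `comment.char = "#"` is enough to skip the header lines. The last probe ID is made up.

```r
# Tiny stand-in for the annotation CSV (illustration only).
csv_lines <- c(
  "##For information about the Annotation file content",
  "#%chip_type=HG-U219",
  "#%genome-version=hg19",
  '"Probe Set ID","UniGene ID","Gene Symbol","Ensembl"',
  '"11715100_at","Hs.247813","HIST1H3G","ENSG00000273983"',
  '"11715103_x_at","Hs.465643","TNFAIP8L1","ENSG00000185361"',
  '"11715104_s_at","Hs.352515","OTOP2","ENSG00000183034"'
)
annot_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, annot_file)

# comment.char = "#" skips the header lines;
# check.names = FALSE keeps column names such as "Probe Set ID" unchanged.
annot <- read.csv(annot_file, comment.char = "#",
                  check.names = FALSE, stringsAsFactors = FALSE)

# Probe set IDs from your own data; the last one is not in the annotation.
probes <- c("11715104_s_at", "11715100_at", "hypothetical_probe_at")

# match() returns the annotation row for each probe, or NA if there is none.
idx <- match(probes, annot[["Probe Set ID"]])
symbols <- annot[["Gene Symbol"]][idx]
names(symbols) <- probes
print(symbols)
```

Using `match()` keeps the result in the same order as `probes`, so it can be attached directly to an expression matrix whose rows follow that order.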

Here is a list of all columns in the annotation:

Probe Set ID
GeneChip Array
Species Scientific Name
Annotation Date
Sequence Type
Sequence Source
Transcript ID(Array Design)
Target Description
Representative Public ID
Archival UniGene Cluster
UniGene ID
Genome Version
Alignments
Gene Title
Gene Symbol
Chromosomal Location
Unigene Cluster Type
Ensembl
Entrez Gene
SwissProt
EC
OMIM
RefSeq Protein ID
RefSeq Transcript ID
FlyBase
AGI
WormBase
MGI Name
RGD Name
SGD accession number
Gene Ontology Biological Process
Gene Ontology Cellular Component
Gene Ontology Molecular Function
Pathway
InterPro
Trans Membrane
QTL
Annotation Description
Annotation Transcript Cluster
Transcript Assignments
Annotation Notes

Some probe sets still won't have a gene symbol (on this array there seem to be ~400, with 100 being control probes), but you can impute these manually with values from another column (e.g. Representative Public ID), preferably within R rather than in Excel, Excel for Mac, Libre/Open Office, or some other spreadsheet tool.
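
One way to do that fallback in R is sketched below. It assumes that a missing gene symbol shows up as `---`, an empty string, or NA in the annotation (check your own copy of the file). The data frame is made up: the second probe set ID and both accessions are placeholders, not real annotation values.

```r
# Made-up miniature of the annotation table (placeholders, not real values).
annot <- data.frame(
  "Probe Set ID" = c("11715100_at", "hypothetical_probe_at"),
  "Representative Public ID" = c("PLACEHOLDER_ACC_1", "PLACEHOLDER_ACC_2"),
  "Gene Symbol" = c("HIST1H3G", "---"),
  check.names = FALSE, stringsAsFactors = FALSE
)

symbol <- annot[["Gene Symbol"]]

# Flag rows without a usable gene symbol (assumed markers: NA, "---", "").
no_symbol <- is.na(symbol) | symbol == "---" | symbol == ""

# Fall back to the Representative Public ID where the symbol is missing.
annot$Label <- ifelse(no_symbol, annot[["Representative Public ID"]], symbol)
print(annot[, c("Probe Set ID", "Label")])
```

Keeping the fallback in a separate `Label` column leaves the original Gene Symbol column intact, so it stays clear which entries came from the annotation and which were filled in.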

Kevin