Question

Extracting gene symbols from gene assignments in exon array data

5.5 years ago

Kim ▴ 20

Hello everyone

I'm working on gene expression data from a human exon array. I want a column of gene symbols, but the only column that carries this information is "gene assignment", and its entries look like this:

NM_001156474 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// NM_021827 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000445632 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000354755 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// BC126412 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000278487 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494

I would like to extract gene symbols from this (CCDC81 in this case). Does anyone know how I can do that in R?

Thank you very much

gene symbol gene assignment microarray exon • 1.4k views

Updated 5.5 years ago by Pierre Lindenbaum 165k • written 5.5 years ago by Kim ▴ 20

Have you tried the strsplit function in R?

5.5 years ago by Russ ▴ 520

Yes, I'm trying to use strsplit, but it only works on character vectors and the "gene assignment" column is a factor, so it is not straightforward.

5.5 years ago by Kim ▴ 20

It's hard to offer help when the problem isn't fully described in the original question. The following works for me; could it be adapted to your data?

    > a <- as.factor("NM_001156474 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// NM_021827 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000445632 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000354755 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// BC126412 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000278487 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494")
    > strsplit(as.character(a), " // ")[[1]][2]
    [1] "CCDC81"
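The same idea can be applied to a whole column at once. The following is only a minimal sketch, not something from the thread: `extract_symbols` is a hypothetical helper name, the `;` separator for entries that list more than one gene is an arbitrary choice, and the `"---"` entry is an assumed placeholder for a probeset without an assignment. It splits each entry into its `///`-separated records, takes the second `//`-separated field of each record, and keeps the unique symbols.

```r
# Hypothetical helper (not from the thread): unique gene symbols per entry.
extract_symbols <- function(x) {
  x <- as.character(x)  # a factor column must be coerced before strsplit()
  # One element per " /// "-separated transcript record
  records <- strsplit(x, " /// ", fixed = TRUE)
  vapply(records, function(rec) {
    # Fields within a record: accession, symbol, description, band, gene id
    fields <- strsplit(rec, " // ", fixed = TRUE)
    symbols <- vapply(fields, function(f) f[2], character(1))
    symbols <- unique(symbols[!is.na(symbols)])
    # Entries without a second field (e.g. the assumed "---") become NA
    if (length(symbols) == 0) NA_character_ else paste(symbols, collapse = ";")
  }, character(1))
}

# Made-up two-row example: one entry shaped like the question's data,
# and one assumed "no assignment" placeholder.
assignments <- factor(c(
  "NM_001156474 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// NM_021827 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494",
  "---"
))
gene_symbols <- extract_symbols(assignments)
print(gene_symbols)
```

With a data frame, the result could be stored as a new column, for example `full_table$Gene_symbol <- extract_symbols(full_table$gene_assignment)`, assuming the column is named as in the thread.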

5.5 years ago by Russ ▴ 520

Hi Russ

I tried this command and it works. Thank you :)

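    parts <- strsplit(as.character(full_table$gene_assignment), " // ")  # coerce the factor and split only once
    Gene_symbol <- sapply(parts, `[`, 2)  # the second field of each row is the gene symbol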

5.5 years ago by Kim ▴ 20

You can avoid the confusion caused by factors by reading the file with read.table(..., stringsAsFactors = FALSE) or with data.table's fread (stringsAsFactors = FALSE by default). In case you hear otherwise: overriding R's default to FALSE globally for every session will only cause you pain in the future, but setting it when reading files in is fine.

EDIT: if this doesn't work for you since you're talking about another data type, try coercing to a character vector first.
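As a rough illustration of both suggestions, here is a small sketch that reads a made-up one-row table (the `probeset_id` column and its contents are invented for the example) with and without `stringsAsFactors`, then coerces the factor version before splitting. The option is set explicitly in both calls because the default may differ between R versions.

```r
# Made-up tab-separated table standing in for the annotation file.
txt <- paste0(
  "probeset_id\tgene_assignment\n",
  "1\tNM_021827 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494\n"
)

# Same input read twice, with the option set explicitly each time.
as_factor <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                        stringsAsFactors = TRUE)
as_char <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                      stringsAsFactors = FALSE)

print(class(as_factor$gene_assignment))
print(class(as_char$gene_assignment))

# strsplit() refuses a factor, so catch the error to show the difference.
factor_result <- tryCatch(
  strsplit(as_factor$gene_assignment, " // "),
  error = function(e) "error"
)

# Coercing to character first makes the factor column usable.
symbol <- strsplit(as.character(as_factor$gene_assignment), " // ")[[1]][2]
print(symbol)
```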

5.5 years ago by Brice Sarver ★ 3.8k