Extracting gene symbols from gene assignments in exon array data
5.3 years ago
Kim ▴ 20

Hello everyone

I'm working on gene expression data from a human exon array. I want a column of gene symbols, but the only column that carries this information is "gene assignment", and its entries look like this:

NM_001156474 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// NM_021827 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000445632 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000354755 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// BC126412 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000278487 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494

I would like to extract gene symbols from this (CCDC81 in this case). Does anyone know how I can do that in R?

Thank you very much

gene symbol gene assignment microarray exon • 1.3k views

Have you tried the strsplit function in R?


Yes, I'm trying to use strsplit, but that function expects a character vector and the "gene assignment" column is a factor, so it isn't straightforward.


It's hard to offer help when the problem is not completely described in the original question. The following works for me; could it be adapted to your data?

    > a <- as.factor("NM_001156474 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// NM_021827 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000445632 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000354755 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// BC126412 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000278487 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494")
    > strsplit(as.character(a), " // ")[[1]][2]
    [1] "CCDC81"
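
The call above returns only the symbol from the first transcript record. If a probeset can map to more than one gene, it may be worth collecting every distinct symbol instead. Here is a minimal sketch of that idea. The helper name `extract_symbols`, the ";" separator, and the assumption that unassigned probesets are marked "---" are mine, not from the original data, so adjust them to your file.

```r
# Hypothetical helper: returns the unique gene symbols found in each
# "gene assignment" string, joined with ";" when there are several.
extract_symbols <- function(x) {
  x <- as.character(x)  # strsplit() rejects factors, so coerce first
  vapply(strsplit(x, " /// ", fixed = TRUE), function(records) {
    # Within each record, the symbol is the second " // "-separated field
    fields <- strsplit(records, " // ", fixed = TRUE)
    symbols <- vapply(fields, function(f) f[2], character(1))
    symbols <- unique(symbols[!is.na(symbols)])
    if (length(symbols) == 0) NA_character_ else paste(symbols, collapse = ";")
  }, character(1))
}

# Toy input: one real-looking entry and one assumed "no assignment" marker
assignments <- factor(c(
  "NM_001156474 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// NM_021827 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494",
  "---"
))
extract_symbols(assignments)
```

On a real table this could be used as `full_table$gene_symbol <- extract_symbols(full_table$gene_assignment)`, assuming the column is named as in this thread.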

Hi Russ

I tried this command and it works. Thank you :)

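    parts <- strsplit(as.character(full_table$gene_assignment), " // ", fixed = TRUE)  # split once, factor coerced
    Gene_symbol <- vapply(parts, function(p) p[2], character(1))  # second field of each row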


You can avoid the confusion caused by factors by reading the file with read.table(..., stringsAsFactors = FALSE) or with data.table's fread (where stringsAsFactors = FALSE is the default). In case you hear otherwise: overriding R's default to FALSE globally for every session will only cause you pain in the future, but setting it when reading a file in is fine.

EDIT: if this doesn't work for you because your column is some other data type, try coercing it to a character vector with as.character() first.
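
To make the difference concrete, here is a small sketch that reads the same text both ways. The inline `txt` string and the column names `probeset_id` and `gene_assignment` are made up to stand in for a real annotation file; only `read.table`, `strsplit` and `as.character` are actual base R functions.

```r
# Hypothetical two-column, tab-separated stand-in for the annotation file
txt <- paste0(
  "probeset_id\tgene_assignment\n",
  "2345\tNM_021827 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494\n"
)

# Same input, read with and without conversion of strings to factors
as_char <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                      stringsAsFactors = FALSE)
as_fact <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                      stringsAsFactors = TRUE)

class(as_char$gene_assignment)  # "character"
class(as_fact$gene_assignment)  # "factor"

# strsplit() stops with an error on a factor; wrap in try() to show it
failed <- inherits(try(strsplit(as_fact$gene_assignment, " // "), silent = TRUE),
                   "try-error")

# Coercing to character first makes the same call work
symbol <- strsplit(as.character(as_fact$gene_assignment), " // ", fixed = TRUE)[[1]][2]
```

If the table is already loaded with factor columns, coercing the one column as in the last line is usually enough; there is no need to re-read the file.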
