Question

Extracting ensembl gene id from messy data frames

4.7 years ago

rodolfo.peacewalker ▴ 390

Hi everyone!

I have an issue extracting Ensembl gene IDs from a messy data frame. First, I loaded the CSV file into R (the file was not actually comma-separated), and it looks like this:

> my_csv_file
               ensembl_gene_id.entrezgene_id.hgnc_symbol.gene_biotype
1                           1 ENSG00000174365 128439 SNHG11 lncRNA
2 2 ENSG00000180385 NA EMC3-AS1 transcribed_unprocessed_pseudogene
3                                     3 ENSG00000183562 NA  lncRNA
4  4 ENSG00000205266 NA KRT17P5 transcribed_unprocessed_pseudogene
5                            5 ENSG00000206585 26864 RNVU1-7 snRNA
6                              6 ENSG00000206588 NA RNU1-28P snRNA

Then I tried to extract the Ensembl gene ID from each row using the sub() function. For example, for row 1:

> sub("^\\d", "", my_csv_file[1, ])
[1] " ENSG00000174365 128439 SNHG11 lncRNA"

However, I'm stuck because I don't know how to use regular expressions to remove the characters after the Ensembl ID, or how to put this inside a for loop.

I appreciate your help.

Best regards.

R RNA-Seq ChIP-Seq • 1.1k views

score 1 · Answer 1 · 2020-12-18

4.7 years ago

ATpoint 89k

So the question with this example would be how to keep only ENSG00000174365 when there are whitespaces all over the place?

foo <- "  ENSG00000174365 128439 SNHG11 lncRNA"
gsub("\\ .*", "", trimws(x = foo, which = "left"))
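
The same idea can be applied to the whole column at once, so no for loop is needed. The sketch below assumes the rows look like the printed data frame in the question, where each string starts with a row number rather than the ID, so it matches the ID pattern directly instead of trimming. The data frame here is a hypothetical stand-in built from three of the question's rows, and the pattern assumes human gene IDs of the form ENSG followed by 11 digits.

```r
# Hypothetical one-column data frame mimicking the structure in the question
my_csv_file <- data.frame(
  ensembl_gene_id.entrezgene_id.hgnc_symbol.gene_biotype = c(
    "1 ENSG00000174365 128439 SNHG11 lncRNA",
    "2 ENSG00000180385 NA EMC3-AS1 transcribed_unprocessed_pseudogene",
    "3 ENSG00000183562 NA  lncRNA"
  ),
  stringsAsFactors = FALSE
)

# Pull the single column out as a character vector
x <- my_csv_file[[1]]

# regexpr() finds the first match in each string; regmatches() extracts it.
# Assumption: every ID is "ENSG" followed by exactly 11 digits.
ensembl_ids <- regmatches(x, regexpr("ENSG[0-9]{11}", x))

print(ensembl_ids)
```

Note that regmatches() silently drops rows without a match, so it is worth comparing length(ensembl_ids) with nrow(my_csv_file) before relying on the result.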

Please give a reproducible example using dput().
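
As a rough illustration of what that means, dput() prints R code that recreates an object exactly, so others can paste it into their own session. The data frame below is a made-up stand-in (the name example_df and its column raw are hypothetical), not the original data.

```r
# Hypothetical small data frame standing in for the real one
example_df <- data.frame(
  raw = c(
    "1 ENSG00000174365 128439 SNHG11 lncRNA",
    "3 ENSG00000183562 NA  lncRNA"
  ),
  stringsAsFactors = FALSE
)

# Print a text representation of the first rows that can be pasted into a post
dput(head(example_df))
```

Running dput() on head() of the real object keeps the post short while preserving the exact whitespace and column types.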

Exactly, I want to keep the Ensembl IDs from the original df.

ADD REPLY • link 4.7 years ago by rodolfo.peacewalker ▴ 390

score 1 · Answer 2 · 2020-12-18

4.7 years ago

Ram 45k

Your "csv" file is space-separated. It might be easiest to just re-import it with sep=" ".
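
A minimal sketch of that re-import, assuming the file has a four-name header line followed by rows that begin with a row number (which is how the printed data frame looks). The file contents below are reconstructed from the question and passed in through the text argument; with a real file you would give its path instead.

```r
# Assumed file contents, reconstructed from the printed data frame in the question
file_lines <- c(
  "ensembl_gene_id entrezgene_id hgnc_symbol gene_biotype",
  "1 ENSG00000174365 128439 SNHG11 lncRNA",
  "2 ENSG00000180385 NA EMC3-AS1 transcribed_unprocessed_pseudogene",
  "3 ENSG00000183562 NA  lncRNA"
)

# sep = " " treats every single space as a delimiter, so the double space in
# row 3 becomes an empty hgnc_symbol instead of shifting the columns.
# Because the header has one field fewer than the data rows, read.table()
# uses the first field of each row as the row name.
df <- read.table(text = file_lines, sep = " ", header = TRUE,
                 stringsAsFactors = FALSE)

# The IDs are now an ordinary column
ensembl_ids <- df$ensembl_gene_id
print(ensembl_ids)
```

Once the columns are parsed properly, no regular expression is needed to get the IDs.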