Hi everyone!
I'm having trouble extracting Ensembl gene IDs from a messy data frame. First, I loaded a CSV file into R (the file was not actually comma-separated), so each row ended up in a single column. It looks like this:
> my_csv_file
ensembl_gene_id.entrezgene_id.hgnc_symbol.gene_biotype
1 1 ENSG00000174365 128439 SNHG11 lncRNA
2 2 ENSG00000180385 NA EMC3-AS1 transcribed_unprocessed_pseudogene
3 3 ENSG00000183562 NA lncRNA
4 4 ENSG00000205266 NA KRT17P5 transcribed_unprocessed_pseudogene
5 5 ENSG00000206585 26864 RNVU1-7 snRNA
6 6 ENSG00000206588 NA RNU1-28P snRNA
Then, I tried to extract the Ensembl gene ID from each row using the sub function. For example, for row number 1:
> sub("^\\d", "", my_csv_file[1, ])
[1] " ENSG00000174365 128439 SNHG11 lncRNA"
However, I'm stuck because I don't know how to use regular expressions to remove the characters that come after the Ensembl ID, or how to then put this inside a for loop.
I appreciate your help.
Best regards.
Exactly, I want to keep the Ensembl IDs from the original df.
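For reference, here is a minimal sketch of one way this might be done without a for loop, since regexpr and regmatches in base R work on a whole character vector at once. It assumes every Ensembl ID has the form "ENSG" followed by digits, and that the data frame has a single column holding each row as one string. The small my_csv_file built here is only a hypothetical stand-in for the real one, and the names rows and ensembl_ids are made up for illustration.

```r
# Hypothetical stand-in for the single-column data frame shown in the question
my_csv_file <- data.frame(
  ensembl_gene_id.entrezgene_id.hgnc_symbol.gene_biotype = c(
    "1 ENSG00000174365 128439 SNHG11 lncRNA",
    "2 ENSG00000180385 NA EMC3-AS1 transcribed_unprocessed_pseudogene",
    "3 ENSG00000183562 NA  lncRNA"
  ),
  stringsAsFactors = FALSE
)

# Pull the only column out as a character vector
rows <- as.character(my_csv_file[[1]])

# regexpr() locates the first match in each element;
# regmatches() returns the matched text itself.
# Assumption: IDs look like "ENSG" followed by one or more digits.
ensembl_ids <- regmatches(rows, regexpr("ENSG[0-9]+", rows))

ensembl_ids
```

Note that regmatches drops elements with no match, so if some rows lack an Ensembl ID the result will be shorter than the number of rows.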