Create a new column by a partial string pattern
1
0
Entering edit mode
2.6 years ago
Grace • 0

Hi guys,

I am quite new to R and currently doing RNA-seq and cleaning my data frame. This post might be a little long but I would really appreciate it if someone with more experience could help me with this.

I am working on a non-model organism and got my expression data with Ballgown already. Because it is a non-model organism, many genes don't have proper namings, in the gff annotation files, all of the gene IDs start with "gene=LOCxxxxx (xxxx are numetic numbers)", and the "product=" represents the function/product of the gene.

e.g.,

Column1                                     Column2                       Column3
gene=LOC110373733                partial=true                    product=myotubularin-related protein 2
gbkey=CDS                                gene=LOC110373733   partial=true

I am trying to clean up the gff file containing only two columns that has "Column A: contains only the LOC=xxxxx" and "Column B: contains the producut=" information, so I can ask R to cross compare the LOCxxxx ID from the cleaned-up annotation file with the expression data and display the product, to give me an idea of what sort of genes are differentially expressed.

As the example displayed here, after cleaning up a bit it is still currently messy (information displayed in different columns). The approaches I am trying in order to clean up the data that I have is

  1. use mutate() and grep() to first extract out the LOCxxxx into new columns from column1 and columns2 and merge them together, then do the same for product=

I have tried to run

mutate(df, column2 = grep("LOC.*", df$info4, value = TRUE)

but it returned an error message "x column2 must be size 100 or 1, not 53." I think it is because some of the rows do not actually have the LOC in there (i.e., gbkey=CDS and partial=true) hence it neglected these and just created a string.

Is there a way I can make the rows that do not contain the LOC blank or NA so I can remove it after this step and merge two columns later?

I have been cleaning this data frame for a while and googling, but none of the answers really fit this purpose. So I am trying to break it down into small steps and cleaning it up slowly.

Thanks again!

Cheers, Grace

Edit:

I realised the format got a bit messed up. Let me put it here...

enter image description here

  mutate(df, columnA = grep("LOC.*", df$Column1, value = TRUE)
RNA-seq R • 2.7k views
ADD COMMENT
1
Entering edit mode

Post is confusing and TL;DR. Please post expected output.

ADD REPLY
1
Entering edit mode

OP, from what I gather, you seem to be on the right track with grep and mutate to manipulate the values, but cpad0112 is correct, your question is not clear. If you can create a df with the values that you currently have (and copy the code to create it here), and then show what you want as output, then you will get a better response.

ADD REPLY
0
Entering edit mode

Thank you, and sorry for the confusion, I have now added the dataset and the expected output.

ADD REPLY
0
Entering edit mode
2.6 years ago
Grace • 0

Hi,

I am really sorry about the confusion caused here, I tried to create a subset of the dataset hopefully to make the question more clearer. The subset is shown here as

Column1         Column2         Column3
gene=LOC11  product=A   partial=true
gene=LOC11  product=A   partial=true
gene=LOC22  product=B   partial=true
gene=LOC33  product=C   partial=true
gbkey=CDS   gene=LOC44  product=D
gbkey=CDS   gene=LOC55  product=E
gbkey=CDS   gene=LOC66  product=F

The final columns I want are

ColumnA ColumnB
gene=LOC11  product=A
gene=LOC11  product=A
gene=LOC22  product=B
gene=LOC33  product=C
gene=LOC44  product=D
gene=LOC55  product=E
gene=LOC66  product=F

I wanted to try use mutate() and grep() to create new columns and then later merge them together kind of like this

Column new1 Column new2
gene=LOC11  NA
gene=LOC11  NA
gene=LOC22  NA
gene=LOC33  NA
NA  gene=LOC44
NA  gene=LOC55
NA  gene=LOC66

Then merge these column new 1 and new 2 to make ColumnA, and do the same thing for ColumnB. I have tried

df %>% mutate(ColumnA = grep("LOC.*", df$Column1, value = TRUE))

and gives me an error message of "x ColumnA must be size 100 or 1, not 53." (100 is the size of my dataframe..)

Hopefully this makes it a bit clearer...

Cheers, Grae

ADD COMMENT
0
Entering edit mode

I think this is xy problem. With example data above and expected output, try the following code:

> df
     Column1    Column2      Column3
1 gene=LOC11  product=A partial=true
2 gene=LOC11  product=A partial=true
3 gene=LOC22  product=B partial=true
4 gene=LOC33  product=C partial=true
5  gbkey=CDS gene=LOC44    product=D
6  gbkey=CDS gene=LOC55    product=E
7  gbkey=CDS gene=LOC66    product=F
> library(dplyr)

If you want old columns to be edited:

> df %>%
+   select(1,2) %>% 
+   mutate(across(.cols = everything(), ~ifelse(grepl("gene",.), .,"NA")))
     Column1    Column2
1 gene=LOC11         NA
2 gene=LOC11         NA
3 gene=LOC22         NA
4 gene=LOC33         NA
5         NA gene=LOC44
6         NA gene=LOC55
7         NA gene=LOC66

if you want new columns to be created:

> df %>%
+   select(1,2) %>% 
+   mutate(across(.cols = everything(), 
+                 ~ifelse(grepl("gene",.), .,"NA"), 
+                 .names = "new_{col}" ))
     Column1    Column2 new_Column1 new_Column2
1 gene=LOC11  product=A  gene=LOC11          NA
2 gene=LOC11  product=A  gene=LOC11          NA
3 gene=LOC22  product=B  gene=LOC22          NA
4 gene=LOC33  product=C  gene=LOC33          NA
5  gbkey=CDS gene=LOC44          NA  gene=LOC44
6  gbkey=CDS gene=LOC55          NA  gene=LOC55
7  gbkey=CDS gene=LOC66          NA  gene=LOC66
ADD REPLY
0
Entering edit mode

Thanks I have tried this approach and it worked! I will look up the across()argument and allow the code to make sense to me :).

ADD REPLY
0
Entering edit mode

In your starting file, does gene=LOCXX always appear in either column 1 or column 2? How I would approach it is to split your dataframe into two (or more) depending on which column gene=LOCXX appears. You can then limit each dataframe to the two columns you want then rbind them back into a single dataframe.

df <- tibble(col1 = c("gene=LOC11","gene=LOC22","gene=LOC33","gbkey","gbkey","gbkey"),
             col2 = c("A","B","C","gene=LOC44","gene=LOC55","gene=LOC66"),
             col3 = c("partial","partial","partial","D","E","F"))

df1 <- df[grep("LOC",df$col1),] %>% select(1,2)
df2 <- df[grep("LOC",df$col2),] %>% select(2,3)
colnames(df2) <- colnames(df1)
rbind(df1, df2)

A tibble:6 x 2
col1
<chr>
col2
<chr>
gene=LOC11  A           
gene=LOC22  B           
gene=LOC33  C           
gene=LOC44  D           
gene=LOC55  E           
gene=LOC66  F

Just for your knowledge, the reason why your code is failing is because of this line:

mutate(df, columnA = grep("LOC.*", df$Column1, value = TRUE)

You are asking to add a column to your dataframe that is smaller than your dataframe and mutate doesn't know how to handle that. I think you're trying to make a column with boolean values. If that's the case, I'd look into the case_when function:

df %>% mutate(ColA = case_when(str_detect(col1, "LOC") == TRUE ~ 1,
                               str_detect(col1, "LOC") == FALSE ~ 0))

This can be modified to add the "gene=LOCXX" as that column like this:

df %>% mutate(ColA = case_when(str_detect(col1, "LOC") == TRUE ~ col1,
                               str_detect(col2, "LOC") == TRUE ~ col2))

col1
<chr>
col2
<chr>
col3
<chr>
ColA
<chr>
gene=LOC11  A   partial gene=LOC11  
gene=LOC22  B   partial gene=LOC22  
gene=LOC33  C   partial gene=LOC33  
gbkey   gene=LOC44  D   gene=LOC44  
gbkey   gene=LOC55  E   gene=LOC55  
gbkey   gene=LOC66  F   gene=LOC66
ADD REPLY
0
Entering edit mode

I like your approach! It actually has arguments that I understand...I am quite new to R, if the code has too many regular expressions then I sort of need a long time to figure out what it means! And yes it worked!

ADD REPLY

Login before adding your answer.

Traffic: 1963 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6