Hi guys,
I am quite new to R. I am currently working on an RNA-seq analysis and trying to clean up a data frame. This post might be a little long, but I would really appreciate it if someone with more experience could help me with this.
I am working on a non-model organism and have already obtained my expression data with Ballgown. Because it is a non-model organism, many genes do not have proper names. In the GFF annotation file, every gene ID has the form "gene=LOCxxxxx" (where xxxxx is a number), and the "product=" field gives the function/product of the gene.
e.g.,
Column1            Column2            Column3
gene=LOC110373733  partial=true       product=myotubularin-related protein 2
gbkey=CDS          gene=LOC110373733  partial=true
I am trying to reduce the GFF file to just two columns: "Column A", containing only the LOCxxxxx ID, and "Column B", containing the "product=" information. That way I can ask R to match the LOCxxxxx IDs in the cleaned-up annotation file against the expression data and display the product, which would give me an idea of what sort of genes are differentially expressed.
As the example above shows, even after some cleaning the data is still messy: the same kind of information appears in different columns. The approach I am trying is:
- use mutate() and grep() to first extract the LOCxxxxx IDs from Column1 and Column2 into new columns and merge them together, then do the same for product=
I have tried to run
mutate(df, column2 = grep("LOC.*", df$info4, value = TRUE))
but it returned the error message "x column2 must be size 100 or 1, not 53."
I think it is because some of the rows do not actually contain a LOC ID (e.g., gbkey=CDS and partial=true), so grep() with value = TRUE dropped them and returned only the 53 matching strings instead of one value per row.
Is there a way to make the rows that do not contain a LOC ID blank or NA, so that I can remove them after this step and merge the two columns later?
I have been cleaning this data frame for a while and googling, but none of the answers I found really fit this purpose. So I am trying to break it down into small steps and clean it up slowly.
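To make the goal concrete, here is a minimal sketch in base R of the kind of result being asked for: one value per row, with NA where no LOC ID is present. The toy data frame is modelled on the two example rows above, and extract_or_na() is a hypothetical helper written for this illustration, not a function from any package.

```r
# Toy data modelled on the two example rows; the real data frame will differ.
df <- data.frame(
  Column1 = c("gene=LOC110373733", "gbkey=CDS"),
  Column2 = c("partial=true", "gene=LOC110373733"),
  Column3 = c("product=myotubularin-related protein 2", "partial=true"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first match of `pattern` in each element,
# or NA where there is no match, so the result always has one value per row.
extract_or_na <- function(x, pattern) {
  out <- rep(NA_character_, length(x))
  hit <- grepl(pattern, x)
  out[hit] <- regmatches(x[hit], regexpr(pattern, x[hit]))
  out
}

loc1 <- extract_or_na(df$Column1, "LOC[0-9]+")
loc2 <- extract_or_na(df$Column2, "LOC[0-9]+")

# Take the ID from Column1 where present, otherwise fall back to Column2.
df$gene_id <- ifelse(is.na(loc1), loc2, loc1)
```

Because the helper returns a vector as long as its input, the same call could also sit inside mutate(). If the stringr package is available, str_extract() behaves similarly (NA for non-matching elements), and dplyr's coalesce() could replace the ifelse() step.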
Thanks again!
Cheers, Grace
Edit:
I realised the format got a bit messed up. Let me put it here...
mutate(df, columnA = grep("LOC.*", df$Column1, value = TRUE))
Post is confusing and TL;DR. Please post expected output.
OP, from what I gather, you seem to be on the right track with grep and mutate to manipulate the values, but cpad0112 is correct: your question is not clear. If you can create a df with the values that you currently have (and copy the code to create it here), and then show what you want as output, you will get a better response.
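As a rough sketch of what such a reproducible example could look like (the values are copied from the two rows shown in the question, and the column names gene_id and product in the expected output are only placeholders):

```r
# Small input data frame built by hand from the rows shown in the question.
df <- data.frame(
  Column1 = c("gene=LOC110373733", "gbkey=CDS"),
  Column2 = c("partial=true", "gene=LOC110373733"),
  Column3 = c("product=myotubularin-related protein 2", "partial=true"),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates an object; running it on a few rows of
# the real data gives text that can be pasted straight into a post.
dput(head(df, 10))

# The desired result, written out by hand in the same way
# (placeholder column names).
expected <- data.frame(
  gene_id = "LOC110373733",
  product = "myotubularin-related protein 2",
  stringsAsFactors = FALSE
)
```

Posting both objects lets anyone run the same input and check their answer against the expected output.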
Thank you, and sorry for the confusion, I have now added the dataset and the expected output.