I have a GFF3 file which I have imported into R. I have created a column called "gene_name" that holds the name (or annotation) of each gene. I would like to group these rows based on the type of gene/protein. For example, I have many different types of transposases, which could all be grouped as "transposase" (i.e. create a new column with the value "transposase" for each respective row). Another example would be to group all virulence genes into a group called "virulence". Since I can have several thousand different gene names, this is difficult to do manually in R. Is there a tool or function that can do this automatically?
Example data (very simplified with only two categories; the original data may have 100+ different groups and 1000+ genes):
gene_name                        gene_group
IS3 family transposase IS629     transposase
IS3 family transposase ISSen4    transposase
IS3 family transposase IS2       transposase
Aerobactin synthase              virulence
Ferric aerobactin receptor       virulence
I appreciate any input!
The question is ambiguous: it is not clear whether the grouping information is already in the data. If it is not, the real question is how to obtain that information. If the grouping information is already in the table, then this is simply an R programming question; see, for example, the group_by function of the dplyr package, or the data.table package.
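To make the two cases concrete, here is a minimal base R sketch. It assumes the group can be guessed from keywords in gene_name, which will not hold for every gene: the rules vector, the assign_group helper, and the "other" fallback label are hypothetical names introduced for illustration, and the patterns are not a curated classification. Once the gene_group column exists, counting per group is ordinary grouping.

```r
# Example data from the question (gene_name only; gene_group is derived below).
genes <- data.frame(
  gene_name = c(
    "IS3 family transposase IS629",
    "IS3 family transposase ISSen4",
    "IS3 family transposase IS2",
    "Aerobactin synthase",
    "Ferric aerobactin receptor"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical keyword rules: name = group label, value = regular expression.
# These patterns are illustrative assumptions; real data would need a longer,
# manually checked list (or an external functional annotation).
rules <- c(
  transposase = "transposase",
  virulence   = "aerobactin"
)

# Assign each gene name to the first rule whose pattern matches it.
assign_group <- function(gene_names, rules, default = "other") {
  groups <- rep(default, length(gene_names))
  # Loop in reverse order so that earlier rules overwrite later ones
  # when a name matches several patterns.
  for (label in rev(names(rules))) {
    hit <- grepl(rules[[label]], gene_names, ignore.case = TRUE)
    groups[hit] <- label
  }
  groups
}

genes$gene_group <- assign_group(genes$gene_name, rules)

# With the grouping column in place, summarise per group.
group_counts <- table(genes$gene_group)
print(group_counts)
```

Names that match no pattern end up in "other", which gives a quick list of what still needs a rule. With dplyr, the counting step could instead be written as group_by(genes, gene_group) followed by summarise(n = n()).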