Hi
I am having difficulty splitting my taxonomy column into separate ranks, i.e. "domain", "phylum", "class", "order", "family", "genus".
The biggest problem is that the format of my taxonomy column is not uniform. Some rows have every taxonomy level, while others only have the "domain", "phylum", and "genus" levels.
My data has a few thousand rows and looks something like this:
OTUID Taxonomy
OTU1 d:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas
OTU20 d:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera
OTU774 d:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4
I'm not familiar with R, so I spent a whole day searching the Internet for relevant solutions, and I tried the separate function in the tidyr package, like this:
library(tidyr)
x <- read.csv("annotation.csv")
y <- x %>% separate(Taxonomy, c("domain", "phylum", "class", "order", "family", "genus"), ",[a-z]:")
write.csv(y,"tax_split.csv",row.names = TRUE)
But the result let me down. The values are not assigned to the correct ranks when a level is missing:
OTUID domain phylum class order family genus
OTU1 d:Bacteria "Proteobacteria" Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas
OTU20 d:Archaea "Thaumarchaeota" Nitrososphaerales Nitrososphaeraceae Nitrososphaera NA
OTU774 d:Bacteria "Armatimonadetes" Armatimonadetes_gp4 NA NA NA
In the end I had to use Excel's filtering function to deal with this, but that method is very time-consuming (╯︵╰). I still want to ask: is there an elegant way to solve this problem in R?
Thanks for your help!
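One possible approach, sketched here in base R, is to read the one-letter prefix of each field (`d:`, `p:`, ...) and use it to decide which column the value belongs in, rather than relying on the field's position. The names `ranks` and `parse_taxonomy` are illustrative, not from any package, and the sketch assumes every field has the form `prefix:value` with no commas inside a value.

```r
# Example rows copied from the question.
x <- data.frame(
  OTUID = c("OTU1", "OTU20", "OTU774"),
  Taxonomy = c(
    'd:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas',
    'd:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera',
    'd:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4'
  ),
  stringsAsFactors = FALSE
)

# Map each one-letter prefix to its rank name (assumed mapping).
ranks <- c(d = "domain", p = "phylum", c = "class",
           o = "order", f = "family", g = "genus")

# Parse one taxonomy string into a character vector with one slot per rank.
# Ranks that are absent from the string stay NA.
parse_taxonomy <- function(tax) {
  fields <- strsplit(tax, ",", fixed = TRUE)[[1]]
  keys   <- sub(":.*$", "", fields)       # text before the first colon
  values <- sub("^[^:]*:", "", fields)    # text after the first colon
  values <- gsub('"', "", values, fixed = TRUE)  # drop the double quotes
  out <- setNames(rep(NA_character_, length(ranks)), names(ranks))
  known <- keys %in% names(ranks)         # ignore unrecognised prefixes
  out[keys[known]] <- values[known]
  out
}

# vapply returns one column per row of x, so transpose to get one row per OTU.
parsed <- t(vapply(x$Taxonomy, parse_taxonomy,
                   character(length(ranks)), USE.NAMES = FALSE))
colnames(parsed) <- unname(ranks)

y <- cbind(x["OTUID"], as.data.frame(parsed, stringsAsFactors = FALSE))
print(y)
```

With the real data, `x` would come from `read.csv("annotation.csv")` as in the question, and `y` can be written out with `write.csv(y, "tax_split.csv", row.names = FALSE)`. Keying on the prefix means a row like OTU20, which has no class, gets NA in the class column instead of shifting every later value one column to the left.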
Thank you for taking the time to try to solve this problem; it helped me a lot. As you suggested, I ran the code step by step. I would like a version that doesn't discard the NA values, so I tried to modify your last line of code, like this:
x <- dcast(x[, .(OTUID, id, type)], OTUID ~ id, value.var = "type")
but R always reports an error. I think x[!is.na(id), .(OTUID, id, type)] selects the data that is passed to dcast, so I also tried:
x <- dcast(x[, c("OTUID", "id", "type")], OTUID ~ id, value.var = "type")
but it failed again. What is the right change to make? Did I misunderstand something? Thanks again~
It is good practice to specify the exact error message when you encounter an error. If the message contains sensitive information, mask it with a placeholder that makes sense, but just saying "I see an error" does not help us figure out what could be going on.