Question

How can I use lapply to extract taxonomic information for my species list?

3.6 years ago

DNAngel ▴ 250

I have a file called species.txt, which I edited so that it contains only a list of species from BLAST output: a single column of species names, each consisting of genus and species separated by a space.

My goal is to loop through that file (each line is a species) and obtain each species' taxonomic ranks using tax_name(species, get=c("superkingdom","kingdom","phylum","order","family","genus","species"), db="ncbi"). Afterwards I would pick whichever ranks I need for each species and build some tables from that information.

Is there a simple way to do this in R without using a for loop? I think lapply would work, but I can't figure out how to read in the file without writing a for loop (which I am terrible at). Some files will have thousands of species in the list, so in my opinion a for loop would not be efficient.

I have read in my file using readLines, but I'm still unsure how to turn this into a function call with lapply. Is it possible to avoid a for loop here at all?

Thanks!

R • 1.4k views

I can't quite tell all the steps of what you're trying to do, but to read in a 1 column file, and loop through the entries using lapply, you could do something like the following:

# read the file into a 1 column dataframe
species <- read.table(file="species.txt", sep="\t")

# call lapply to loop through all entries after converting the
# first column to a list, and define a custom function to carry out your procedure
lapply(as.list(species[,1]), function(x){
    # do all your stuff here. Each species name will be in a variable called: x
    tax_name(x, get=c("superkingdom","kingdom","phylum","order","family","genus","species"), db="ncbi")
    # do other stuff
})

Even though your file is NOT tab-separated, set the separator to a tab so that the space between genus and species is not treated as a column break, and each line is read as a single value in one row.
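
To see why the separator matters, here is a minimal sketch that uses a temporary file with two illustrative lines standing in for species.txt (the file contents are assumptions, not the original data). With the default separator, read.table splits on any whitespace, so genus and species land in separate columns; with sep="\t" each line stays in one column.

```r
# Temporary stand-in for species.txt (contents are illustrative only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Escherichia coli", "Bacillus subtilis"), tmp)

# Default sep = "" splits on whitespace: genus and species become two columns
by_space <- read.table(file = tmp)

# sep = "\t": the file has no tabs, so each line is read as a single value
by_tab <- read.table(file = tmp, sep = "\t")

ncol(by_space)
ncol(by_tab)
```

Comparing the two column counts should make the difference visible before running the real lookup.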

Reply posted 3.6 years ago by seidel

I tried that, but I get an error that I don't understand: "Error: sci_com must be of class character, taxon_state". I don't see why this is an issue when I pass the species from a list; if I type a name in directly instead of x, tax_name works fine. When I run str(species), it says it is a data frame whose column is a factor with 52 levels.

Reply posted 3.6 years ago by DNAngel

Never mind, passing as.character(x) to tax_name seems to have removed that error. However, nothing is being printed: it just reports each species name as "found" and goes through them all, but I don't see the lineage information, which I was hoping to save separately.

Reply posted 3.6 years ago by DNAngel

OK, so you're using a package that contains an actual function called tax_name(). Looking it up, I see that it returns a data.frame. So if you want to capture that value, assign it to a variable inside the function and do something with it: either process it further, or simply return it.

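# inside the function passed to lapply
result <- tax_name(as.character(x), get=c("superkingdom","kingdom","phylum","order","family","genus","species"), db="ncbi")
return(result)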

or

pseudocode:
  get lineage information from result
  return lineage information
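
The capture-and-return idea can be sketched as follows. So that it runs without a network connection, fake_tax_name is a hypothetical stand-in for tax_name that returns a one-row data.frame; its column names and values are placeholders for illustration, not the package's actual output.

```r
# Hypothetical stand-in for tax_name(): returns one row per query.
# Columns and values are placeholders, not real taxonomy lookups.
fake_tax_name <- function(x) {
  parts <- strsplit(x, " ", fixed = TRUE)[[1]]
  data.frame(query = x, genus = parts[1], species = x,
             stringsAsFactors = FALSE)
}

species <- c("Escherichia coli", "Bacillus subtilis")

# lapply returns a list; assign it, otherwise the results are not kept
results <- lapply(species, function(x) {
  result <- fake_tax_name(as.character(x))
  result  # the last evaluated value is what the function returns
})

# Stack the one-row data.frames into a single table
lineage <- do.call(rbind, results)

# Pull out whichever rank is needed
lineage$genus
```

With the real package, the function body would call tax_name with the get= and db= arguments from the question. tax_name also appears to accept a character vector of names, so passing the whole vector in one call may work without lapply; that is worth checking against the installed version.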

Also, in the original call to read your species.txt file, you might add an argument which prevents values from being treated as factors:

species <- read.table(file="species.txt", sep="\t", as.is=TRUE)
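
As a rough illustration of the factor issue, the sketch below reads an illustrative two-line temporary file (an assumption standing in for species.txt) three ways. Before R 4.0.0, read.table converted strings to factors by default, which is forced here with stringsAsFactors=TRUE; as.is=TRUE and readLines both give plain character values.

```r
# Temporary stand-in for species.txt (contents are illustrative only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Escherichia coli", "Bacillus subtilis"), tmp)

# Force the pre-4.0.0 default: strings are converted to factors
as_factor <- read.table(file = tmp, sep = "\t", stringsAsFactors = TRUE)
class(as_factor[, 1])

# as.is = TRUE keeps the column as character
as_char <- read.table(file = tmp, sep = "\t", as.is = TRUE)
class(as_char[, 1])

# readLines returns a character vector directly, one element per line
species_vec <- readLines(tmp)
identical(species_vec, as_char[, 1])
```

Either character form can be passed to lapply without wrapping each element in as.character.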

Reply posted 3.6 years ago by seidel

I ended up switching to Python and writing my own code to grab the lines. There were too many issues with this package when working with bacterial sequences: many of the names simply could not be found, and as far as I could tell the package doesn't search by taxid. Thanks though; I think your approach would have worked if the lookup had actually succeeded for the names I had!

Reply posted 3.6 years ago by DNAngel