Question

FASTA file with taxid, want to obtain lineage and put into header

0

Entering edit mode

6.6 years ago

mariahaguiar001 • 0

Hello all,

I have a fasta file with sequences labelled with their NCBI taxid, it looks like:

>669613 
GACTTAATTGG....
>15371
GACTTGATTGTATTGAGCCTT...

I want to get the lineage for it to obtain a header like:

>669613;tax=d:Eukaryota,p:Tracheophyta,c:Magnoliopsida,o:Polygonales,f:Polygonaceae
GACTTAATTGG....

Where d: domain, p: phylum, c: class, ect.

Is there some way to do this not manually? I have downloaded the NCBI taxonomy, the taxdump file.

FASTA taxonomy taxdump NCBI • 5.7k views

ADD COMMENT • link updated 6.6 years ago by Chris S. ▴ 340 • written 6.6 years ago by mariahaguiar001 • 0

4

Entering edit mode

6.6 years ago

Chirag Parsania ★ 2.0k

Here, I assume that you know bit of R programming and you can tweak R functions.

I wrote a R function to get NCBI kingdom names from ncbi taxonomy id. You can modify function to get various level of taxonomy and then create your own fasta header. In the given function below, you can define taxonomy level in the parent function to get the taxid of that level. For e.g. I asked for level kingdom id : 3rd argument --> (parent(id,taxdir,"kingdom",nodes)). Similarly you can ask for species, class, phylum etc. Once you the desired level taxid obtained, i use function sciname to get the taxname of given id.

#' Title : getKingdomNameFromNcbiTaxId
#' @description  : for a given vector of valid ncbi taxid retrieve corresponding kingdom id and kingdom name
#' @param inputTaxId : A vector of valid NCBI taxonomy ids 
#' @param n_threads  : A numeric value of maximumn number of threads used, default 3
#' @param ncbi_taxdir : Directory path containing ncbi taxonomy dump files
#'
#' @return : A data.frame with colanmes taxid, kingdom_id , sp_name
#' @export
#'
#' @examples

getKingdomNameFromNcbiTaxId <- function(inputTaxId , n_threads = 3 , ncbi_taxdir = "path/to/ncbi_taxonomy_files/"){

        ## load libraries 
        library("CHNOSZ")
        library("parallel")

        ## test variables 
         # inputTaxId <-  unique(mapped$staxid)
         # ncbi_taxdir = "/Users/chiragparsania/Documents/Projects/LiAng_Data/horizontal_gene_transfer/8_R/ncbi_taxonomy_files/"
         # n_threads = 6
        # 
        inputTaxId <- unique(inputTaxId)

        ## load NCBI taxonomy files 
        names <- getnames(taxdir = ncbi_taxdir)
        nodes <- getnodes(taxdir= ncbi_taxdir)

        ## parellal processing in mac 
        cl <- makeCluster(n_threads , type = "FORK")

        # iterate over each taxid 
        out <- parLapply(cl,inputTaxId, function(id){
                #id <- 2070412 ## test id 
                cat(paste("proc",id,"\n"))
                kingdom_id <- parent(id,taxdir,"kingdom",nodes)
                kingdom_name  <- sciname(kingdom_id,taxdir = taxdir,names = names)

        })
        stopCluster(cl)
        ## return object 
        final <- do.call("rbind",out)
        return(final)
}

ADD COMMENT • link 6.6 years ago by Chirag Parsania ★ 2.0k

score 2 · Accepted Answer · 2018-05-26

2

Entering edit mode

6.6 years ago

Chris S. ▴ 340

There is a rankedlineage.dmp file in the new_taxdump directory at NCBI that has these ranks.

wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz
tar -zxvf new_taxdump.tar.gz

Here's one way to format the file using R and tidyverse packages (and you'll still need to figure out how to replace the header). The dmp file is tab and pipe-separated, so skip the pipes using - in col_types.

library(tidyverse)  
tax <- read_tsv("rankedlineage.dmp", 
         col_names = c("id", "name", "s", "g", "f", "o", "c","p", "k", "d"), 
         col_types=("i-c-c-c-c-c-c-c-c-c-"))

Next, gather columns 2 to 10 into two columns with the column name and value, remove NAs, and unite into a single column.

x2 <- gather(tax, "key", "value", 2:10) %>%
         filter(!is.na(value)) %>% 
          unite(2:3, col="name", sep="=") 
x2
# A tibble: 11,697,321 x 2
        id name                                               
     <int> <chr>                                              
 1       1 name=root                                          
 2  131567 name=cellular organisms                            
 3    2157 name=Archaea                                       
 4 1935183 name=Asgard group

Finally, group by tax id to get the lineage (this groups 12 million rows, so it will take a while. I did not time it, but maybe 5-10 minutes?)

tax2 <-  group_by(x2, id) %>% 
      summarize(lineage= paste(name, collapse="; "))

filter(tax2, id %in% c(669613, 15371) )
# A tibble: 2 x 2
       id lineage
     <int> <chr>
  1  15371 name=Bromus inermis; g=Bromus; f=Poaceae; o=Poales; c=Liliopsida; p=Streptophyta; k=Viridiplantae; d=Eukaryota    
  2 669613 name=Koenigia alaskana; g=Koenigia; f=Polygonaceae; o=Caryophyllales; p=Streptophyta; k=Viridiplantae; d=Eukaryota

ADD COMMENT • link 6.6 years ago by Chris S. ▴ 340

0

Entering edit mode

Hi Chris,

Good to know this rankedlineage.dmp. It must be much faster than what I suggested.

ADD REPLY • link 6.6 years ago by Chirag Parsania ★ 2.0k

0

Entering edit mode

Thank you! This saved me heaps of time.

ADD REPLY • link 6.6 years ago by mariahaguiar001 • 0

0

Entering edit mode

Hi mariahaguiar001

I have the same challenge of replace by tax full lineage, but I have accesion id´s instead of ncbi taxid, Did you find a not manual way to replace the headers?

ADD REPLY • link 6.4 years ago by dllopezr ▴ 130

0

Entering edit mode

First you have to convert accession to NCBI tax id then you can get full lineage using following R function

#' Title
#'
#' @param ncbi_tax_ids : a vector of NCBI tax_ids
#' @param ncbi_tax_dir : directory having ncbi tax files 
#'
#' @return : a tibble having tax_id plus all lineage columns 
#' @export
#'
#' @examples

    tax_id_to_ranked_lineage <- function( ncbi_tax_ids , ncbi_tax_dir =  "dir_path/to/ncbi_taxonomy_files/"){
            library(data.table)
            library(tidyverse)
            ## get rankedlineage file index 
            ranked_lineage_file_index  <- grep("rankedlineage.dmp" , list.files(ncbi_tax_dir , full.names = T))

            ## check if the file present 
            if(length(ranked_lineage_file_index) != 1){
                    stop("rankedlineagess.dmp not found in given tax directory" )
            }

            ## get the file and load it 
            ranked_lineage_file <- list.files(ncbi_tax_dir , full.names = T)[ranked_lineage_file_index]
            ranked_lineage <- fread(ranked_lineage_file , data.table = F) %>% 
                    as_tibble() %>% 
                    select(seq(from = 1, to = 20 , by = 2)) %>% 
                    `colnames<-`(c("tax_id","tax_name","species","genus","family","order","class","phylum","kingdom","superkingdom"))

            ncbi_tax_ids <- as.numeric(ncbi_tax_ids) %>% as_tibble() %>% `colnames<-`(c("query_tax_id"))
            mapped <- ranked_lineage %>% 
                    right_join(ncbi_tax_ids , by = c("tax_id" = "query_tax_id"))

            return(mapped)
    }

ADD REPLY • link 6.4 years ago by Chirag Parsania ★ 2.0k

0

Entering edit mode

Hi Chris This is super helpful. I have .txt file with close to 1000 taxon ids How do I filter the x2 table (tibble ? sorry new to tidyverse) based on the list of taxon ids in another file? Thanks Ewelina

ADD REPLY • link 6.0 years ago by ewelinarubin ▴ 10