Question

Extracting strings from the fasta header

1

5.2 years ago

lokraj2003 ▴ 120

I want to extract the gene name, gene start position and gene stop position from the headers of a fasta file. I have tried to extract them by position, but the positions are not consistent. Is there any other way to extract them?

This is what I have tried so far.

    library(stringr)

    # I have a vector of these headers. Here it has just one element.
    names1 <- "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]"

    # Then I extracted the words from each string
    string_list1 <- str_extract_all(names1, boundary("word"))

    # Result
    string_list1[1]



            [[1]]
             [1] "lcl"                           "NC_005336.1_cds_NP_957781.1_1"
             [3] "locus_tag"                     "ORFVgORF001"                  
             [5] "db_xref"                       "GeneID"                       
             [7] "2947687"                       "protein"                      
             [9] "ORF001"                        "hypothetical"                 
            [11] "protein"                       "protein_id"                   
            [13] "NP_957781.1"                   "location"                     
            [15] "complement"                    "3162"                         
            [17] "3611"                          "gbkey"                        
            [19] "CDS"

So I was trying to extract the 4th, 16th and 17th elements from this list. That works for this particular example, but not for other headers where the positions differ. The gene name is usually at the 4th position, but the positions of the start and stop coordinates vary among the fasta headers. So this strategy is not working, and I can't think of another one.

fasta R string • 1.8k views

ADD COMMENT • link updated 5.2 years ago by Alex Nesmelov ▴ 200 • written 5.2 years ago by lokraj2003 ▴ 120

0

Split on, or focus on, the actual keys like locus_tag or location=complement, if those are consistent. This might require regular expressions.

ADD REPLY • link 5.2 years ago by curious ▴ 890

score 3 · Accepted Answer · 2020-06-16

3

5.2 years ago

Alex Nesmelov ▴ 200

If the gene name looks like [locus_tag=gene_name] and the coordinates look like [location=complement(3162..3611)], a regular expression can pull them out:

library(tidyverse)

names1 <- "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687][protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]"

(res <-  
 str_replace_all(names1,
              "^.*?locus_tag=(.*?)\\].*?\\[location.*?(\\d+)\\.\\.(\\d+).*?$",
              "\\1___\\2___\\3") %>%
 str_split("___")
)

If the headers are in a "gene_name" column of a data.frame called df, a clean final table can easily be produced:

df %>% 
mutate(gene_name =  str_replace_all(gene_name,
                                   "^.*?locus_tag=(.*?)\\].*?\\[location.*?(\\d+)\\.\\.(\\d+).*?$",
                                   "\\1___\\2___\\3")) %>% 
separate(gene_name,
         sep="___",
         into = c("gene", "start", "end"))
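One thing to watch with the replace-then-split approach: a header that lacks a [location=...] field does not match the pattern, so it is left unchanged instead of being flagged. A minimal sketch of an alternative, assuming stringr is available, is to extract each field separately with str_match(), which returns NA where nothing matches. The second header below is made up for illustration, and the field patterns are my assumption about the header layout.

```r
library(stringr)

# Example headers; the second one is made up and has no [location=...] field.
headers <- c(
  "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]",
  "lcl|NC_001111_NP_999_1 [locus_tag=Test001] [gbkey=CDS]"
)

# str_match() returns a matrix: column 1 is the full match, the remaining
# columns are the capture groups. Rows without a match are filled with NA.
gene <- str_match(headers, "\\[locus_tag=([^\\]]+)\\]")[, 2]
loc  <- str_match(headers, "\\[location=[^\\]]*?(\\d+)\\.\\.(\\d+)")

res <- data.frame(
  gene  = gene,
  start = as.integer(loc[, 2]),
  end   = as.integer(loc[, 3])
)
res
```

Because the gene name and the coordinates are matched independently, a missing location still leaves the gene name intact. This only picks up the first start..end pair, so multi-part locations such as join(...) would need extra handling.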

ADD COMMENT • link 5.2 years ago by Alex Nesmelov ▴ 200

0

Awesome. It works. Actually I have my gene names in the column of a data frame, so this is perfect. Would you mind telling me briefly what these regular expressions are doing? Thanks again for taking your time!

ADD REPLY • link 5.2 years ago by lokraj2003 ▴ 120

0

The whole string is replaced by the three values of interest, which are captured with parentheses and referred to in the replacement as \1, \2 and \3. The trick is to match the entire string, so that everything outside the captured groups is dropped. (In the R code each backslash is doubled, e.g. \\d, because the pattern is written as an R string.)

^.*?locus_tag= ------ anything from the start ^ up to and including locus_tag=. This part is matched only so that the replacement removes it.
(.*?)\] --------- anything after locus_tag= up to the next closing square bracket. This is the gene name, and it is captured by the parentheses.
.*?\[location.*?(\d+) -------- anything up to "[location", and then anything after it up to the first number, i.e. one or more digits (\d+). The number is captured as the gene start via parentheses; the other matched parts are removed.
\.\.(\d+) ------ the two dots separating the gene coordinates, followed by the gene end, which is captured the same way.
.*?$ -------- anything up to the end of the string $.
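The breakdown above can also be written directly in code. As a rough sketch, stringr's regex() accepts comments = TRUE, which makes the engine ignore whitespace and # comments inside the pattern, so each piece can sit on its own line next to its explanation. The variable names here are just for illustration.

```r
library(stringr)

header <- "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]"

# Same pattern as in the answer, split into commented pieces.
annotated <- regex("
  ^.*?locus_tag=     # everything from the start up to and including locus_tag=
  (.*?)\\]           # group 1: the gene name, up to the next closing bracket
  .*?\\[location     # skip ahead to the location field
  .*?(\\d+)          # group 2: the first run of digits, the gene start
  \\.\\.             # the two literal dots
  (\\d+)             # group 3: the gene end
  .*?$               # the rest of the string
", comments = TRUE)

# Column 1 is the full match, columns 2 to 4 are the three capture groups.
m <- str_match(header, annotated)
m[1, 2:4]
```

The three values returned are exactly what \1, \2 and \3 refer to in the replacement string.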

ADD REPLY • link 5.2 years ago by Alex Nesmelov ▴ 200

score 2 · Accepted Answer · 2020-06-16

Here is a start:

# example data
x <- c("lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]",
       "lcl|NC_001111_NP_999_1 [locus_tag=Test001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [gbkey=CDS]")

f1 <- function(x, pattern){
  lapply(strsplit(x, " "), function(i){
    grep(pattern, i, value = TRUE)
  })
}

f1(x, "locus_tag")
# [[1]]
# [1] "[locus_tag=ORFVgORF001]"
# 
# [[2]]
# [1] "[locus_tag=Test001]"
f1(x, "location")
# [[1]]
# [1] "[location=complement(3162..3611)]"
# 
# [[2]]
# character(0)
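To get from those bracketed tokens to a table, one possible continuation in base R is sketched below. get_field() is a hypothetical helper, not part of any package, and it assumes the key contains no regex metacharacters and that the values never contain a closing bracket.

```r
# Same example data as above
x <- c("lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]",
       "lcl|NC_001111_NP_999_1 [locus_tag=Test001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [gbkey=CDS]")

# Hypothetical helper: return the value of one [key=value] field per header,
# or NA when the field is absent.
get_field <- function(x, key) {
  pattern <- paste0("\\[", key, "=([^]]*)\\]")
  m <- regmatches(x, regexec(pattern, x))
  vapply(m, function(i) if (length(i) == 2) i[2] else NA_character_, character(1))
}

gene <- get_field(x, "locus_tag")
loc  <- get_field(x, "location")

# Take the first start..end pair from the location value; NA stays NA.
coord_pattern <- "^[^0-9]*([0-9]+)\\.\\.([0-9]+).*$"
res <- data.frame(
  gene  = gene,
  start = as.integer(sub(coord_pattern, "\\1", loc)),
  end   = as.integer(sub(coord_pattern, "\\2", loc))
)
res
```

Headers without a location end up with NA coordinates rather than breaking the result, so they are easy to filter afterwards.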