I want to extract the gene name, gene start position, and gene stop position from the headers of a FASTA file. I have tried extracting them by position, but the positions are not consistent across headers. Is there another way to extract them?
This is what I have tried so far.
library(stringr)
# I have a vector of these FASTA headers. Here it has just one element
names1 <- "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] [db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] [protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]"
#Then I extracted words from the string list
string_list1 <- str_extract_all(names1, boundary("word"))
#result
string_list1[1]
[[1]]
[1] "lcl" "NC_005336.1_cds_NP_957781.1_1"
[3] "locus_tag" "ORFVgORF001"
[5] "db_xref" "GeneID"
[7] "2947687" "protein"
[9] "ORF001" "hypothetical"
[11] "protein" "protein_id"
[13] "NP_957781.1" "location"
[15] "complement" "3162"
[17] "3611" "gbkey"
[19] "CDS"
So I tried extracting the 4th, 16th, and 17th elements of this list. That works for this particular example, but not for other headers where the positions differ. The gene name is usually at the 4th position, but the positions of the start and stop coordinates vary among the FASTA headers. So this strategy does not work, and I can't think of another one.
Split on, or match against, the actual keys instead of counting words, for example
locus_tag
or
location=complement
if those are consistently present. This will probably require regular expressions.
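A minimal sketch of that key-based approach, assuming every header uses the `[key=value]` layout shown in the question. `parse_header` is a hypothetical helper name, not something from stringr. It also assumes that the first and last numbers inside `location=` are the start and stop, which holds for plain ranges and `complement(...)` but may need adjusting for more complex locations such as `join(...)`.

```r
library(stringr)

# Example header copied from the question (split only for line length)
names1 <- paste0(
  "lcl|NC_005336.1_cds_NP_957781.1_1 [locus_tag=ORFVgORF001] ",
  "[db_xref=GeneID:2947687] [protein=ORF001 hypothetical protein] ",
  "[protein_id=NP_957781.1] [location=complement(3162..3611)] [gbkey=CDS]"
)

# Hypothetical helper: look fields up by key name, not by word position
parse_header <- function(headers) {
  # Value of [locus_tag=...]; NA where the key is missing
  gene <- str_match(headers, "\\[locus_tag=([^\\]]+)\\]")[, 2]
  # Raw value of [location=...], e.g. "complement(3162..3611)" or "10..250"
  loc <- str_match(headers, "\\[location=([^\\]]+)\\]")[, 2]
  # Every run of digits in the location, one character vector per header
  coords <- str_extract_all(loc, "\\d+")
  data.frame(
    gene = gene,
    # Assumption: first number is the start, last number is the stop
    start = vapply(coords, function(x) as.integer(x[1]), integer(1)),
    stop = vapply(coords, function(x) as.integer(rev(x)[1]), integer(1)),
    complement = str_detect(loc, "^complement"),
    stringsAsFactors = FALSE
  )
}

res <- parse_header(names1)
print(res)
```

Because the function is vectorised over `headers`, it can be called on the whole vector of headers at once. Headers that lack one of the keys should give `NA` in the corresponding column rather than shifting the other values.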