Part of your problem: the XML returned by the genbank() method is different to that returned by Pierre's EUtils query and does not, in fact, contain the tag sig_peptide.
If we do:
top <- xmlRoot(answer)
getNodeSet(top, "//GBSeq_feature-table")
We see that no nodes are returned:
list()
attr(,"class")
[1] "XMLNodeSet"
Whereas if we try:
feats <- getNodeSet(top, "//Seq-feat")
61 nodes are returned:
length(feats)
[1] 61
In fact, the XML returned by genbank() is the same as that obtained by visiting the NCBI web page for the sequence and sending the file to XML. Examination of that file using e.g. grep:
grep sig_peptide Downloads/sequence.xml
# no matches
shows that this XML does not contain sig_peptide.
So first part of the answer: do not use the genbank() method from library(annotate). To get the correct XML, containing sig_peptide, use:
library(RCurl)
xml <- getURL("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_000610&rettype=gb&retmode=xml")
doc <- xmlTreeParse(xml, useInternalNodes = T)
Now, we can get the features from that XML file into a list:
top <- xmlRoot(doc)
feats <- getNodeSet(top, "//GBSeq_feature-table//GBFeature")
length(feats)
# [1] 54
And then use lapply and xmlSApply to extract values for features:
values <- lapply(feats, function(x) xmlSApply(x, xmlValue))
Examination of this new list shows the sig_peptide is element 6:
names(values[[6]])
# [1] "GBFeature_key" "GBFeature_location" "GBFeature_intervals"
# [4] "GBFeature_quals"
values[[6]]["GBFeature_key"]
# GBFeature_key
# "sig_peptide"
values[[6]]["GBFeature_location"]
# GBFeature_location
# "435..494"
You can also get the feature nodes into a data frame:
df1 <- xmlToDataFrame(feats)
And grep for "sig_peptide":
df1[grep("sig_", df1$GBFeature_key), 1:2]
# GBFeature_key GBFeature_location
# 6 sig_peptide 435..494
Final hint: the R XML documentation is dreadful, so don't despair when trying to learn it. It's very difficult for the experienced, never mind beginners.
Thanks for the details and the leg work. All very helpful!
Cool; FWIW,
xmlInternalTreeParse("http://...")
works without RCurl. The xpath querynode <- getNodeSet(xml, '//GBFeature[GBFeature_key/text()="sig_peptide"]')[[1]]
selects the relevant node, andxpathApply(node, ".//GBQualifier_name")
is the syntax for querying the content of node.Thanks for the tips Martin.