How to append a data.frame in R with new data
7.8 years ago

I am trying to run the following code on a list of IDs instead of a single ID:

source("https://bioconductor.org/biocLite.R")
#install.packages('reutils')
#install.packages('Peptides')
#biocLite(pkgs = c('GenomeInfoDb','GenomicRanges'))
#install.packages('plyr')
#install.packages('devtools')
#devtools::install_github("gschofl/biofiles")
library(Peptides)
library(reutils)
library(Biostrings)
library(biofiles)
library(plyr)
library(stringr)
library(tibble)
#install.packages('data.table')
library(data.table)

#this is exactly the final data frame format I want, but for a list of UIDs instead of a single UID like 124511
fetch <- efetch(124511, db=db, rettype = 'gp', retmode = retmode, retmax = returnAmount)
rec <- gbRecord(fetch)
seq <- getSequence((ft(rec)))
m <- as.data.frame(seq)
setnames(m, "x", "sequence")
protienName <- names(seq)
m <- add_column(m, protienName, .after = 0)
m$molecularweight <- mw(m$sequence)
m$m<- str_count(m$sequence, 'm')
m$cc <- str_count(m$sequence, 'cc')
logvec <- grepl('(Protein)|(Region)', m$protienName)
m <- subset(m, logvec)

The problem is that efetch() can only use one ID at a time, so I must either write a for loop or use an apply function over the list of protein IDs. If I took the code as is and ran it over a list, each iteration would overwrite the result of the previous one. I was hoping someone could help me append to the data.frame on each iteration, or show me a way to keep each iteration from replacing the previous one.

R Genebank NCBI protiens • 3.1k views

Wow... how about that. For some reason I really thought it couldn't! Thanks, I will try just feeding it a list then.

7.8 years ago

You could have a function called fetch_id() that returns a record for a given id:

# db, retmode and returnAmount are assumed to be defined already, as in the question
fetch_id <- function(id) {
    efetch(id, db = db, rettype = 'gp', retmode = retmode, retmax = returnAmount)
}

Then, given a vector of IDs:

ids <- c(124511, 124512, 124513, ... )

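Pre-allocating a list is only needed if you fill it inside a for loop:

l <- vector(mode = "list", length = length(ids))

With lapply() the pre-allocation is unnecessary, because lapply() builds and returns the list itself. Iterate over ids to populate the list l with fetched results:

l <- lapply(ids, fetch_id)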

Once you have this list object populated, you can run lapply() on it to run a function of your choice on each element of it:

process_fetched_record <- function(fr) {
    rec <- gbRecord(fr)
    seq <- getSequence((ft(rec)))
    m <- as.data.frame(seq)
    setnames(m, "x", "sequence")
    protienName <- names(seq)
    m <- add_column(m, protienName, .after = 0)
    m$molecularweight <- mw(m$sequence)
    m$m<- str_count(m$sequence, 'm')
    m$cc <- str_count(m$sequence, 'cc')
    logvec <- grepl('(Protein)|(Region)', m$protienName)
    m <- subset(m, logvec)
    return(m)
}

m <- lapply(l, process_fetched_record)

Then you have a list m that you can access by index.
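
Since the question asks for a single data.frame rather than a list, the per-ID data frames can be stacked into one at the end. Here is a minimal sketch in base R only. Note that make_record() is a hypothetical stand-in for the real fetch-and-process step, and its column values are made up, so the example runs without network access or the Bioconductor packages:

```r
# Hypothetical stand-in for process_fetched_record(): returns a small
# data.frame for one ID. The values are invented for illustration only.
make_record <- function(id) {
  data.frame(
    protienName = paste0("Protein_", id, c("_a", "_b")),
    sequence = c("MKCC", "MMAC"),
    stringsAsFactors = FALSE
  )
}

ids <- c("124511", "124512", "124513")

# One data.frame per ID, kept in a list named by ID
records <- lapply(ids, make_record)
names(records) <- ids

# Stack all list elements row-wise into a single data.frame
combined <- do.call(rbind, records)

# Record which ID each row came from, then reset the row names
combined$id <- rep(names(records), times = vapply(records, nrow, integer(1)))
rownames(combined) <- NULL

print(combined)
```

Because the thread already loads data.table, data.table::rbindlist(records, idcol = "id") should be an equivalent one-call alternative; it returns a data.table rather than a plain data.frame.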

7.8 years ago

I'm not familiar with your code, and I made a few adjustments to get it to run, so I hope it still produces what you want.

You asked for a loop that iterates over your code without overwriting the output each time. I hope this helps.

#Add your list/data frame of IDs here (test IDs added; for a data frame, loop over seq_len(nrow(gene_list)) instead)
gene_list <- c("124511", "124512", "124513")

#Create an empty object to collect the output
list_collection <- NULL

#Loop over the IDs
for(gene in seq_along(gene_list)){

  #Select IDs one by one
  geneid <- gene_list[gene]

  #Your code 
  fetch <- efetch(geneid, db="protein", rettype = 'gp', retmode = "text")
  rec <- gbRecord(fetch)
  seq <- getSequence((ft(rec)))
  m <- as.data.frame(seq)
  setnames(m, "x", "sequence")
  protienName <- names(seq)
  m <- add_column(m, protienName, .after = 0)
  m$molecularweight <- mw(m$sequence)
  m$m<- str_count(m$sequence, 'm')
  m$cc <- str_count(m$sequence, 'cc')
  logvec <- grepl('(Protein)|(Region)', m$protienName)
  m <- subset(m, logvec)

  #Add the ID as the first column for downstream filtering
  m <- data.frame(geneid, m)

  #Append the new rows to the collected results without overwriting them
  list_collection <- rbind(list_collection, m)
}
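
The loop above grows the collected data frame with rbind() on every pass. This works on the first iteration because rbind() ignores the NULL starting value. With long ID lists, repeated rbind() calls copy the accumulated rows each time, so a common alternative is to store each result in a pre-allocated list and bind once after the loop. Here is a minimal sketch of both patterns, where toy_result() is a hypothetical stand-in for the fetch-and-process code and its value column is invented for illustration:

```r
# Hypothetical stand-in for the per-ID processing inside the loop
toy_result <- function(geneid) {
  data.frame(geneid = geneid, value = nchar(geneid), stringsAsFactors = FALSE)
}

gene_list <- c("124511", "124512", "124513")

# Pattern 1: grow the data.frame inside the loop, starting from NULL
grown <- NULL
for (i in seq_along(gene_list)) {
  grown <- rbind(grown, toy_result(gene_list[i]))
}

# Pattern 2: fill a pre-allocated list, then bind once at the end
results <- vector(mode = "list", length = length(gene_list))
for (i in seq_along(gene_list)) {
  results[[i]] <- toy_result(gene_list[i])
}
bound <- do.call(rbind, results)

# Both patterns hold one row per ID, in the same order
print(grown)
print(bound)
```

For a handful of IDs the difference is negligible, so the simpler growing pattern is a reasonable choice here.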

Probably not a good idea to name a variable after a built-in function (list).

7.8 years ago
zjhzwang ▴ 180

Maybe you can use dplyr::mutate; it adds new variables while preserving the existing ones.
