So I have a gzipped test file that looks like this:
##ColumnVariables[gene_id]
##ParemeterValue[genome_assembly]=hg19
##ColumnVariables[column_1]
##ColumnVariables[column_2]
AAA 10 48
BBB 3 99
My code for opening the file looks like this:
# Specify the filename
gzip.file <- "testfile.txt.gz"
# Read the data into a data frame.
# If we open the file this way, the first row is going to be the filename.
# So delete it.
gzip.df <- read.table(gzfile(gzip.file), header=FALSE, fill=TRUE, comment.char = '!')
gzip.df <- gzip.df[-1,]
What I want to do is use a regex to extract all the column_* entries,
store them in a vector, delete them from the data frame, and assign them as colnames.
My problem is that I'm an R newbie, and at this point I'm stuck at reading the data frame line by line.
Here's my code:
col.names <- c()
for (i in 1:nrow(gzip.df)) {
  #paste (gzip.df[i,])
  if (regexpr('\\[(.*)\\]', gzip.df[i,1])) {
    paste("HI")
    col <- regmatches(test.str, regexpr('\\[(.*)\\]', gzip.df[i,1]))
    col.names <- c(col.names, gzip.df[i,1])
  }
}
So basically I want to read it line by line, and whenever my regex matches,
treat the line as a column name and store it. But it never gets into the if-scope.
Why is that happening?
FYI, this is my 3rd R script; I decided to move over from Python.
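A couple of things about the loop above are worth knowing. regexpr() returns -1 when there is no match, and if() treats any non-zero number as TRUE, so it is not a true/false test; grepl() is. Also, paste() does not print anything inside a for loop (print() or cat() would), so it can look as if the branch never runs. Below is a minimal sketch of a vectorized alternative, assuming that every ##ColumnVariables[...] line names one column, in order, and that the remaining non-## lines are whitespace-separated data. The temp file and all variable names are made up for illustration.

```r
# Hypothetical stand-in for the gzipped file from the question, written to a temp path.
sample_lines <- c(
  "##ColumnVariables[gene_id]",
  "##ParemeterValue[genome_assembly]=hg19",
  "##ColumnVariables[column_1]",
  "##ColumnVariables[column_2]",
  "AAA 10 48",
  "BBB 3 99"
)
gzip_path <- tempfile(fileext = ".txt.gz")
con <- gzfile(gzip_path, "w")
writeLines(sample_lines, con)
close(con)

# Read every line as plain text; gzfile() decompresses transparently.
con <- gzfile(gzip_path, "r")
all_lines <- readLines(con)
close(con)

# grepl() returns TRUE/FALSE, unlike regexpr(), which returns -1 when nothing matches.
is_header <- grepl("^##", all_lines)
is_column <- grepl("^##ColumnVariables\\[", all_lines)

# Pull out the text between the square brackets on the column lines.
col_names <- sub("^##ColumnVariables\\[(.*)\\]$", "\\1", all_lines[is_column])

# Parse the remaining lines as the data table and attach the extracted names.
data_df <- read.table(text = all_lines[!is_header], header = FALSE,
                      col.names = col_names, stringsAsFactors = FALSE)
print(data_df)
```

Working on the raw lines with readLines() avoids looping over data frame rows entirely, and the header lines never end up in the data frame in the first place.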
Use Unix command-line tools to parse the file into a headered text file, where the first row consists of column headers. Then reading this modified file into R is trivial with read.table() or fread(). Use R for its strengths, which do not include text parsing.
I agree, mainly because for loops are not handled well by R (they are very slow). Preparing the files on the command line before R is a much simpler solution.
Also, are the column names identical for all files? If so, then just store the header as a string in R and add the string as the first line of every file you read in.
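If the names really are the same everywhere, one way to do this without touching the files is to keep the names in a vector and hand them to read.table() via col.names. This is only a sketch: shared_names, write_demo() and the demo files are hypothetical, and it relies on comment.char = "#" to make read.table() skip the ## lines.

```r
# Hypothetical column names, assumed to be identical across all files.
shared_names <- c("gene_id", "column_1", "column_2")

# Demo helper (hypothetical): write a small gzipped file with '##' header lines.
write_demo <- function(rows) {
  path <- tempfile(fileext = ".txt.gz")
  con <- gzfile(path, "w")
  writeLines(c("##ColumnVariables[gene_id]",
               "##ColumnVariables[column_1]",
               "##ColumnVariables[column_2]",
               rows), con)
  close(con)
  path
}
paths <- c(write_demo(c("AAA 10 48", "BBB 3 99")), write_demo("CCC 7 12"))

# Read each file, skipping lines that start with '#', and apply the stored names.
tables <- lapply(paths, function(p) {
  read.table(gzfile(p), header = FALSE, comment.char = "#",
             col.names = shared_names, stringsAsFactors = FALSE)
})

# Stack the per-file tables into one data frame.
combined <- do.call(rbind, tables)
print(combined)
```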
Yeah, not to be all chatty and stuff, but using R like this is like when my dad can't find a crescent wrench and so duct-tapes two flathead screwdrivers together. I guess it works.