how to avoid R automatically converting strings to numbers
3
0
Entering edit mode
8.2 years ago
moxu ▴ 510

Suppose I have a tab delimited file as the following:

chr1 234 3.24
chr1 345 2.11
chr2 123 8.99
...
chrX 879 0.24
...

Then in R, I use "read.table" to read the file into a variable "d". The head of "d" looks normal:

chr1 234 3.24
chr1 345 2.11
chr2 123 8.99
...

But when I use "cbind(d[,1], d[,2], d[,3])" and assign it to another variable, say, "b", then "b" looks like

1 234 3.24
1 345 2.11
2 123 8.99
...
23 879 0.24 # "chrX" is automatically converted to "23"
...

That is odd. It looks like "cbind" treats characters as factors and uses the factor numbers (e.g. 1, 2, ..., 23) in place of the strings (chr1, chr2, ..., chrX).

How to avoid this?

I know this might not be the best forum for this question, but since you guys are so great, I believe some of you have the answer!

R software error • 28k views
ADD COMMENT
1
Entering edit mode

try to use this site to answer your question

http://rseek.org/

ADD REPLY
0
Entering edit mode

since you guys are so great

You don't want to do the necessary "re"search on the web?

ADD REPLY
0
Entering edit mode

I certainly did but found no answers. Weird.

ADD REPLY
0
Entering edit mode

So the problem is while reading the table, strings are read As Factors? Is that True?

ADD REPLY
0
Entering edit mode

The problem is that "cbind" automatically converts strings to factor numbers, e.g. "chr1" => "1", "chrM" => "23", "chrX" => "24", "chrY" => "25".
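
A small sketch of where those numbers come from, assuming the column was read as a factor (the default for read.table in R versions before 4.0.0). The chromosome labels here are made up for illustration. A factor stores integer codes that index its sorted levels, and those codes are what remain once the labels are dropped:

```r
# Hypothetical chromosome labels, for illustration only.
chr <- factor(c("chr1", "chr2", "chrX", "chr1"))

levels(chr)        # the distinct labels, in sorted order
as.integer(chr)    # the integer code stored for each element

# The codes index the levels, so the labels can be recovered.
codes <- as.integer(chr)
labels <- levels(chr)[codes]

# Converting to character before combining keeps the labels.
as.character(chr)
```

If the data has already been read as factors, wrapping the column in as.character() before combining is one way to keep the original strings.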

ADD REPLY
0
Entering edit mode

No, the problem is in read.table.

ADD REPLY
0
Entering edit mode

strings are read As Factors? Is that True?

;-)

ADD REPLY
7
Entering edit mode
8.2 years ago
ddiez ★ 2.0k

Although the point in the comments about the stringsAsFactors option is TRUE :-), the real problem in your specific case is that cbind is coercing your data into a matrix. In R, a matrix, by definition, can only hold a single data type: all integer, all numeric or all character. A factor does not survive this coercion, so cbind keeps only its underlying integer codes. See the following code examples:

# stringsAsFactors = TRUE
# wrong because the factors are coerced as numeric.
d <- data.frame(
  chr = c("A", "B"),
  start = c(1, 2),
  stringsAsFactors = TRUE
)
cbind(d$chr, d$start)
     [,1] [,2]
[1,]    1    1
[2,]    2    2

# stringsAsFactors = FALSE
# wrong because the numbers are coerced as character.
d <- data.frame(
  chr = c("A", "B"),
  start = c(1, 2),
  stringsAsFactors = FALSE
)
cbind(d$chr, d$start)
     [,1] [,2]
[1,] "A"  "1" 
[2,] "B"  "2"

So, if you use cbind on plain vectors, you are stuck no matter how you set stringsAsFactors originally or whether you use readr or any other tool to read your data, because a matrix can only have one type of data and you have two. The solution is to use a data.frame, which can handle different data types:

data.frame(chr2 = d$chr, start2 = d$start)
  chr2 start2
1    A      1
2    B      2

Don't forget to set stringsAsFactors as desired.
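
One way to confirm the result is to inspect the column classes. The sketch below rebuilds the toy data from the example above; the names `b` and `col_classes` are only illustrative:

```r
# Toy data matching the example above.
d <- data.frame(
  chr = c("A", "B"),
  start = c(1, 2),
  stringsAsFactors = FALSE
)

# Build the new data.frame, keeping strings as character.
b <- data.frame(chr2 = d$chr, start2 = d$start, stringsAsFactors = FALSE)

# Each column keeps its own type.
col_classes <- sapply(b, class)
print(col_classes)
```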

EDIT:

Note that cbind is doing this because you are passing two vectors. If you pass them as data.frames, cbind treats them as such and this problem is avoided:

cbind(d[, "chr", drop = FALSE], d[, "start", drop = FALSE])
  chr start
1   A     1
2   B     2

Of course, this solution is a lot more verbose.
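
A shorter route to the same result, as a sketch: indexing a data.frame with a vector of column names returns a data.frame, so cbind is not needed at all. The toy data is the same as in the examples above:

```r
d <- data.frame(
  chr = c("A", "B"),
  start = c(1, 2),
  stringsAsFactors = FALSE
)

# Selecting several columns by name keeps the data.frame structure
# and the type of each column.
b <- d[, c("chr", "start")]

is.data.frame(b)
sapply(b, class)
```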

ADD COMMENT
0
Entering edit mode

Good catch, I stopped reading after seeing the stringsAsFactors issue.

ADD REPLY
0
Entering edit mode

Thanks. Almost gave up myself because there were a lot of good comments. I love working with R but these nuances can be really frustrating.

ADD REPLY
0
Entering edit mode

This is the best answer!

"cbind" causes a lot of problems, and using "data.frame" the way you mentioned resolved all the troublesome issues.

ADD REPLY
3
Entering edit mode
8.2 years ago

This is a benefit of using the readr package rather than base R when reading tables: string handling (the stringsAsFactors option is what you were looking for) is set in a more coherent way.

ADD COMMENT
0
Entering edit mode

actually just checked, "read.table" takes "stringsAsFactors" and it worked!

Thanks!

ADD REPLY
0
Entering edit mode

Remember to "accept" the answers (use check mark against the answers) that solved your problem. You can choose more than one.

ADD REPLY
3
Entering edit mode
8.2 years ago

The correct way -- and by correct, I mean correct: to specify with complete precision -- to solve this is to use colClasses with read.table, which coerces columns into type classes, like character, numeric, factor, etc.

For instance, in your case:

read.table(someFile, ..., colClasses=c("character", "numeric", "numeric"))

See ?read.table for more information.
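
A minimal sketch of this, using inline text in place of a file so that it runs on its own. The three-column layout is taken from the question; the `text` argument and the sample rows are stand-ins for someFile:

```r
# Inline stand-in for the tab-delimited file from the question.
lines <- "chr1\t234\t3.24\nchr1\t345\t2.11\nchr2\t123\t8.99\nchrX\t879\t0.24"

d <- read.table(
  text = lines,
  sep = "\t",
  colClasses = c("character", "numeric", "numeric")
)

# Show the type of every column.
str(d)
```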

ADD COMMENT
1
Entering edit mode

OK, this works.

However, if the column names of "someFile" vary from file to file, it would be impossible to predefine which column is "character" or "numeric". Is there a way to force "cbind"?

Thanks.

ADD REPLY
1
Entering edit mode

You can pass stringsAsFactors = FALSE to read.table so no need to specify all the column classes.

ADD REPLY
0
Entering edit mode

Also, the argument as.is = TRUE does the same trick.
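
The two options can be compared side by side. This sketch again uses inline text as a stand-in for the file. Neither call needs to know the column types in advance, because read.table still detects numeric columns on its own:

```r
lines <- "chr1\t234\t3.24\nchr2\t123\t8.99\nchrX\t879\t0.24"

# Option 1: never convert strings to factors.
a <- read.table(text = lines, sep = "\t", stringsAsFactors = FALSE)

# Option 2: as.is = TRUE leaves character columns as they are.
b <- read.table(text = lines, sep = "\t", as.is = TRUE)

sapply(a, class)
sapply(b, class)
```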

ADD REPLY
