Question

read.table does not read in all rows!

3

8.5 years ago

Parham ★ 1.6k

Hi,

I ran into a strange issue while reading a data table from a txt file. If I read it with read.table, not all rows are included, but if I convert it to csv and read it with read.csv, it works perfectly. Does anyone know what the issue is, or is it my code?

Here is the file.

> test <- read.table("./Annotations/all_genes_pombase.txt",
+                    header = TRUE,
+                    sep = "\t",
+                    row.names = 1,
+                    stringsAsFactors = FALSE)
> dim(test)
[1] 4533    7
> str(test)
'data.frame':   4533 obs. of  7 variables:
 $ name        : chr  "SPAC1002.01" "pom34" "gls2" "taf11" ...
 $ chromosome  : chr  "I" "I" "I" "I" ...
 $ description : chr  "conserved fungal protein " "nucleoporin Pom34 " "glucosidase II alpha subunit Gls2 " "transcription factor TFIID complex subunit Taf11 (predicted) " ...
 $ feature_type: chr  "protein_coding" "protein_coding" "protein_coding" "protein_coding" ...
 $ strand      : int  1 1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ start       : int  1798347 1799061 1799915 1803624 1804548 1807270 1807996 1809480 1811408 1813740 ...
 $ end         : int  1799015 1800053 1803141 1804491 1806797 1807781 1809433 1811361 1813805 1815796 ...

R • 26k views

ADD COMMENT • link updated 2.1 years ago by Ram 45k • written 8.5 years ago by Parham ★ 1.6k

2

Try using read.delim instead, specifying the arguments appropriate for your text file.

ADD REPLY • link 8.5 years ago by steve ★ 3.5k

0

Thanks steve and Devon! read.delim works fine!

ADD REPLY • link 8.5 years ago by Parham ★ 1.6k

0

Also, it would be more useful to see the actual source txt file, for example the output of head in the terminal.

ADD REPLY • link 8.5 years ago by steve ★ 3.5k

1

The file is linked to in the post.

ADD REPLY • link 8.5 years ago by Devon Ryan 105k

0

It works fine with read.delim, for whatever that's worth.

ADD REPLY • link 8.5 years ago by Devon Ryan 105k

0

My solution is this: suppose gene.tab.matrix.v4.txt is the file you want to read. Replace all special characters such as ', ( and ) with - and then read the file with read.table. Perl one-liners to replace these characters:

perl -p -i -e "s/\'/-/g" gene.tab.matrix.v4.txt
perl -p -i -e "s/\(/-/g" gene.tab.matrix.v4.txt
perl -p -i -e "s/\)/-/g" gene.tab.matrix.v4.txt

ADD REPLY • link updated 5.2 years ago by Ram 45k • written 5.2 years ago by Shicheng Guo ★ 9.6k

3

A few points:

This post is 3+ years old and has already been answered well.
The question is about customizing R code to read a file properly, not manipulating a file so R won't have a problem with it. Changing data to suit code is bad practice.
In-place editing should not be recommended without proper warnings. Your commands will edit the file in-place and possibly wreck the user's data.
Nowhere does OP mention problems with the special characters you list, which makes your post mostly irrelevant to this question.

ADD REPLY • link 5.2 years ago by Ram 45k

score 15 · Accepted Answer · 2016-11-14

15

8.5 years ago

Santosh Anand 5.8k

Use the argument quote = "" inside read.table.

read.table("your_file", quote="", other.arguments)

Explanation: Your data has a single quote on line 59 (pyridoxamine 5'-phosphate oxidase (predicted)). The next single quote, which closes the one on line 59, is on line 137 (5'-hydroxyl-kinase activity...). Everything between a pair of quotes is read as a single field, and a quoted field can include newline characters. That's why you lose the lines in between. quote = "" disables quoting altogether.
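To see the effect described above on a small scale, here is a minimal sketch. The temp file and its gene rows are made up for illustration (they are not the PomBase file); only the two apostrophe-containing descriptions mimic the lines mentioned in the explanation.

```r
# Build a hypothetical 10-row tab-separated file; rows 6 and 9 contain an
# apostrophe, similar to the 5'-phosphate / 5'-hydroxyl descriptions.
desc <- rep("conserved fungal protein", 10)
desc[6] <- "pyridoxamine 5'-phosphate oxidase (predicted)"
desc[9] <- "polynucleotide 5'-hydroxyl-kinase activity"
rows <- paste(sprintf("g%02d", 1:10), paste0("gene", 1:10), desc, sep = "\t")
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tdescription", rows), tmp)

# Default quote = "\"'": the apostrophe in row 6 opens a quoted string that
# only closes at the apostrophe in row 9, so the rows in between are absorbed.
swallowed <- read.table(tmp, header = TRUE, sep = "\t",
                        row.names = 1, stringsAsFactors = FALSE)

# quote = "" turns quoting off, so every line stays its own record.
intact <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                     row.names = 1, stringsAsFactors = FALSE)

nrow(swallowed)  # fewer rows than the file has
nrow(intact)     # all data rows
```

Comparing the two row counts on your own file is a quick way to check whether quoting is what is eating your rows.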

There are more instances where this 'quoting' happens. One way to see how many fields read.table sees in each row is count.fields:

num.fields = count.fields("all_genes_pombase.txt", sep="\t")

Now look at the variable num.fields: it contains a lot of NAs, marking lines that fall inside a quoted string and so are not read correctly by read.table.
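A minimal sketch of that check, again on a made-up temp file rather than the real one (file contents and variable names other than num.fields are assumptions for illustration):

```r
# Hypothetical file: header plus 10 rows, apostrophes in rows 6 and 9.
desc <- rep("conserved fungal protein", 10)
desc[6] <- "pyridoxamine 5'-phosphate oxidase (predicted)"
desc[9] <- "polynucleotide 5'-hydroxyl-kinase activity"
rows <- paste(sprintf("g%02d", 1:10), paste0("gene", 1:10), desc, sep = "\t")
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tdescription", rows), tmp)

# count.fields uses the same default quote set as read.table ("\"'").
num.fields <- count.fields(tmp, sep = "\t")

# File line numbers (header included) that end inside an open quoted string.
bad.lines <- which(is.na(num.fields))
bad.lines

# With quoting disabled, every line should report the same number of fields.
num.fields.noquote <- count.fields(tmp, sep = "\t", quote = "")
table(num.fields.noquote)
```

The indices in bad.lines point at where a quote opens, which tells you which line of the file to inspect first.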

The problem doesn't arise with read.csv because the quoting defaults differ between read.table and read.csv, for some reason unknown to me:

read.table: quote = "\"'"
read.csv: quote = "\""
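You can check these defaults in your own R session instead of taking them on trust. The sketch below only inspects the function signatures; the variable name quote.defaults is made up for illustration. It also covers read.delim, whose default treats only the double quote as a quoting character, which is consistent with the comments above reporting that read.delim reads the file fine.

```r
# Look up the default value of the 'quote' argument for each reader.
readers <- list(read.table = read.table,
                read.csv   = read.csv,
                read.delim = read.delim)
quote.defaults <- vapply(readers, function(f) formals(f)$quote, character(1))

# Only read.table includes the single quote (apostrophe) in its default.
quote.defaults
grepl("'", quote.defaults, fixed = TRUE)
```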

PS: The best way to avoid read.table's file-reading nuisances is to use fread() from the data.table package. As a side benefit, it is blazing fast for large files and guesses the field separator automatically. See my earlier post: A: How to import huge .csv files in R studio?
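A minimal sketch of the fread() route, assuming the data.table package is installed (the code skips itself if it is not). The temp file and gene rows are made up for illustration. Note that fread() returns a data.table and has no row.names argument, so the id column stays an ordinary column here.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  # Hypothetical tab-separated file with apostrophes in two descriptions.
  tmp <- tempfile(fileext = ".txt")
  writeLines(c(
    "id\tname\tdescription",
    "g1\tgeneA\tpyridoxamine 5'-phosphate oxidase (predicted)",
    "g2\tgeneB\tconserved fungal protein",
    "g3\tgeneC\tnucleoporin",
    "g4\tgeneD\tpolynucleotide 5'-hydroxyl-kinase activity"
  ), tmp)

  # sep, header and quote are spelled out here to mirror the read.table call;
  # quote = "" disables quoting, as in the answer above.
  genes <- data.table::fread(tmp, sep = "\t", header = TRUE, quote = "")
  print(nrow(genes))
}
```

If you need a plain data.frame, fread() also accepts data.table = FALSE.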

ADD COMMENT • link 8.5 years ago by Santosh Anand 5.8k

0

Glad I refreshed before posting. I had just noticed the quoting issue, but your explanation is much more detailed than mine would have been :)

ADD REPLY • link 8.5 years ago by Devon Ryan 105k

0

thanks for the appreciation :)

ADD REPLY • link 8.5 years ago by Santosh Anand 5.8k

1

I also appreciate that you'd have to come to Biostars to get this answer, because < 3% of the coders on the Stack* pages are in the bioinformatics/biotech field and no answers there talk about the 3' and 5' :)

ADD REPLY • link 4.5 years ago by bjwiley23 ▴ 40

0

Thanks a million for the thorough explanation and troubleshooting! Very handy tips =)

ADD REPLY • link 8.5 years ago by Parham ★ 1.6k

0

Happy that it was helpful :). read.table is one of R's worst nightmares.

ADD REPLY • link 8.5 years ago by Santosh Anand 5.8k

0

Oh I need to edit a bunch of scripts to use fread() from now on. Lovely when R trips you up like this.