using md5sum
This method will tell you whether two files are exactly identical. It is the fastest approach, but it won't tell you whether the files differ by just a comma or are completely different.
> library(tools)
> md5sum("file1.txt")
file1.txt
"dad2655f134033752623fc8a33688d36"
> md5sum("file2.txt")
file2.txt
"dad2655f134033752623fc8a33688d36"
> md5sum("file3.txt")
file3.txt
"2dd556197e6bc2766fe1247e0afd678b"
using scan
If you think the files differ by just a few fields, you can use scan() to read their contents without resorting to read.csv() (which is slower and more memory-intensive).
> scan(file="file1.txt", "raw")
Read 8 items
[1] "gene" "value" "MGAT1" "2" "MGAST" "1" "AABCD" "5"
>
> scan(file="file3.txt", "raw")
Read 8 items
[1] "gene" "value" "DIFFERENT" "2" "MGAST" "1" "AABCD" "5"
>
> scan(file="file1.txt", "raw") == scan("file3.txt", "raw")
Read 8 items
Read 8 items
[1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
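If you want the positions of the mismatches rather than a logical vector, wrap the comparison in which(). A short sketch, assuming both files have the same number of fields:

# read both files as character vectors
a <- scan("file1.txt", "raw")
b <- scan("file3.txt", "raw")
# indices of the fields that differ (here: 3)
which(a != b)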
EDIT
Following the clarification, I would build a data frame listing all the files in the folder and use it to identify duplicated names. Creating a data frame is a bit of overkill, but if the file structure is complex it keeps things organized.
.
├── README
├── file1.txt
├── file2.txt
└── other
└── file1.txt
> library(dplyr)
> library(tools)
> myfiles = data.frame(path=list.files(".", full.names=TRUE, recursive=TRUE), stringsAsFactors=FALSE) %>%
    mutate(filename=basename(path))
> myfiles
path filename
1 ./README README
2 ./file1.txt file1.txt
3 ./file2.txt file2.txt
4 ./other/file1.txt file1.txt
If the files are not too big, I would just run md5sum() on all of them directly:
> myfiles %>% mutate(md5=md5sum(path))
path filename md5
1 ./README README 9fd2cfdc6eab90df4f5bdd70913eaf22
2 ./file1.txt file1.txt 1a3e7e650c8cd5dd51a84cf84fd66f5c
3 ./file2.txt file2.txt d41d8cd98f00b204e9800998ecf8427e
4 ./other/file1.txt file1.txt d41d8cd98f00b204e9800998ecf8427e
> myfiles %>%
mutate(md5=md5sum(path)) %>%
group_by(md5) %>%
summarise(totfiles=n(), files=paste(path, collapse=','))
Source: local data frame [3 x 3]
md5 totfiles files
(chr) (int) (chr)
1 1a3e7e650c8cd5dd51a84cf84fd66f5c 1 ./file1.txt
2 9fd2cfdc6eab90df4f5bdd70913eaf22 1 ./README
3 d41d8cd98f00b204e9800998ecf8427e 2 ./file2.txt,./other/file1.txt
This tells you that file2.txt and other/file1.txt are identical (they are, in fact, empty files).
If there are too many files and you want to checksum only those that share a name, use duplicated() to filter the duplicated file names first:
> myfiles %>% filter(duplicated(filename) | duplicated(filename, fromLast=TRUE))
path filename
1 ./file1.txt file1.txt
2 ./other/file1.txt file1.txt
> myfiles %>%
    filter(duplicated(filename) | duplicated(filename, fromLast=TRUE)) %>% # keep only duplicated file names
mutate(md5=md5sum(path)) %>%
group_by(md5) %>%
summarise(totfiles=n(), files=paste(path, collapse=','))
Source: local data frame [2 x 3]
md5 totfiles files
(chr) (int) (chr)
1 1a3e7e650c8cd5dd51a84cf84fd66f5c 1 ./file1.txt
2 d41d8cd98f00b204e9800998ecf8427e 1 ./other/file1.txt
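To keep only the groups that actually contain identical files, add a filter on the group size at the end of the pipeline. A sketch building on the summary above:

myfiles %>%
  mutate(md5=md5sum(path)) %>%
  group_by(md5) %>%
  summarise(totfiles=n(), files=paste(path, collapse=',')) %>%
  filter(totfiles > 1)  # keep only checksums shared by more than one file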