Identical files present in two different directories - R programming
8.7 years ago
SGMS ▴ 130

Hi,

I was wondering whether any of you are aware of an R command similar to the "diff" command in Unix. I need to loop through files and make a calculation based on their names (which are the same).

I did find how to do this in unix but I need to do it in R. Any help would be great.

Thank you

R identical files diff command • 2.1k views

The normal way to do this in R would be to read in both files and then compare them in memory.
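A minimal sketch of that idea (the demo files are created on the fly here so the example is self-contained; in practice you would pass your own file paths):

```r
# Create two small demo files with identical contents.
f1 <- tempfile()
f2 <- tempfile()
writeLines(c("gene\tvalue", "MGAT1\t2"), f1)
writeLines(c("gene\tvalue", "MGAT1\t2"), f2)

# Read both files into memory as character vectors of lines,
# then compare them as whole objects.
identical(readLines(f1), readLines(f2))   # TRUE only if every line matches
```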


Thanks Ryan,

I'm not sure this will work, though. I first need to read the files, and then compare the contents of each pair of identically named files.

Do you think it would work?

Thanks

4
8.7 years ago

using md5sum

This method will tell you whether two files are exactly identical. It is the fastest method, but you won't be able to tell whether the files differ by just a comma or are completely different.

> library(tools)
> md5sum("file1.txt")
                         file1.txt
"dad2655f134033752623fc8a33688d36"
> md5sum("file2.txt")
                         file2.txt
"dad2655f134033752623fc8a33688d36"

> md5sum("file3.txt")
                         file3.txt
"2dd556197e6bc2766fe1247e0afd678b"

using scan

If you think the files differ by just a few fields, you can use scan() to read the contents, without having to resort to read.csv (which is slower and more memory-intensive).

> scan(file="file1.txt", "raw")
Read 8 items
[1] "gene"  "value" "MGAT1" "2"     "MGAST" "1"     "AABCD" "5"
>
> scan(file="file3.txt", "raw")
Read 8 items
[1] "gene"  "value" "DIFFERENT" "2"     "MGAST" "1"     "AABCD" "5"
>

> scan(file="file1.txt", "raw") == scan("file3.txt", "raw")
Read 8 items
Read 8 items
[1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
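That element-wise comparison can be reduced to a single yes/no answer with all(), or used to locate the differing fields with which(). A self-contained sketch (temp files stand in for file1.txt and file3.txt):

```r
# Demo files: one field differs between them.
f1 <- tempfile()
f3 <- tempfile()
writeLines("gene value MGAT1 2", f1)
writeLines("gene value DIFFERENT 2", f3)

# Read each file as a flat character vector of whitespace-separated fields.
a <- scan(f1, what = "character", quiet = TRUE)
b <- scan(f3, what = "character", quiet = TRUE)

all(a == b)     # FALSE: the files differ somewhere
which(a != b)   # 3: the position of the differing field
```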

EDIT

Following the clarification, I would build a data frame listing all the files in the folder and use it to identify the duplicated names. Creating a data frame is a bit of overkill, but if the file structure is complex it keeps things organized.

.
├── README
├── file1.txt
├── file2.txt
└── other
    └── file1.txt

> library(dplyr)
> library(tools)
> myfiles = data.frame(path=list.files(".", full.names=TRUE, recursive=TRUE), stringsAsFactors=FALSE)   %>% 
    mutate(filename=basename(path))
> myfiles
               path  filename
1          ./README    README
2       ./file1.txt file1.txt
3       ./file2.txt file2.txt
4 ./other/file1.txt file1.txt

If the files are not too big, I would consider doing a md5sum on all the files directly:

> myfiles %>% mutate(md5=md5sum(path))
               path  filename                              md5
1          ./README    README 9fd2cfdc6eab90df4f5bdd70913eaf22
2       ./file1.txt file1.txt 1a3e7e650c8cd5dd51a84cf84fd66f5c
3       ./file2.txt file2.txt d41d8cd98f00b204e9800998ecf8427e
4 ./other/file1.txt file1.txt d41d8cd98f00b204e9800998ecf8427e

> myfiles %>% 
    mutate(md5=md5sum(path)) %>% 
    group_by(md5) %>% 
    summarise(totfiles=n(), files=paste(path, collapse=','))
Source: local data frame [3 x 3]

                               md5 totfiles                         files
                             (chr)    (int)                         (chr)
1 1a3e7e650c8cd5dd51a84cf84fd66f5c        1                   ./file1.txt
2 9fd2cfdc6eab90df4f5bdd70913eaf22        1                      ./README
3 d41d8cd98f00b204e9800998ecf8427e        2 ./file2.txt,./other/file1.txt

This tells you that my file2.txt and other/file1.txt are identical (they are actually empty files).

If there are too many files and you want to apply the function only to those that share a name, just use duplicated() to filter the duplicates first:

> myfiles %>% filter(duplicated(filename) | duplicated(filename,fromLast=T))
               path  filename
1       ./file1.txt file1.txt
2 ./other/file1.txt file1.txt

> myfiles %>% 
      filter(duplicated(filename) | duplicated(filename,fromLast=T))  %>%  # filter duplicated file names only
      mutate(md5=md5sum(path)) %>% 
      group_by(md5) %>% 
      summarise(totfiles=n(), files=paste(path, collapse=','))
Source: local data frame [2 x 3]

                               md5 totfiles             files
                             (chr)    (int)             (chr)
1 1a3e7e650c8cd5dd51a84cf84fd66f5c        1       ./file1.txt
2 d41d8cd98f00b204e9800998ecf8427e        1 ./other/file1.txt

Thanks a lot Giovanni,

What you describe above is about finding whether the contents of two files are identical, right? I need to compare files only when they have the same name (identical names), though.

Sorry for the confusion.

Thanks

1

Then his suggestion of MD5 is still applicable, I think. The MD5 hash doesn't depend on the file name or any file metadata; it depends only on the contents of the file.
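A quick sketch illustrating that point: two temp files with different name prefixes but identical contents produce the same hash.

```r
library(tools)

# Two files with deliberately different names but the same contents.
f1 <- tempfile("nameA_")
f2 <- tempfile("nameB_")
writeLines("same content", f1)
writeLines("same content", f2)

# md5sum returns a named vector (names are the paths), so drop the
# names before comparing the hash values themselves.
unname(md5sum(f1)) == unname(md5sum(f2))   # TRUE despite the different names
```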
