using md5sum
This method will tell you whether two files are exactly identical. It is the fastest approach, but it won't tell you whether the files differ by just a comma or are completely different.
> library(tools)
> md5sum("file1.txt")
file1.txt
"dad2655f134033752623fc8a33688d36"
> md5sum("file2.txt")
file2.txt
"dad2655f134033752623fc8a33688d36"
> md5sum("file3.txt")
file3.txt
"2dd556197e6bc2766fe1247e0afd678b"
using scan
If you think the files differ by just a few fields, you can use scan() to read their contents without resorting to read.csv() (which is slower and more memory-intensive).
> scan(file="file1.txt", "raw")
Read 8 items
[1] "gene" "value" "MGAT1" "2" "MGAST" "1" "AABCD" "5"
>
> scan(file="file3.txt", "raw")
Read 8 items
[1] "gene" "value" "DIFFERENT" "2" "MGAST" "1" "AABCD" "5"
>
> scan(file="file1.txt", "raw") == scan("file3.txt", "raw")
Read 8 items
Read 8 items
[1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
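If you want the positions of the mismatches rather than a logical vector, wrap the comparison in which(). A short sketch, assuming both files have the same number of fields:

# read both files as character vectors
a <- scan("file1.txt", "raw")
b <- scan("file3.txt", "raw")
# indices of the fields that differ (here: 3)
which(a != b)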
EDIT
Following the clarification, I would build a data frame listing all the files in the folder and use it to identify duplicated names. Creating a data frame is a bit of overkill, but if the file structure is complex it keeps things organized.
.
├── README
├── file1.txt
├── file2.txt
└── other
└── file1.txt
> library(dplyr)
> library(tools)
> myfiles = data.frame(path=list.files(".", full.names=TRUE, recursive=TRUE), stringsAsFactors=FALSE) %>%
    mutate(filename=basename(path))
> myfiles
path filename
1 ./README README
2 ./file1.txt file1.txt
3 ./file2.txt file2.txt
4 ./other/file1.txt file1.txt
If the files are not too big, I would just run md5sum() on all of them directly:
> myfiles %>% mutate(md5=md5sum(path))
path filename md5
1 ./README README 9fd2cfdc6eab90df4f5bdd70913eaf22
2 ./file1.txt file1.txt 1a3e7e650c8cd5dd51a84cf84fd66f5c
3 ./file2.txt file2.txt d41d8cd98f00b204e9800998ecf8427e
4 ./other/file1.txt file1.txt d41d8cd98f00b204e9800998ecf8427e
> myfiles %>%
mutate(md5=md5sum(path)) %>%
group_by(md5) %>%
summarise(totfiles=n(), files=paste(path, collapse=','))
Source: local data frame [3 x 3]
md5 totfiles files
(chr) (int) (chr)
1 1a3e7e650c8cd5dd51a84cf84fd66f5c 1 ./file1.txt
2 9fd2cfdc6eab90df4f5bdd70913eaf22 1 ./README
3 d41d8cd98f00b204e9800998ecf8427e 2 ./file2.txt,./other/file1.txt
This tells you that file2.txt and other/file1.txt are identical (they are, in fact, empty files).
If there are too many files and you want to checksum only those that share a name, use duplicated() to filter the duplicated file names first:
> myfiles %>% filter(duplicated(filename) | duplicated(filename, fromLast=TRUE))
path filename
1 ./file1.txt file1.txt
2 ./other/file1.txt file1.txt
> myfiles %>%
    filter(duplicated(filename) | duplicated(filename, fromLast=TRUE)) %>% # keep only duplicated file names
mutate(md5=md5sum(path)) %>%
group_by(md5) %>%
summarise(totfiles=n(), files=paste(path, collapse=','))
Source: local data frame [2 x 3]
md5 totfiles files
(chr) (int) (chr)
1 1a3e7e650c8cd5dd51a84cf84fd66f5c 1 ./file1.txt
2 d41d8cd98f00b204e9800998ecf8427e 1 ./other/file1.txt
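To keep only the groups that actually contain identical files, add a filter on the group size at the end of the pipeline. A sketch building on the summary above:

myfiles %>%
  mutate(md5=md5sum(path)) %>%
  group_by(md5) %>%
  summarise(totfiles=n(), files=paste(path, collapse=',')) %>%
  filter(totfiles > 1)  # keep only checksums shared by more than one file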