49

I have two big files (6GB each). They are unsorted, with linefeeds (\n) as separators. How can I diff them? It should take under 24h.

Michael Mrozek
Jonas Lejon

4 Answers

64

The most obvious answer is just to use the diff command, and it is probably a good idea to add the --speed-large-files option:

diff --speed-large-files a.file b.file

You mention that the files are unsorted, so you may need to sort them first:

sort a.file > a.file.sorted
sort b.file > b.file.sorted
diff --speed-large-files a.file.sorted b.file.sorted

You could avoid creating the second sorted file by piping the output of the second sort directly into diff:

sort a.file > a.file.sorted
sort b.file | diff --speed-large-files a.file.sorted -

Obviously these will run best on a system with plenty of available memory and you will likely need plenty of free disk space too.
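
As a rough sketch of how memory and disk use can be steered with GNU sort: `-S` sets the in-memory buffer size and `-T` chooses where temporary spill files go, while `LC_ALL=C` requests byte-wise comparison, which is usually faster. The tiny sample files, the `workdir` scratch directory, and the 64M buffer are placeholders for illustration only; for 6GB inputs you would pick a buffer of a few GB and a temp directory on a disk with enough free space.

```shell
#!/usr/bin/env bash
# Sketch: sort both inputs with explicit buffer and temp-dir settings, then diff.
set -u

workdir=$(mktemp -d)   # hypothetical scratch directory
printf 'pear\napple\nfig\n'   > "$workdir/a.file"
printf 'fig\nbanana\napple\n' > "$workdir/b.file"

# LC_ALL=C: byte-wise ordering instead of locale-aware collation.
# -S: size of the in-memory sort buffer.
# -T: directory where sort writes its temporary files.
for f in a b; do
    LC_ALL=C sort -S 64M -T "$workdir" "$workdir/$f.file" > "$workdir/$f.file.sorted"
done

# diff exits with status 1 when the files differ, so that is not an error here.
result=$(diff --speed-large-files "$workdir/a.file.sorted" "$workdir/b.file.sorted") || true
printf '%s\n' "$result"
```

Both files must be sorted under the same locale setting, otherwise identical lines may end up in different positions and show up as spurious differences.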

It wasn't clear from your question whether you have tried these before. If so, it would be helpful to know what went wrong (took too long, etc.). I have always found that the stock sort and diff commands tend to do at least as well as custom commands, unless there are some very domain-specific properties of the files that make it possible to do things differently.

Richm
  • 2
    +1. You can omit all temporary files with named pipes. Use `mkfifo` to create `[ab].file.sorted` before using them as output for `sort`. Put both `sort`s in the background with `&` and pass both pipes as filenames to diff. – krissi Sep 16 '10 at 11:45
  • 15
    @krissi You can also accomplish the same effect using this syntax: `diff <(command 1) <(command 2)` – Michael Mrozek Sep 16 '10 at 14:12
  • Thanks, that worked. I needed a couple of GB of memory though, but a 16GB Amazon EC2 instance fixed it :) – Jonas Lejon Sep 16 '10 at 15:28
  • 7
    If someone like me wonders why `<(cmd1) <(cmd2)` syntax works (as it sounds like redirecting standard input twice!), try `echo hello <(cmd1) <(cmd2)`. You'll see something like `hello /dev/fd/63 /dev/fd/62` which suddenly makes it clear ;) – alex Sep 16 '10 at 19:53
  • 1
    Note that diff easily runs out of memory if you have really large files. For creating patches/deltas at least there are some other options: http://unix.stackexchange.com/a/77259/27186 though perhaps not for visually inspecting changes – unhammer Mar 05 '14 at 09:20
  • 5
    In my experience, the `--speed-large-files` option does not help if you do not have enough RAM. Also, pre-sorting is not helpful if you have a multi-line record structure you wish to preserve. The options referred to above (by @unhammer) are interesting, but the output from `rdiff` and `bsdiff` is rather binary. Installing `bdiff` from the Heirloom Toolbox looks like a daunting task (requires Heirloom devtools, extinct header files, …). Is it really worth the effort? Are there other alternatives? – Christian Pietsch Feb 02 '15 at 17:33
  • For large files with very few differences, where you want to visually inspect the differences, I use this ugly hack: https://github.com/unhammer/diff-large-files – it works for that specific purpose, when `diff` would otherwise run out of memory. (If anyone is wondering how `sort` manages to not run out of RAM, check the size of your /tmp folder while you're sorting that 10GB file … wouldn't it be nice if `diff` could do the same thing :).) – unhammer Feb 03 '15 at 09:49
  • Got a tip via https://news.ycombinator.com/item?id=13985216 – just `wget https://raw.githubusercontent.com/Arkanosis/Arkonf/master/tools-src/bdiff.c` and `gcc -Wno-long-long bdiff.c -o bdiff`. For my test case at least, with few differences over large files, it uses zero memory! – unhammer Mar 30 '17 at 12:31
  • @MichaelMrozek you should put your command as an answer. – Arvind Sridharan Jul 08 '19 at 06:33
  • 1
    Thanks @unhammer and @ChristianPietsch - I was trying to `diff` two 5GB binary files on an 8GB machine and it kept responding with `Killed`. In my case, I used `split` to split each file into 1GB chunks, then ran each pair of chunks through `diff`. All I wanted to see was whether they were the same or not, so that was good enough for me. – mdmay74 Jun 21 '23 at 04:13
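
The two suggestions from the comments by krissi and Michael Mrozek (named pipes and process substitution) might look like the sketch below. The sample files and the `workdir` scratch directory are made up for illustration. Both variants only avoid writing the sorted copies to disk; as far as I know they do not reduce how much memory `diff` itself needs.

```shell
#!/usr/bin/env bash
# Sketch: diff two sorted streams without storing the sorted files on disk.
set -u

workdir=$(mktemp -d)   # hypothetical scratch directory
printf 'pear\napple\nfig\n'   > "$workdir/a.file"
printf 'fig\nbanana\napple\n' > "$workdir/b.file"

# Variant 1: named pipes. Each sort runs in the background and writes into
# a FIFO; diff reads the two FIFOs as if they were ordinary files.
mkfifo "$workdir/a.sorted" "$workdir/b.sorted"
sort "$workdir/a.file" > "$workdir/a.sorted" &
sort "$workdir/b.file" > "$workdir/b.sorted" &
via_fifo=$(diff --speed-large-files "$workdir/a.sorted" "$workdir/b.sorted") || true
wait   # reap the background sort jobs

# Variant 2: process substitution (bash, zsh, ksh). The shell replaces each
# <(...) with a path such as /dev/fd/63 that diff opens for reading.
via_procsub=$(diff --speed-large-files <(sort "$workdir/a.file") <(sort "$workdir/b.file")) || true

printf '%s\n' "$via_procsub"
```

Process substitution is not part of POSIX sh, so the named-pipe variant is the one to reach for in a plain `sh` script.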
10

Sorting the inputs and telling the diff program that its inputs are sorted would provide a massive speedup. I don't know of any diff with an option like that, but comm assumes sorted input and will be much quicker if it does enough for your purposes.
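
A minimal sketch of the `comm` approach, using small made-up sample files in a hypothetical `workdir`: `comm` prints three columns (lines only in the first file, lines only in the second, lines in both), and the `-1`, `-2`, `-3` options suppress the corresponding column. Setting `LC_ALL=C` for both `sort` and `comm` is my own addition, so that the two tools agree on the sort order.

```shell
#!/usr/bin/env bash
# Sketch: use comm on sorted inputs to list lines unique to each file.
set -u
export LC_ALL=C   # comm expects input sorted in the same collation order it uses

workdir=$(mktemp -d)   # hypothetical scratch directory
printf 'pear\napple\nfig\n'   > "$workdir/a.file"
printf 'fig\nbanana\napple\n' > "$workdir/b.file"
sort "$workdir/a.file" > "$workdir/a.file.sorted"
sort "$workdir/b.file" > "$workdir/b.file.sorted"

# Suppress columns 2 and 3: lines that appear only in a.file.
only_in_a=$(comm -23 "$workdir/a.file.sorted" "$workdir/b.file.sorted")
# Suppress columns 1 and 3: lines that appear only in b.file.
only_in_b=$(comm -13 "$workdir/a.file.sorted" "$workdir/b.file.sorted")
# Suppress columns 1 and 2: lines common to both files.
in_both=$(comm -12 "$workdir/a.file.sorted" "$workdir/b.file.sorted")

printf 'only in a: %s\nonly in b: %s\n' "$only_in_a" "$only_in_b"
```

Because `comm` walks both files once in step, it only needs to hold the current line of each file, which is why it copes with inputs far larger than RAM.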

Karl
1

The bdiff tool can work on unsorted files much larger than the computer's RAM.

Use these steps once, to download and compile bdiff before using it the first time:

wget https://github.com/Arkanosis/Arkonf/raw/master/tools-src/bdiff.c && \
  gcc -Wformat=0 -Wno-long-long bdiff.c -o bdiff && \
  rm bdiff.c

To run bdiff and compare 2 files:

./bdiff a.file b.file

You might find it helpful to redirect the bdiff output to a file. Thanks to @unhammer for the suggestion and the link to the Git repository.

Gogowitsch
0

I tried the solutions on this page when I had problems using diff on some large text files a couple of days ago, but didn't find anything that worked for me, so I wrote a file comparison program specifically to handle large text files. It seemed only fair to return here and let you know it's available. I've only used it myself, and I'd appreciate it if anyone else having problems with large text files could try it and report whether it works for you too. Code is at https://github.com/gtoal/bigfile-diff-compare