
I do a ton of file compression. Most of the stuff I am compressing is just code, so I need to use lossless compression.

I wondered if there was anything that offers a better size reduction than 7zip. It doesn't matter how long it takes to compress or decompress; size is all that matters.

Does anyone know how the various tools and compression algorithms available in Linux compare for compressing text? Or is 7zip the best for compressing source code?

Kusalananda
Zach

6 Answers


lrzip is what you're really looking for, especially if you're compressing source code!

Quoting the README:

This is a compression program optimised for large files. The larger the file and the more memory you have, the better the compression advantage this will provide, especially once the files are larger than 100MB. The advantage can be chosen to be either size (much smaller than bzip2) or speed (much faster than bzip2). [...] The unique feature of lrzip is that it tries to make the most of the available ram in your system at all times for maximum benefit.

lrzip works by first scanning for and removing any long-distance data redundancy with an rzip-based algorithm, then compressing the non-redundant data.

Con Kolivas provides a fantastic example on the Linux kernel mailing list, wherein he compresses a 10.3 GB tarball of forty Linux kernel releases down to 163.9 MB (1.6%), and does so faster than xz. He wasn't even using the most aggressive second-pass algorithm!

I'm sure you'll have great results compressing massive tarballs of source code :)

sudo apt-get install lrzip

Example (using defaults for the other options):

Ultra compression, dog slow:

lrzip -z file

For folders, just replace lrzip with lrztar.
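As a sketch of the whole workflow (the directory and file names here are illustrative, not from the original post):

```shell
# lrzip compresses a single file, so archive the tree first:
tar -cf sources.tar src/     # src/ is a hypothetical source directory
lrzip -z sources.tar         # -z = ZPAQ second pass: smallest, slowest
                             # produces sources.tar.lrz

# lrztar wraps the tar step for a whole directory:
lrztar -z src/               # produces src.tar.lrz

# To get the data back:
lrunzip sources.tar.lrz      # restores sources.tar
```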

Jakuje
Alexander Riccio
  • I also can contest that `lrzip` also works really great for backups of `tar/cpio/pax`'ed system file trees, because those usually contain lots of long range redundancies, something that `lrzip` is really good at compressing. – Franki Nov 27 '14 at 07:11
  • I've tried `lrzip` and `pixz` on a 19 GB text file. Both took about half an hour to compress it (on a hexa-core machine), but the `lrz` file was half the size of the `xz` file (2.7 vs. 4.4 GB). So, another vote for this answer instead. – fnl Jan 20 '15 at 12:04
  • @Franki by 'contest', do you mean 'attest'? – mitchus Nov 02 '15 at 11:07
  • Feels like Pied Piper! – Denys Vitali Oct 29 '16 at 14:02
  • Do you know what the difference would be between lrzip and rzip? rzip looks like it was released in 1998, designed to do best on very large files with long-distance redundancy, so it sounds similar to lrzip -- just wondering if lrzip was derived from rzip? (rzip from http://rzip.samba.org/) – Astara Jan 17 '17 at 09:56
  • on archlinux (with 2gb ram), while trying to compress 56Mb pdf, it gets to 18 chunks, then `Illegal instruction (core dumped)` – ierdna Jun 16 '17 at 12:19
  • @andrei can you elaborate? – Alexander Riccio Jun 16 '17 at 14:20
  • i installed lrzip on archlinux, then ran `lrzip -z document.pdf` (the 56Mb file), it started working (counting 'chunks'), counted to 18, then threw that error – ierdna Jun 16 '17 at 15:02
  • @andrei well it sounds like you hit a bug. What version is installed and if it's the latest then we'll report it. Lemme grab the link. – Alexander Riccio Jun 16 '17 at 15:10
  • @andrei turn on verbose/debug mode - it may help here. The place we're gonna submit the bug is: https://github.com/ckolivas/lrzip/issues – Alexander Riccio Jun 16 '17 at 15:26
  • I submitted a bug report. Unfortunately I can't provide the PDF file because it contains confidential medical information. – ierdna Jun 16 '17 at 15:36

7zip is more a compactor (like PKZIP) than a compressor. It's available for Linux, but it can only create compressed archives in regular files; it's not able to compress a stream, for instance. Nor can it store most Unix file attributes: ownership, ACLs, extended attributes, hard links...

On Linux, as a compressor, you've got xz that uses the same compression algorithm as 7zip (LZMA2). You can use it to compress tar archives.

As with gzip and bzip2, there's a parallel variant, pixz, that can leverage several processors to speed up compression (xz can also do this natively since version 5.2.0 with the -T option). pixz additionally supports indexing a compressed tar archive, which means it can extract a single file without having to decompress the archive from the start.
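As a sketch of both approaches (the project/ directory is hypothetical; the pixz invocation follows its documented stdin/stdout usage):

```shell
# Multi-threaded xz (version 5.2.0 or later):
tar -cf project.tar project/
xz -9 -T0 project.tar                  # -T0 = use all available cores

# pixz compresses in parallel and writes an index, so one member can be
# pulled out later without decompressing everything before it:
tar -c project/ | pixz -9 > project.tpxz
pixz -x project/README < project.tpxz | tar -x
```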


Footnote

Compacting is archiving+compressing (possibly with indexing, possibly with members compressed separately), while archiving alone doesn't imply compression. It is not a DOS term, though it may have been a French one. Searching Usenet archives, I seem to come across only articles of mine, so it could well have been my invention, though I strongly believe it's not.

ctrl-alt-delor
Stéphane Chazelas
  • How does 7zip / xz compare to bzip2? – 9000 Nov 14 '14 at 17:06
  • @9000 See [Why are tar archive formats switching to xz compression to replace bzip2 and what about gzip?](http://unix.stackexchange.com/q/108100/22565) – Stéphane Chazelas Nov 14 '14 at 17:49
  • @StéphaneChazelas your first explanation of compaction and compression says nothing. It only decodes verb endings, which any English speaker can do for themselves. I will move the better explanation to the answer. – ctrl-alt-delor Aug 26 '22 at 13:41

(Updated answer) If time doesn't matter, use ZPAQ v1.10 (or newer), e.g.:

`zpaq pvc/usr/share/doc/zpaq/examples/max.cfg file.zpaq file.tar` (the max.cfg location may vary; check your installed package's file list)

zpaq actually compressed more than `kgb -9 newFileName.kgb yourFileName.tar`. kgb is based on the older PAQ6 algorithm and is very slow. I tested with all the other compressors, like 7zip, lrzip, bzip2, kgb... and zpaq compressed the most!
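As a sketch of the full tar-then-zpaq workflow: the invocation above uses the old zpaq 1.x command syntax, while newer releases (6.x and later) changed the command line, so both are shown here (file names are illustrative):

```shell
# zpaq compresses a single file, so tar the tree first:
tar -cf file.tar src/

# zpaq 1.x syntax, as in this answer (the max.cfg path varies by distro):
zpaq pvc/usr/share/doc/zpaq/examples/max.cfg file.zpaq file.tar

# Newer zpaq releases (6.x and later) use a different command line;
# there, -m5 selects the slowest, strongest method:
zpaq a file.zpaq file.tar -m5
```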


If kgb still interests you, though (it was my initial choice for this answer, so I am keeping the information here):
Ubuntu 14.04 has kgb 1.0b4; run `sudo apt-get install kgb` to install it.

What follows concerns a Windows version of kgb that you can try to run/compile on Linux, though I did not succeed.
Version 2 beta2 can be found on SourceForge, but no Linux binaries are available. You can try running it in a console with `wine kgb2_console.exe -a7 -m9` (method -a6 -m9 seems equivalent to the best method in 1.0b4; -a7 is new in 2 beta2). I had better stability by installing .NET 2.0 with winetricks and running `wine "KGB Archiver 2 .net.exe"` (I don't much like doing that, so I will stick with the native Linux 1.0b4, which gives almost the same result as 2 beta2).
Anyway, version 2 beta2 seriously deserves a native Linux version too! Maybe something can be accomplished with MinGW (see this), but this command still fails badly: `i586-mingw32msvc-g++ kgb2_console.cpp -o kgb`. Maybe try compiling it with dmcs (Mono)? See this tip.

Aquarius Power

If you're looking for greatest size reduction regardless of compression speed, LZMA is likely your best option.

When comparing the various compressions, generally the tradeoff is time vs. size. gzip tends to compress and decompress relatively quickly while yielding a good compression ratio. bzip2 is somewhat slower than gzip both in compression and decompression time, but yields even greater compression ratios. LZMA has the longest compression time but yields the best ratios while also having a decompression rate outperforming that of bzip2.
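The tradeoff is easy to see by compressing the same file with each tool (corpus.txt is an illustrative name; -k keeps the input so the runs are comparable):

```shell
gzip  -k -9 corpus.txt    # fastest, good ratio      -> corpus.txt.gz
bzip2 -k -9 corpus.txt    # slower, better ratio     -> corpus.txt.bz2
xz    -k -9 corpus.txt    # slowest, best ratio      -> corpus.txt.xz
                          # (xz/LZMA also decompresses faster than bzip2)
ls -l corpus.txt*         # compare the resulting sizes
```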

Sources: http://bashitout.com/2009/08/30/Linux-Compression-Comparison-GZIP-vs-BZIP2-vs-LZMA-vs-ZIP-vs-Compress.html

http://tukaani.org/lzma/benchmarks.html

j883376
  • I need to disagree on this one! The lossless file compressor providing the greatest reduction factor regardless of compression speed that works on GNU/Linux is probably either `zpaq` or `paq8l`. However, they are so slow that they are impractical for most real-world usages. – Franki Nov 27 '14 at 07:15
  • @Franki cool `sudo apt-get install zpaq`, I did some tests, according to [wiki](https://en.wikipedia.org/wiki/PAQ), that app would be the newest 2009, but it still loses for [kgb](http://unix.stackexchange.com/a/167991/30352) (that uses PAQ6), but kgb is MUCH slower... – Aquarius Power May 03 '16 at 22:09
  • @Franki actually, I just found that `zpaq pvc/usr/share/doc/zpaq/examples/max.cfg file.zpaq file.tar` compresses more than `kgb -9` – Aquarius Power May 10 '16 at 02:17
  • @Franki How does `zpaq` or `paq81` compare to `lrzip`? – Alexej Magura Jan 17 '20 at 17:43

Zstandard deserves a mention. Even though it doesn't compress as well as xz at default settings, it is much faster at both compression and decompression. When Arch Linux switched from xz to zstd, they reported:

~0.8% increase in package size on all of our packages combined, but the decompression time for all packages saw a ~1300% speedup

Today I compressed a 684M text corpus with both xz and zstd. I didn't do any rigorous testing, and YMMV, but the differences are so huge that it hardly seems necessary:

  • xz took 9m36s to compress that to 71M, decompressing in 9s
  • zstd (default options) took 6s to compress it to 123M, decompressing in <2s
  • zstd -9 took 42s to compress it to 99M, again decompressing in <2s.
  • zstd -19 is slower than xz at 12m40s, but compresses even better to 70M, and still decompresses in <2s.
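The commands behind those numbers look roughly like this (corpus.txt is an illustrative name; -f overwrites earlier output so the levels can be compared):

```shell
zstd corpus.txt               # default level 3, very fast -> corpus.txt.zst
zstd -9  -f corpus.txt        # higher levels trade speed for ratio
zstd -19 -f corpus.txt        # close to xz's ratio at this point
zstd -d -f corpus.txt.zst -o restored.txt   # decompression is fast at any level
```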
unhammer

7zip is not a single compression technology; it supports several different compression methods (see the Wikipedia article on 7z).

A set of tests was performed with different tools, specifically on C source files. I'm not sure which of the tools exist for Linux, if they still exist at all. Note, however, that the best-performing algorithm was PPM with modifications (PPMII, then PPMZ).

If you are interested in the tools, you can browse the site; it's in Russian, but Google Translate may help. There is a large repository of binaries there, which you may (or may not) be able to run from Linux with Wine, if really needed.

Rui F Ribeiro
  • 55,929
  • 26
  • 146
  • 227