March 2015 – Evan Nemerson

A while back I started working on a project called Squash, and today I’m pleased to announce the first release, version 0.5.

Squash is an abstraction layer for general-purpose data compression (zlib, LZMA, LZ4, etc.). It is based on dynamically loaded plugins, and there are a lot of them (currently 25 plugins to support 42 different codecs, though 2 plugins are currently disabled pending bug fixes from their respective compression libraries), covering a wide range of compression codecs with vastly different performance characteristics.

The API isn’t final yet (hence version 0.5 instead of 1.0), but I don’t think it will change much. I’m rolling out a release now in the hope that it encourages people to give it a try, since I don’t want to commit to API stability until a few people have given it a try. There is currently support for C and Vala, but I’m hopeful more languages will be added soon.

So, why should you be interested in Squash? Well, because it allows you to support a lot of different compression codecs without changing your code, which lets you swap codecs with virtually no effort. Different algorithm perform very differently with different data and on different platforms, and make different trade-offs between compression speed, decompression speed, compression ratio, memory usage, etc.

One of the coolest things about Squash is that it makes it very easy to benchmark tons of different codecs and configurations with your data, on whatever platform you’re running. To give you an idea of what settings might be interesting to you I also created the Squash Benchmark, which tests lots of standard datasets with every codec Squash supports (except those which are disabled right now) at every preset level on a bunch of different machines. Currently that is 28 datasets with 39 codecs in 178 different configurations on 8 different machines (and I’m adding more soon), for a total of 39,872 different data points. This will grow as more machines are added (some are already in progress) and more plugins are added to Squash.

There is a complete list of plugins on the Squash web site, but even with the benchmark there is a pretty decent amount of data to sift through, so here are some of the plugins I think are interesting (in alphabetical order):

bsc: libbsc targets very high compression ratios, achieving ratios similar to ZPAQ at medium levels, but it is much faster than ZPAQ. If you mostly care about compression ratio, libbsc could be a great choice for you.
DENSITY: DENSITY is fast. For text on x86_64 it is much faster than anything else at both compression and decompression. For binary data decompression speed is similar to LZ4, but compression is faster. That said, the compression ratio is relatively low. If you are on x86_64 and mostly care about speed DENSITY could be a great choice, especially if you’re working with text.
LZ4: You have probably heard of LZ4, and for good reason. It has a pretty good compression ratio, fast compression, and very fast decompression. It’s a very strong codec if you mostly care about speed, but still want decent compression.
LZHAM: LZHAM compresses similarly to LZMA, both in terms of ratio and speed, but with faster decompression.
Snappy: Snappy is another codec you’ve probably heard of. Overall, performance is pretty similar to LZ4—it seems to be a bit faster at compressing than LZ4 on ARM, but a bit slower on x86_64. For compressing small pieces of data (like fields.c from the benchmark) nothing really comes close. Decompression speed isn’t as strong, but it’s still pretty good. If you have a write-heavy application, especially on ARM or with small pieces of data, Snappy may be the way to go.

If you’re like me, when you download a project and want to build it the first thing you do is look for a configure script (or maybe ./autogen.sh if you are building from git). Lots of times I don’t bother reading the INSTALL file, or even the README. Most of the time this works out well, but sometimes there is no such file. When that happens, more often than not there is a CMakeLists.txt, which means the project uses CMake for its build system.

The realization that that the project uses CMake is, at least for me, quickly followed by a sense of disappointment. It’s not that I mind that a project is using CMake instead of Autotools; they both suck, as do all the other build systems I’m aware of. Mostly it’s just that CMake is different and, for someone who just wants to build the project, not in a good way.

First you have to remember what arguments to pass to CMake. For people who haven’t built many projects with CMake before this often involves having to actually RTFM (the horrors!), or a consultation with Google. Of course, the project may or may not have good documentation, and there is much less consistency regarding which flags you need to pass to CMake than with Autotools, so this step can be a bit more cumbersome than one might expect, even for those familiar with CMake.

After you figure out what arguments you need to type, you need to actually type them. CMake has you define variables using -DVAR=VAL for everything, so you end up with things like -DCMAKE_INSTALL_PREFIX=/opt/gnome instead of --prefix=/opt/gnome. Sure, it’s not the worst thing imaginable, but let’s be honest—it’s ugly, and awkward to type.

Enter configure-cmake, a bash script that you drop into your project (as configure) which takes most of the arguments configure scripts typically accept, converts them to CMake’s particular style of insanity, and invokes CMake for you. For example,

./configure --prefix=/opt/gnome CC=clang CFLAGS="-fno-omit-frame-pointer -fsanitize=address"

Will be converted to

cmake . -DCMAKE_BUILD_TYPE=Debug -DCMAKE_INSTALL_PREFIX=/opt/gnome -DCMAKE_INSTALL_LIBDIR=/opt/gnome/lib -DCMAKE_C_COMPILER=clang -DCMAKE_C_FLAGS="-fno-omit-frame-pointer -fsanitize=address"

Note that it assumes you’re including the GNUInstallDirs module (which ships with CMake, and you should probably be using). Other than that, the only thing which may be somewhat contentious is that it adds -DCMAKE_BUILD_TYPE=Debug—Autotools usually builds with debugging symbols enabled and lets the package manager take care of stripping them, but CMake doesn’t. Unfortunately some projects use the build type to determine other things (like defining NDEBUG), so you can get configure-cmake to pass “Release” for the build type by passing it <code>–disable-debug</code>, one of two arguments that don’t mirror something from Autotools.

Sometimes you’ll want to be able to pass non-standard argument to CMake, which is where the other argument that doesn’t mirror something from Autotools comes in; --pass-thru (--pass-through, --passthru, and --passthrough also work), which just tells configure-cmake to pass all subsequent arguments to CMake untouched. For example:

./configure --prefix=/opt/gnome --pass-thru -DENABLE_AWESOMENESS=yes

Of course none of this replaces anything CMake is doing, so people who want to keep calling cmake directly can.

So, if you maintain a CMake project, please consider dropping the configure script from configure-cmake into your project. Or write your own, or hack what I’ve done into pieces and use that, or really anything other than asking people to type those horrible CMake invocations manually.

Month: March 2015

Squash 0.5 and the Squash Benchmark

Making CMake more user-friendly