Pre-compressing files with Hakyll

5 August 2016 • hakyll, howto, programming

Usually, HTTP responses are compressed in real time, on the web server. That incurs some overhead for each request, but is the only way to go when dealing with dynamic content.

We in the static generation camp have some choice, though. Both Nginx and Apache can be configured to serve pre-compressed files. Apart from eliminating the overhead, this approach lets us use levels of compression that aren’t feasible in a real-time setting, but are perfectly reasonable offline.

Faster and smaller responses. What’s not to like?

Don’t obsess over compression levels, though. For files smaller than 100 KB, an extra ten percent of compression translates into mere hundreds of bytes saved. Not too impressive. (But do run your own benchmarks.)

Overall, this isn’t the top optimization there is; yet, it’s quite cheap and accessible, so why not implement it?

Let’s get our hands dirty, then. How shall we go about coding this? Well, we might start by recognizing that we want each input file to produce two outputs: one gzipped, the other not. That, in turn, immediately leads us to Hakyll’s tutorial on versioning. So if you have code along the lines of:

match "posts/*" $ do
    route   $ setExtension "html"
    compile $ pandocCompiler
          >>= loadAndApplyTemplate "templates/default.html" postCtx

…you should add a new block that will look like this:

match "posts/*" $ version "gzipped" $ do
    route   $ setExtension "html.gz"
    compile $ pandocCompiler
          >>= loadAndApplyTemplate "templates/default.html" postCtx
          >>= gzip  -- to be defined later

That will definitely work, but you probably aren’t happy about code duplication and the fact that Hakyll will now do the same work twice. Neither am I, so let’s press on!

Usually, duplication is eliminated with snapshots, but if we add saveSnapshot at the end of the first block and use loadSnapshotBody in the second, Hakyll will give us a runtime error due to a dependency cycle: the gzipped version of the item will depend on itself. Bummer!

The thing is, versions are part of identifiers. That’s only logical: to distinguish X from another X, you label one with “gzipped”, and now it’s easy to tell them apart: one is just “X”, the other is “X (gzipped)”. In Hakyll, that means that when you’re running, say, loadSnapshotBody from inside a block wrapped in version "gzipped", you’ll be requesting a snapshot of the identifier that’s labeled “gzipped”, that is, of the very item being compiled. That’s what causes the dependency cycle.
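To see that versions really are part of identifiers, here’s a minimal sketch that pokes at Hakyll.Core.Identifier directly. The path posts/example.md is made up for illustration, and I’m assuming a Hakyll release that exports identifierVersion from that module.

```haskell
import Hakyll.Core.Identifier
    (Identifier, fromFilePath, identifierVersion, setVersion)

-- Hypothetical source file, named only for this illustration.
plain :: Identifier
plain = fromFilePath "posts/example.md"

-- The same path, labelled the way `version "gzipped"` would label it.
gzipped :: Identifier
gzipped = setVersion (Just "gzipped") plain

main :: IO ()
main = do
    print (identifierVersion plain)    -- version of the unversioned identifier
    print (identifierVersion gzipped)  -- version of the labelled one
    -- Same path, different version: Hakyll treats these as different items.
    print (plain == gzipped)
    -- Dropping the version gets us back to the original identifier.
    print (setVersion Nothing gzipped == plain)
```

The last comparison is the interesting one: stripping the version from a labelled identifier yields the identifier of the original, unversioned item.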

Luckily for us, Hakyll exports functions for manipulating an identifier’s version. So our second code block will now look as follows:

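match "posts/*" $ version "gzipped" $ do
    route   $ setExtension "html.gz"
    compile $ do
        -- identifier of the item being compiled; carries version "gzipped"
        ident <- getUnderlying
        -- load the already-compiled body of the unversioned counterpart
        body <- loadBody (setVersion Nothing ident)
        makeItem body
            >>= gzip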

As you can see, we’re obtaining the current identifier (which is versioned because of version "gzipped") and modifying it so that it references the unversioned item. Note that we must use makeItem there: had we tried to gzip an item returned by load, we’d get a runtime error, because the identifier of the item we’d be returning wouldn’t have the appropriate version.

One caveat with the code above is that loadBody won’t work for files compiled with copyFileCompiler (because the latter doesn’t really copy the contents of the file into the cache, from which loadBody reads). For such files, we’ll have to use another approach:

match "images/*.svg" $ version "gzipped" $ do
    route   $ setExtension "svg.gz"
    compile $ getResourceBody
          >>= gzip

This code circumvents the problem by reading the file straight from the disk.

With versions sorted out, it’s time to turn our attention to implementing gzip. Luckily, this part is much simpler: Hakyll already provides a means for running external programs. All we have to do is convert the item’s body from String to a lazy ByteString (encoding it as UTF-8). The reason is that the compressor’s output is binary data rather than text, and might not be representable as a String, so both ends of the pipe have to deal in bytes:

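import qualified Data.ByteString.Lazy as LBS
import qualified Data.Text            as T
import qualified Data.Text.Encoding   as TE

gzip :: Item String -> Compiler (Item LBS.ByteString)
gzip = withItemBody
           (unixFilterLBS "gzip" ["--best"]
           . LBS.fromStrict
           . TE.encodeUtf8
           . T.pack)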
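If the String-to-ByteString dance looks opaque, the following sketch isolates it. toUtf8 is a name I made up for this illustration; it’s the same three-function pipeline as in gzip, minus Hakyll, and it only needs the bytestring and text packages.

```haskell
import qualified Data.ByteString.Lazy as LBS
import qualified Data.Text            as T
import qualified Data.Text.Encoding   as TE

-- | Encode a String as UTF-8 bytes (hypothetical helper name).
toUtf8 :: String -> LBS.ByteString
toUtf8 = LBS.fromStrict . TE.encodeUtf8 . T.pack

main :: IO ()
main = do
    let s = "caf\233"   -- "café", with the é written as an escape
    print (length s)                 -- number of characters: 4
    print (LBS.length (toUtf8 s))    -- number of bytes: 5
    print (LBS.unpack (toUtf8 s))    -- [99,97,102,195,169]
```

One character outside ASCII becomes two bytes here, which is why the item’s body has to be encoded explicitly before it is handed to the compressor.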

And that’s it. You can now go add that code into your site’s config and experience the major drawback of this solution, namely the fact that it requires separate rules for different filename extensions. If you have a Markdown file compiled into HTML and a bunch of SVG files that are just copied over, you’ll have to write two rules. If you find a way to scrap that boilerplate, please let me know; my email is at the end of this post.
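For reference, here is roughly how the pieces might fit together in a single site.hs. Treat it as a sketch under assumptions rather than my actual config: the templates/* rule, the unversioned images/*.svg rule, and postCtx being plain defaultContext are all filled in by me to make the file complete, and it expects a gzip binary on your PATH.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.ByteString.Lazy as LBS
import qualified Data.Text            as T
import qualified Data.Text.Encoding   as TE
import           Hakyll

main :: IO ()
main = hakyll $ do
    -- Assumed: templates live in templates/, including default.html.
    match "templates/*" $ compile templateBodyCompiler

    -- Regular, uncompressed posts.
    match "posts/*" $ do
        route   $ setExtension "html"
        compile $ pandocCompiler
              >>= loadAndApplyTemplate "templates/default.html" postCtx

    -- Gzipped copies, built from the already-compiled unversioned posts.
    match "posts/*" $ version "gzipped" $ do
        route   $ setExtension "html.gz"
        compile $ do
            ident <- getUnderlying
            body  <- loadBody (setVersion Nothing ident)
            makeItem body >>= gzip

    -- Assumed: SVGs are copied over as-is...
    match "images/*.svg" $ do
        route   idRoute
        compile copyFileCompiler

    -- ...so their gzipped copies read the source file from disk instead.
    match "images/*.svg" $ version "gzipped" $ do
        route   $ setExtension "svg.gz"
        compile $ getResourceBody >>= gzip

-- Assumed context; substitute whatever your posts really use.
postCtx :: Context String
postCtx = defaultContext

-- Pipe the UTF-8 encoded body through the external gzip program.
gzip :: Item String -> Compiler (Item LBS.ByteString)
gzip = withItemBody
           (unixFilterLBS "gzip" ["--best"]
           . LBS.fromStrict
           . TE.encodeUtf8
           . T.pack)
```

Note how the two gzipped rules differ only in the pattern, the extension, and the way the body is obtained; that’s the boilerplate I’m complaining about.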

Whoa, you got through that meandering mess of an article! That makes two of us. As a reward, I’m going to tell you how to use Zopfli to gzip your files. It’s the best DEFLATE compressor out there, and using it goes against my earlier advice of not obsessing over the byte count, but whatever; it’s fun.

So, Zopfli. The trouble with that compressor is that it’s not a Unix filter: it doesn’t accept data on standard input. In order to use it, we have to write the item’s body into a temporary file, compress that, then read the result back. Fortunately, Zopfli supports writing the result to stdout; that allows us to make do with the safer of the functions provided by Hakyll. (If that weren’t the case, we’d have to resort to unsafeCompiler.) So here’s the code:

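gzip :: Item String -> Compiler (Item LBS.ByteString)
gzip item = do
  (TmpFile tmpFile) <- newTmpFile "gzip.XXXXXXXX"
  -- dump the item's body into the temporary file; tee's echo is discarded
  _ <- withItemBody
      (unixFilter "tee" [tmpFile])
      item
  body <- unixFilterLBS
              "zopfli"
              [ "-c"      -- write result to stdout
              , tmpFile]
              LBS.empty   -- no need to feed anything on stdin

  makeItem body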

Simple, right? If you’re using anything less than --i100, though, consider 7-zip—at its best (-mx9) it’s very close to default Zopfli, but 7z is wa-a-ay faster, and can behave as a filter (use -si -so).

P.S. Right before publishing this post, I was searching for some other Hakyll-related stuff and stumbled upon a three-year-old conversation on the mailing list that covers everything but the Zopfli bit. Search engines will kill blogging.

Your thoughts are welcome by email
(here’s why my blog doesn’t have a comments form)