Streaming Unzip with Go & AWS Lambda

By piaras

We talk sometimes about the consolations of philosophy and I think a similar notion can apply to the practice of writing code. I reckon I’m not the only one who derives some measure of contentment from spiking a simple tool or technique away from the pressures and realities of the boiler-room.

Of course, working a solution through from end to end should yield some lessons you can carry forward to improve your general skill-set. But shouldn’t a practice entail more than constant, repetitious circuits of self-improvement?

The process of completing a discrete, self-contained chunk of software might also reward you with a tasty slice of validation pie. Getting something simple finished and working can be an end in itself; the process can be comforting and reassuring, and it may help tend and sustain your love for the craft. This short write-up is the product of such code consolations.

If you find yourself working on an application that requires users to submit packages of files for ingest into a system, then Zip archives and an object store such as S3 are a natural (and user-friendly) choice. But once a user has submitted the ingest package, what next? It feels ugly to download the entire thing to disk, unzip it and then re-upload each file to the object store. Double ugly if the only local operation we perform is the decompression of the archive.

Let’s look at a quick method for streaming decompression of a zip archive using Go, and for the sheer hell of it we’ll throw deployment to AWS Lambda into the bargain. The streaming approach works particularly well in a serverless context owing to the memory quota applied to Lambda functions. Although that ceiling has recently been raised pretty significantly, if we can unzip the file using a pipe we shouldn’t have to worry about memory constraints at all. (It’s worth pointing out that this approach works equally well in a VM or containerized runtime and offers the same benefits.)

The guts of our code will consist of two functions running concurrently: one to read data from a remote file and write it to a pipe; and a second function that will read data from the pipe, process it and write the result to a remote file.
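In outline, the wiring might look something like the following sketch, where download and process are hypothetical helpers fleshed out in the rest of this post, and the bucket and key names are placeholders:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	sess := session.Must(session.NewSession())

	// download streams the remote zip into a pipe from its own
	// goroutine; process consumes the pipe, decompressing and
	// re-uploading each entry as it arrives.
	pr := download(sess, "ingest-bucket", "packages/upload.zip")
	if err := process(sess, pr); err != nil {
		log.Fatal(err)
	}
}
```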

We can leverage the AWS Golang SDK to manage the upload and download of files and io.Pipe from the Go standard library to transit the bits from one location in our program to another. However, in order to use io.Pipe with the AWS SDK we need to ensure our pipe satisfies the io.WriterAt interface, which we can achieve thusly:
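A minimal sketch of the wrapper, assuming the v1 AWS SDK for Go (the type name FakeWriterAt is our own):

```go
import "io"

// FakeWriterAt satisfies io.WriterAt by discarding the offset and
// writing sequentially to the wrapped io.Writer. This is only safe
// if writes arrive in order, so the download manager's concurrency
// must be set to 1.
type FakeWriterAt struct {
	w io.Writer
}

func (fw FakeWriterAt) WriteAt(p []byte, offset int64) (n int, err error) {
	// Ignore the offset; the pipe only supports sequential writes.
	return fw.w.Write(p)
}
```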

Tip of the hat to Dávid Mikuš for this insight

Here we wrap an io.Writer with a custom WriteAt method, which lets us hand our io.PipeWriter safely to the S3 download manager:
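A sketch of the download half, under the same assumptions. Note the Concurrency setting: without it, parts could arrive out of order and ignoring the offset would corrupt the stream.

```go
import (
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func download(sess *session.Session, bucket, key string) *io.PipeReader {
	pr, pw := io.Pipe()

	// Concurrency must be 1 so the downloaded parts are written to
	// the pipe in order, making it safe to discard the offset.
	downloader := s3manager.NewDownloader(sess, func(d *s3manager.Downloader) {
		d.Concurrency = 1
	})

	go func() {
		_, err := downloader.Download(FakeWriterAt{w: pw}, &s3.GetObjectInput{
			Bucket: aws.String(bucket),
			Key:    aws.String(key),
		})
		// Closing with err (nil on success) unblocks the reader; a nil
		// error surfaces as a clean io.EOF on the read side.
		pw.CloseWithError(err)
	}()

	return pr
}
```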

Using AWS S3 download manager with io.Pipe

Now that there is data in the pipe we can start processing it. The zipstream package takes an io.Reader, meaning we can plug the io.PipeReader straight in and start reading from the pipe:
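Assuming the github.com/krolaw/zipstream package, iterating over the archive’s entries might look like this (upload is a hypothetical helper sketched next):

```go
import (
	"io"

	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/krolaw/zipstream"
)

func process(sess *session.Session, pr *io.PipeReader) error {
	zr := zipstream.NewReader(pr)

	for {
		header, err := zr.Next()
		if err == io.EOF {
			break // no more entries in the archive
		}
		if err != nil {
			return err
		}
		// zr now yields the decompressed bytes of the current entry;
		// the upload step streams them straight back to S3.
		if err := upload(sess, header, zr); err != nil {
			return err
		}
	}
	return nil
}
```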

Reading data from io.Pipe using zipstream

The zipstream package provides us with the header and bytes for each file. We can use these to construct the upload request for S3:
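A sketch of the final leg. The destination bucket name is a placeholder, and in a real function you would likely create the uploader once and reuse it across entries:

```go
import (
	"archive/zip"
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func upload(sess *session.Session, header *zip.FileHeader, body io.Reader) error {
	uploader := s3manager.NewUploader(sess)

	// The upload manager reads the entry in parts as it arrives from
	// the pipe, so the whole file never has to fit in memory at once.
	_, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("destination-bucket"), // placeholder
		Key:    aws.String(header.Name),
		Body:   body,
	})
	return err
}
```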

Writing data from zipstream Reader to S3 via upload manager