How does JPEG actually work?

By MSDN Archive

JPEG is an image encoding designed to compress photographs and similar images effectively, often 5 to 15 times over a raw bitmap format. It's a lossy format that exploits properties of human vision to eliminate information that is difficult to distinguish. You see JPEGs all the time, and if you've seen a JPEG compressed with a low quality level, you have a good idea of the kinds of visual artifacts that emerge when JPEG compression is pushed to its limits. But how does it

really work, and how does this cause JPEG artifacts?

The JPEG standard defines a number of different encoding options, and there are a number of container file formats defined other than the usual JFIF, but I'm going to discuss a single popular JPEG encoding method used by many applications. This encoding combines three different completely independent methods of image compression: downsampling the chrominance channels, quantization of the discrete

cosine transformation, and entropy coding.

Method 1: Downsampling the chrominance channels

The most obvious way to make an image file smaller is to make the image smaller - for example, a 100×100 bitmap is 1/4 the size of a 200×200 bitmap. You can expand the image back out to the original size when viewing it by simply upsampling it. This isn't a great compression method, since it causes clearly visible artifacts, usually blocky or

blurry effects, especially around detailed, sharp areas.

A different approach is instead of downsampling the entire image, to only downscale certain channels of the image. For example, the eye's red-green sensors are more numerous and sensitive than its blue-yellow sensors, so if we shrink just the blue channel and later re-expand it, the effect is much less noticable. But the RGB color space is not our only choice - images can be encoded in many color systems, and for each of these the channels we choose to downsample will have different

visual effects. This experiment

by Eric Setton of Stanford University demonstrates the effect on image

quality of downsampling various channels in various color systems.

The results are visually obvious: we are sensitive to red and green, but even more sensitive to intensity or luminance, the relative brightness of pixels compared to other nearby pixels. This is expressed by the Y component of the YCbCr system and the V component of the HSV system. Any downsampling of this information is visually disastrous. On

the other hand, downsampling the chrominance, expressed by the

Cb and Cr components of the YCbCr system, has almost no visual effect

(Cb and Cr stand for "chrominance blue" and "chrominance red" - see a sample image broken up into YCbCr channels in the middle of this page).

For this reason, typical JPEG files convert images to the YCbCr color space, and then downsample both the Cb and Cr channels by a factor of 2 in both width and height. Further processing of the image proceeds on

each of the three channels independently.

This is the method in JPEG encoding largely responsible for the artifact known as "color bleeding", where along sharp edges the color on one side can "bleed" onto the other side. This is because the chrominance channels, which express the color of pixels, have had each block of 4 pixels averaged into a single color, and some of these

blocks cross the sharp edge.

Method 2: Quantization of the discrete cosine transformation

Normally, we view an image as simply a grid of pixels, and if we alter a pixel or group of pixels dramatically, the effect will be visually obvious. A different approach to storing images is to come up with a set of "basis" images which can be added together in various combinations to form the image we want. JPEG encodes each 8-by-8 square

of pixels using the following basis image set of 64 images:

We assign each of these images a relative weight determining how much influence it has on the image. This weight can be positive or negative (when negative, the basis image is subtracted). Taken together, these 64 weights can describe any 8×8 image at all, and they

describe the image precisely. This example at Wikipedia

shows an 8×8 image, its original pixel array, and its new 8×8 array of weights after being converted using the discrete cosine transform (which I'll discuss later).

But what's the point of transforming 64 pixel values into 64 weights? For one thing, the resulting data may exhibit more structure, and so be easier to compress. For example, with blocks from typical photographs, the basis images in the upper left above have large

weights and the ones in the lower right have small weights.

But to store the weights, which are real numbers, we have to encode them somehow. The simplest encoding is to just convert them to integers

between -k and k by scaling and rounding them. This works, but there's a tradeoff in our choice of k:

if it's too big, the encoded image will take too much space, but if it's too small, the rounding might visibly distort the image, since it's effectively altering the weights.

A more effective approach is to use a different k for each basis image. This works well because the visual difference caused by variations in the weights of different basis images is different. The eye can detect small variations of the weights of the images in the upper-left, but can overlook much larger variations in the weights of

the images in the lower-right, so we can get away with smaller k and larger rounding error for these weights.

In practice, rather than deal with expensive floating-point numbers, we round the initial weights to integers and later divide them by fixed

factors called the quantization factors using integer division.

During decoding, we multiply them by these factors again. The upper-left images have small quantization factors, so that little information is lost in this process, while the lower-right images have

larger factors. This example shows a commonly used matrix of quantization

factors, one for each basis image, and the result of quantizing the sample image using them. Observe how much simpler the matrix becomes - it now contains a large number of entries that are small or zero,

making it much easier to compress. This example

shows how the weight values expand back out during decoding - notice that although some of the weights are now different, the visual difference is almost undetectable.

Quantization is the primary source of JPEG artifacts. Because the images in the lower-right tend to have the largest quantization divisors, JPEG artifacts will tend to resemble combinations of these images. The matrix of quantization factors can be directly controlled by altering the JPEG's "quality level", which scales its values up or down.

Quantization compresses the image in two important ways: one, it limits the effective range of the weights, decreasing the number of bits required to represent them. Two, many of the weights become identical or zero,

improving compression in the third step, entropy coding.

But this discussion leaves one important question unanswered: how do we take an arbitrary square of 8×8 pixels and decompose it into a combination of the basis images? As it turns out, we specifically chose the images above in a way to make this easy. These images are formed by cosine waves of various frequencies in the x and y directions, and a

transformation called the discrete cosine transform from Fourier transformation theory allows us to rapidly decompose a signal into these parts in only O(n log n)

time. Because of the importance of JPEGs, over the years very fast and clever specialised solutions for 8×8 blocks have been created in both software and hardware. This is also the reason we use small, constant-sized blocks instead of processing the image as a whole: the cosine transform step doesn't need to scale up to large bitmaps, so the

transformation as a whole can run in linear (O(n)) time.

Method 3: Entropy coding

Unlike the other steps, entropy coding is lossless. If you've studied the simple compression techniques used in ZIP files and BMP

files, the techniques of entropy coding should come as no surprise.

We start by turning our 8×8 grid of integers into a linear sequence of integers. The obvious row-major and column-major orders are not ideal here, since we expect there to be many zeros in the lower-right area. Instead, we start with the top-left corner and zig-zag in a diagonal pattern across the matrix, going back and forth until we reach

the lower-right corner.

Next, we apply the same run-length encoding algorithm used by GIF files to condense sequences of identical integer values. In areas where the values are small, we can expect many such runs, particularly runs of zeros. We use a special code to say "all the remaining values are

zero", which comes in handy.

Finally, we apply Huffman compression to the remaining sequence, which is a standard lossless compression technique that encodes

frequently used values using less bits than infrequently used values.


And that's it! For each YCbCr channel, for each 8×8 block, we perform these transformations, and we store the results in a bitstream. This data is wrapped up in a JFIF container giving metadata about the image, and the file goes out ready for web browsers to view, completely unaware of all the data that has been lost. Just don't go sending JPEGs

into space and expect aliens to see them the same way.

Finally, no discussion of the JPEG encoding would be complete without noting that JPEG is based on research ideas that are now very

old and largely supplanted. The more modern JPEG 2000 encoding

based on wavelets can be pushed much farther before visible artifacts appear, and includes a surprisingly efficient lossless mode. Of course, all this improvement is rather moot until common applications actually

support it.