On Halloween this year I learned two scary things. The first is that a young toddler can go trick-or-treating in your apartment building and acquire a huge amount of candy. When they are this young they have no interest in the candy itself, so you are left having to eat it all yourself.
The second scary thing is that in the heart of the ubiquitous IMAP protocol lingers a ghost of the time before UTF-8. Its name is Modified UTF-7.
UTF-7 is described in RFC 2152. It lets you encode all of Unicode, much like the other UTF encoding schemes, though it adds a neat property: it only uses printable ASCII characters to do it. Unfortunately you pay a price: it is complicated and inefficient.
First, most ASCII characters are represented by themselves.
The important exception is the shift character "+", which we now write as "+-".
Any sequence of non-ASCII characters (or of the disallowed ASCII characters "\" and "~")
is first converted to UTF-16BE, then encoded as base64,
and placed between a "+" and a "-".
(Even though this is 2018, occasionally someone will try to claim in conversation with me that UTF-16 is better than UTF-8. The obvious response is to point to the surrogate pairs mess, but many people defending UTF-16 don't realize those are necessary. I have found I can skip over the long explanation of surrogates by simply asking: "do you mean UTF-16LE or UTF-16BE?")
There is something immediately appealing about this definition of UTF-7. You can describe it in three sentences, it is built on the popular encoding scheme base64, and it is ASCII printable.
"Hello, 世界" (UTF-8)
"Hello, \u4E16\u754C" (ASCII with Unicode hex literals)
"Hello, +ThZ1TA-" (UTF-7)
UTF-7 is not a particularly appealing wire format.
In the example above UTF-7 uses 8 bytes to represent what UTF-8 does
in 6 bytes.
It becomes even less efficient if ASCII is regularly mixed in
with non-ASCII code points, as we need to constantly add escape
characters.
And while it is ASCII printable, the printing is inscrutable.
Decoding "ThZ1TA" back to anything is beyond my mind, so I may
as well use something non-printable.
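A quick way to check these numbers is Python's built-in "utf-7" codec. This is only a sketch, and the choice of Python is an assumption; nothing in the argument depends on it.

```python
# Compare the encodings of the example string using Python's built-in codecs.
text = "Hello, 世界"
non_ascii = "世界"

utf8 = non_ascii.encode("utf-8")   # three bytes per character for these two
utf7 = non_ascii.encode("utf-7")   # "+", base64 of the UTF-16BE bytes, "-"

print(len(utf8), utf8)
print(len(utf7), utf7)

# The full example from above decodes back to the original string.
roundtrip = b"Hello, +ThZ1TA-".decode("utf-7")
print(roundtrip == text)

# A literal "+" is the shift character, so it has to be written as "+-".
plus = "1+1".encode("utf-7")
print(plus)
```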
To make matters worse, this is not base64. It is modified base64.
The base64 padding character "=" cannot appear in UTF-7.
To avoid it the RFC tells us to pad the UTF-16BE with zero bits
until you reach a length that can be base64 encoded without padding:
Next, the octet stream is encoded by applying the Base64 content transfer encoding algorithm as defined in RFC 2045, modified to omit the "=" pad character. Instead, when encoding, zero bits are added to pad to a Base64 character boundary. When decoding, any bits at the end of the Modified Base64 sequence that do not constitute a complete 16-bit Unicode character are discarded.
That sounds fishy.
Base64 encodes every block of 3 bytes to 4 bytes.
If the length of what you are encoding is not divisible by 3, then
what you have is encoded and the base64 string is padded with "="
until its length is divisible by four.
This means you may get up to two "=" characters at the end of
a base64 string.
If we are going to pad the input as the RFC suggests so that we
never use "=", we may have to pad up to two bytes of input with
zeros.
That would form a valid UTF-16 NULL!
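The worry is easy to reproduce. The sketch below assumes the padding is read as whole zero bytes, as in the paragraph above, and uses Python's standard base64 module purely for illustration.

```python
import base64

raw = "世界".encode("utf-16-be")           # 4 bytes: 4E 16 75 4C
padded_input = raw + b"\x00\x00"           # pad the input to a multiple of 3 bytes

encoded = base64.b64encode(padded_input)   # 6 input bytes, so no "=" is needed
decoded = base64.b64decode(encoded).decode("utf-16-be")

print(encoded)
print(repr(decoded))  # the two padding bytes decode as an extra U+0000
```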
So how do we handle this padding?
I looked inside three UTF-7 encoders and found they don't follow
the RFC at all on this.
Instead, they encode the UTF-16 to modified base64 without any
zero bit padding, and then remove any base64 "=" padding from
the output.
This works, and it produces shorter results than the RFC process, with no ambiguous NULL. But it sure would be nice if someone had documented it.
To explain with an example, the initial base64 output for the string
"世界" is "ThZ1TA==".
We removed the trailing "==" to produce UTF-7.
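A minimal sketch of that encode-then-strip approach, in Python. The helper names modified_b64encode and modified_b64decode are hypothetical, made up for this illustration; they are not taken from any of the encoders mentioned above.

```python
import base64

def modified_b64encode(text: str) -> str:
    """Base64 of the UTF-16BE bytes, with any trailing "=" removed."""
    raw = text.encode("utf-16-be")
    return base64.b64encode(raw).decode("ascii").rstrip("=")

def modified_b64decode(chunk: str) -> str:
    """Restore the "=" padding that was stripped, then decode as usual."""
    padded = chunk + "=" * (-len(chunk) % 4)
    return base64.b64decode(padded).decode("utf-16-be")

standard = base64.b64encode("世界".encode("utf-16-be"))
print(standard)                       # ordinary base64, padding included
print(modified_b64encode("世界"))     # same thing with the padding removed
print(modified_b64decode("ThZ1TA"))
```

The decoder can always work out how much padding was dropped, because the stripped length modulo four determines it.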
UTF-7 is no more. It has long since been replaced in SMTP, and in MIME headers, where many encodings can be used, people choose other things. However, a modified version is still used in IMAP. RFC 3501 describes it:
Modified base64 is modified further: in the encoded alphabet, "/" is replaced by ",". This is neither the standard nor the URL base64 encoding scheme you have seen before.
The ASCII characters "\" and "~" no longer need to be encoded. In fact, they MUST not be encoded.
The escape character is now "&" instead of "+", so a literal "&" is written as "&-".
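Put together, the IMAP rules can be sketched in a few lines of Python. This is an illustration under assumptions, not a reference implementation: the function names imap_utf7_encode and imap_utf7_decode are hypothetical, the decoder does no validation, and it treats every character outside printable ASCII (0x20 to 0x7E) as needing the base64 form.

```python
import base64

def _shifted(run: str) -> str:
    """Encode a run of non-ASCII text as "&" + modified base64 + "-"."""
    raw = run.encode("utf-16-be")
    # altchars keeps "+" and swaps "/" for ",".
    b64 = base64.b64encode(raw, altchars=b"+,").decode("ascii")
    return "&" + b64.rstrip("=") + "-"

def imap_utf7_encode(name: str) -> str:
    out, pending = [], []
    for ch in name:
        if 0x20 <= ord(ch) <= 0x7E:
            if pending:                       # close the current base64 run
                out.append(_shifted("".join(pending)))
                pending = []
            out.append("&-" if ch == "&" else ch)
        else:
            pending.append(ch)
    if pending:
        out.append(_shifted("".join(pending)))
    return "".join(out)

def imap_utf7_decode(encoded: str) -> str:
    out, i = [], 0
    while i < len(encoded):
        if encoded[i] != "&":
            out.append(encoded[i])
            i += 1
            continue
        end = encoded.index("-", i)           # raises ValueError if unterminated
        chunk = encoded[i + 1:end]
        if chunk == "":
            out.append("&")                   # "&-" is a literal ampersand
        else:
            padded = chunk + "=" * (-len(chunk) % 4)
            raw = base64.b64decode(padded, altchars=b"+,")
            out.append(raw.decode("utf-16-be"))
        i = end + 1
    return "".join(out)

print(imap_utf7_encode("Hello, 世界"))
print(imap_utf7_decode("~peter/mail/&U,BTFw-/&ZeVnLIqe-"))
```

The second call uses the mixed-language mailbox name given as an example in RFC 3501; note the "," where ordinary base64 would have produced "/".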
So now we have modified-modified-base64 and our example above reads:
"Hello, &ThZ1TA-" (Modified UTF-7)
A simpler future
IMAP is a living protocol with many RFCs adding extensions.
One of those is RFC 6855,
which lets a server and client negotiate UTF8=ACCEPT
and drop all the UTF-7.
It even includes a negotiation mode for the future where servers
advertise UTF8=ONLY and refuse to talk any UTF-7 with clients.
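On the client side, the first step is just reading the server's capability list. The sketch below is a hypothetical helper, utf8_mode, that classifies a CAPABILITY response line; the sample line is invented for illustration, and a real client would still have to issue ENABLE UTF8=ACCEPT before sending any UTF-8.

```python
from typing import Optional

def utf8_mode(capability_line: str) -> Optional[str]:
    """Return "only", "accept", or None for a CAPABILITY response line."""
    caps = {token.upper() for token in capability_line.split()}
    if "UTF8=ONLY" in caps:
        return "only"      # server refuses to talk Modified UTF-7
    if "UTF8=ACCEPT" in caps:
        return "accept"    # UTF-8 is available once the client enables it
    return None            # stuck with Modified UTF-7

# Invented sample response, not captured from a real server.
print(utf8_mode("* CAPABILITY IMAP4rev1 ENABLE UTF8=ACCEPT"))
```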
Hopefully we can get there.