String’s ABI and UTF-8


We just landed String’s proposed final ABI on master. This ABI includes some significant changes, the primary one being that native Swift strings are stored as UTF-8 where they were previously stored either as ASCII or UTF-16 depending on their contents. NSStrings are still lazily bridged in to String without copying.

This does not immediately surface in the API, but allows for some important performance wins and gives a more consistent basis for future APIs providing efficient and direct access to the underlying code units. UTF-8 is a variable-width Unicode encoding built from one-byte code units, and it is the preferred encoding for interoperating with C, systems programming, server-side programming, scripting, client-side programming, and tools that process source code and textual formats.

Performance

Unifying the storage representation for ASCII and Unicode-rich strings gives us a lot of performance wins. These wins are an effect of several compounding factors including a simpler model with less branching, on-creation encoding validation of native Strings (enabled by a faster validator), a unified implementation code path, a more efficient allocation and use of various bits in the struct, etc.

C Interoperability

By maintaining nul-termination in our storage, interoperability with C is basically free: we just use our pointer. This means that myString.withCString { ... } no longer needs to allocate, transcode, and later free its contents in order to supply the closure with a C-compatible string.
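As a rough illustration of the call shape, here is a minimal sketch of handing a native string to a C function. The sample string and the choice of strlen are only for demonstration; nothing here is specific to the new branch.

```swift
import Foundation  // brings in the C library's strlen

// "héllo": the é (U+00E9) takes two bytes in UTF-8, so six bytes in total.
let greeting = "h\u{E9}llo"

// The closure receives a nul-terminated UTF-8 pointer that is only valid
// for the duration of the call.
let byteCount = greeting.withCString { cString in
    strlen(cString)
}

print(byteCount)
```

The byte count reported by strlen matches greeting.utf8.count, since the C string is the string's UTF-8 contents followed by a nul terminator.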

Quantifying this improvement as an “n times faster” ratio: it’s either millions of times faster or “error: division by zero” times faster, depending on how you measure.

Decoding

Walking over and decoding the Unicode scalar values that comprise a string is much more efficient now.

Strings of Chinese characters are traditionally a worst-case scenario for UTF-8 decoding performance relative to UTF-16, as UTF-8 resorts to a multi-byte encoding sequence while UTF-16 just stores the scalar value directly as a code unit. This is even worse in reverse, because a continuation-byte in UTF-8 does not communicate the distance to the start of the scalar.

But, this isn’t really an issue: on modern CPUs this increase in encoding complexity is more than offset by the decrease in model complexity by having a unified storage representation.

Walking the Unicode scalar values forwards on Chinese text is now over 3x faster than before and walking in reverse (harder) is now over 2x faster. ASCII benefits even more, despite the old model having a dedicated storage representation and fast paths for ASCII-only strings.
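To make the encoding difference concrete, here is a small sketch that walks the scalars of a short Chinese string in both directions and compares code unit counts. The sample text is an arbitrary choice for illustration; it is not one of the benchmark inputs.

```swift
// 你好世界, written with escapes so the scalars are unambiguous.
let text = "\u{4F60}\u{597D}\u{4E16}\u{754C}"

// Decode every Unicode scalar value, forwards and in reverse.
let forward = text.unicodeScalars.map { $0.value }
let backward = text.unicodeScalars.reversed().map { $0.value }

// Each of these scalars needs three UTF-8 code units but one UTF-16 code unit.
let utf8Units = text.utf8.count
let utf16Units = text.utf16.count

print(forward, backward, utf8Units, utf16Units)
```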

Small UTF-8 Strings

Swift 4.2 introduced a small-string representation on 64-bit platforms for strings of up to 15 ASCII code units in length that stores the value directly in the String struct without requiring an allocation or memory management. With a unified code path that supports UTF-8, we’re able to enhance small strings to support up to 15 UTF-8 code units in length. This means your most important non-ASCII strings, such as "smol 🐶! 😍", can in fact be smol.

We also added small-string support on 32-bit platforms, where we pack in strings of up to 10 UTF-8 code units directly into the String struct.
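Whether a string is stored in small form is not observable through public API, but the size limits are easy to check by counting UTF-8 code units. In the sketch below, fitsSmallForm is a hypothetical helper written for this example, not a standard library function.

```swift
// Hypothetical helper, not a standard library API: compares a string's UTF-8
// length against the small-form capacities quoted above.
func fitsSmallForm(_ string: String, capacity: Int) -> Bool {
    return string.utf8.count <= capacity
}

// "smol 🐶! 😍": 7 ASCII bytes plus two 4-byte emoji.
let smol = "smol \u{1F436}! \u{1F60D}"

let fitsOn64Bit = fitsSmallForm(smol, capacity: 15)
let fitsOn32Bit = fitsSmallForm(smol, capacity: 10)

print(smol.utf8.count, fitsOn64Bit, fitsOn32Bit)
```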

Miscellaneous

Operations over the UTF-8 view are (obviously) dramatically faster on native Swift strings: ~10x depending on the nature of the operation.

Character-based String modifications, such as String.insert(_:at:) with a Character, are around 5-10x faster.

Improved normality checking makes String hashing 2-4x faster when the contents are already in NFC (which is the case most of the time).
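For context on why normalization matters to hashing: String equality is based on canonical equivalence, so two strings with different code units can be equal and must then hash equally. A minimal sketch, with sample strings chosen only for illustration:

```swift
let precomposed = "caf\u{E9}"    // é as a single scalar; already in NFC
let decomposed = "cafe\u{301}"   // e followed by a combining acute accent

// Canonically equivalent strings compare equal and therefore hash equal,
// even though their UTF-8 contents differ.
let sameString = (precomposed == decomposed)
let sameHash = (precomposed.hashValue == decomposed.hashValue)
let sameBytes = (Array(precomposed.utf8) == Array(decomposed.utf8))

print(sameString, sameHash, sameBytes)
```

When the contents are already in NFC, as with the first string, the hasher can skip the normalization work, which is where the speedup comes from.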

Creating a String from UTF-8 contents, à la String(decoding: codeUnits, as: UTF8.self), is around 5-6x faster.
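For reference, here is a short sketch exercising the operations mentioned above: creating a String from UTF-8 bytes, reading the UTF-8 view, and a Character-based insertion. The byte values are just sample data.

```swift
// "café" as UTF-8 code units.
let codeUnits: [UInt8] = [0x63, 0x61, 0x66, 0xC3, 0xA9]
var word = String(decoding: codeUnits, as: UTF8.self)

// Invalid input is repaired with U+FFFD rather than rejected.
let repaired = String(decoding: [0x61, 0xFF] as [UInt8], as: UTF8.self)

// The UTF-8 view hands the same code units back.
let roundTripped = Array(word.utf8)

// Character-based modification.
word.insert("!", at: word.endIndex)

print(word, repaired, roundTripped)
```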

Efficient Cocoa Interoperability

Efficient interoperability with Cocoa is a huge selling point for Swift, and strings are lazily bridged to Objective-C. String’s storage class is a subclass of NSString at runtime, and thus has to answer APIs assuming constant-time access to UTF-16 code units. We solved this with a breadcrumbing strategy: upon first request from one of these APIs on large strings, we perform a fast scan of the contents to check the UTF-16 length, leaving behind breadcrumbs at regular intervals. This allows us to provide amortized constant-time access to transcoded UTF-16 contents by scanning between breadcrumbs.

This is leveraged by String.UTF16View, so Swift code that imports Foundation and assumes constant-time access to the view also benefits.
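The idea can be sketched in a few lines. This is a simplified illustration and not the standard library's implementation: the type name UTF16Breadcrumbs, its layout, and the stride value are all assumptions made for the example. It records, at regular UTF-16 intervals, where the corresponding scalar starts in the UTF-8 storage, then answers lookups by jumping to the nearest breadcrumb and scanning a bounded distance.

```swift
// Simplified illustration only; the real breadcrumbs differ in layout and tuning.
struct UTF16Breadcrumbs {
    let bytes: [UInt8]   // the string's UTF-8 code units
    let stride: Int      // breadcrumb spacing, in UTF-16 code units
    private(set) var crumbs: [(utf16: Int, utf8: Int)] = []
    private(set) var utf16Count = 0

    // (UTF-8 length, UTF-16 length) of the scalar that starts with `lead`.
    static func widths(lead: UInt8) -> (utf8: Int, utf16: Int) {
        if lead < 0x80 { return (1, 1) }
        if lead < 0xE0 { return (2, 1) }
        if lead < 0xF0 { return (3, 1) }
        return (4, 2)  // outside the BMP: a surrogate pair in UTF-16
    }

    // One linear scan computes the UTF-16 length and drops the breadcrumbs.
    init(_ string: String, stride: Int) {
        precondition(stride > 0)
        self.bytes = Array(string.utf8)
        self.stride = stride
        var utf8Offset = 0
        while utf8Offset < bytes.count {
            let w = UTF16Breadcrumbs.widths(lead: bytes[utf8Offset])
            // Crumb k marks the scalar covering UTF-16 offset k * stride.
            while crumbs.count * stride < utf16Count + w.utf16 {
                crumbs.append((utf16: utf16Count, utf8: utf8Offset))
            }
            utf8Offset += w.utf8
            utf16Count += w.utf16
        }
    }

    // UTF-8 offset of the scalar that contains the given UTF-16 offset.
    func utf8Offset(forUTF16Offset target: Int) -> Int {
        precondition(target >= 0 && target < utf16Count)
        let crumb = crumbs[target / stride]
        var utf16Offset = crumb.utf16
        var utf8Offset = crumb.utf8
        // Scan forward from the breadcrumb; the distance is bounded by stride.
        while true {
            let w = UTF16Breadcrumbs.widths(lead: bytes[utf8Offset])
            if target < utf16Offset + w.utf16 { return utf8Offset }
            utf16Offset += w.utf16
            utf8Offset += w.utf8
        }
    }
}

// a, 🐶 (a surrogate pair in UTF-16), b, é, c
let sample = "a\u{1F436}b\u{E9}c"
let trail = UTF16Breadcrumbs(sample, stride: 2)
let offsets = (0..<trail.utf16Count).map { trail.utf8Offset(forUTF16Offset: $0) }
print(offsets)
```

A smaller stride means more memory for breadcrumbs and shorter scans; a larger one trades the other way, which is the granularity being tuned.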

We’ll be tweaking and tuning the granularity of these breadcrumbs and improving the scanning time, but this strategy has been proving sufficient for maintaining performance in realistic use cases.

For performance improvements in Cocoa interoperability, we’re working on some sweet bridging optimizations (simpler on a unified storage representation), but it’s too early to report back findings. We expect wins here to be far more important than a higher constant-factor on UTF-16 access.

Current Microbenchmark Issues

We landed with some known microbenchmark regressions that we knew we could fix with some elbow grease. We’re now applying elbow grease. Since this is such a substantial model change, it is far more important from a risk-management perspective to land this now to expose any unknown issues. Even so, net performance is substantially better.

We also have known gaps in our String benchmarking; we will be closing them and addressing any issues they expose.

Code Size

We haven’t started to tweak and tune code size, but this change already carries in some nice wins. A simpler model means less code and less reliance on heroic inlining for performance.

The stdlib binary is around 13% smaller with this change, which is a big win for Swift 5.0 applications that will back-deploy to pre-Swift-5 OSes. This also reduces memory usage and provides other system-wide benefits for post-Swift-5 OSes. The Foundation overlay is also around 5% smaller, as are others.

The source compatibility suite saw modest improvements, with an overall 2-3% shrinkage in total binary size. As I said, we haven’t started to tweak and tune, so this may improve more.

The Future of String Performance

Internal Improvements

We have many ideas for further performance enhancements to the internal implementation of String, such as:

  • Check for (or even guarantee) NFC-normalized contents upon creation, making canonical-equivalence comparison super fast

  • Cache more information on the storage class’s subsequent tail allocations, such as grapheme count and hash value

  • Perform fast single-scalar-grapheme checks and set relevant performance flags

  • Vectorize all the things, especially small strings

Low Level APIs

The most exciting aspect of the future of String performance is exposing low-level performant APIs. The unified storage representation allows us to expose low-level APIs on String that directly access the underlying storage. Previously, we’d have to expose a pair of each, one for ASCII storage and one for UTF-16 storage, and hope the developer remembers to test both paths. Now, we can expose something akin to the following (details/spellings for demonstration purposes only):

myString.withCodeUnits { codeUnitBuffer in
  // Access the contents as a contiguous buffer of `UInt8`
  // Awesome synergy with the character literals pitch
  ...
}

let str = String(withInitialCapacity: 42) { contentsPtr in
  // Initialize the string directly
  ...
  // Return the actual size we wrote in UTF-8 code units
  return actualSize
  // (UTF-8 validation is performed by String after closure is finished)
}

Of course, we need to figure out a strategy for communicating whether some existing String is native or a lazily-bridged NSString that does not provide contiguous UTF-8 contents. There are approaches with various tradeoffs: do the eager bridge, make everything optional, throw, trap, etc. Figuring this out will be the most important part of designing these APIs.
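To give a feel for the "make everything optional" option, here is a sketch built on Sequence's existing withContiguousStorageIfAvailable customization point. Whether a given String's UTF-8 view answers it with a buffer depends on the toolchain and on how the string is stored, so this shows the shape of the tradeoff, not the API being proposed here. The function countSpaces is a hypothetical example.

```swift
// Counts ASCII spaces, preferring a contiguous UTF-8 buffer when one exists.
func countSpaces(in string: String) -> Int {
    let fast: Int? = string.utf8.withContiguousStorageIfAvailable { buffer -> Int in
        var count = 0
        for byte in buffer where byte == 0x20 { count += 1 }
        return count
    }
    if let fast = fast { return fast }

    // Fallback: no contiguous UTF-8 available (for example, a lazily bridged
    // NSString), so iterate the view element by element.
    var count = 0
    for byte in string.utf8 where byte == 0x20 { count += 1 }
    return count
}

print(countSpaces(in: "a b c"))
```

The cost of this approach is visible in the code: every caller has to write and test a fallback path, which is the same burden the old dual-storage model imposed.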

Shared Strings

The branch also introduces support in the ABI (but currently not exposed in any APIs) for shared strings, which provide contiguous UTF-8 code units through some externally-managed storage. These enable future APIs allowing developers to create a String with shared storage from a [UInt8], Data, ByteBuffer, or Substring without actually copying the contents. Reads would be slightly slower as it will require an extra level of pointer-indirection, but avoiding the copy could be a big win depending on the situation.

How You Can Help

While we are attacking our known-unknowns (regressions and gaps in the benchmark suite), we would really like to get early feedback on the new String ABI. If you encounter any issues or performance regressions, please let us know. I’ll update this thread when toolchains are available for download on swift.org.

Huge thanks to @lorentey, @lancep, @johannesweiss, @David_Smith, and @scanon for helping make this happen!

edit: Explicitly mentioned that NSStrings are still lazily-bridged in without copy.