Posted by: The TensorFlow MLIR Team
The TensorFlow ecosystem contains a number of compilers and optimizers that operate at multiple levels of the software and hardware stack. For a day-to-day user of TensorFlow, this multi-level stack can manifest as hard-to-understand compiler and runtime errors when using different kinds of hardware (GPUs, TPUs, mobile).
These components, starting from the graph, could be summarized like this:
In this diagram, we can see that TensorFlow graphs can be run in a number of different ways. This includes:
- Sending them to the TensorFlow executor that invokes hand-written op-kernels
- Converting them to XLA High-Level Optimizer representation (XLA HLO), which in turn can invoke the LLVM compiler for CPU or GPU, or else continue to use XLA for TPU. (Or some combination of the two!)
- Converting them to TensorRT, nGraph, or another compiler format for a hardware-specific instruction set
- Converting graphs to TensorFlow Lite format, which is then executed inside the TensorFlow Lite runtime, or else further converted to run on GPUs or DSPs via the Android Neural Networks API (NNAPI) or related tech.
In addition, there are other even more sophisticated paths, including multiple rounds of optimization within each layer, such as the Grappler framework that optimizes tensor layout and operations in TensorFlow today.
While these numerous compiler and representation implementations substantially improve performance, this heterogeneous world can cause issues for end users, such as producing confusing error messages at the boundary between these systems. Also, new hardware and software stack creators must rebuild optimization and transformation passes for each new path.
With all this in mind, we’d like to announce MLIR, or Multi-Level Intermediate Representation. This is a representation format and library of compiler utilities that sits between the model representation and low-level compilers/executors that generate hardware-specific code. With MLIR, we want to enable novel explorations in optimizing compiler design and implementation, backed by production quality components.
We expect MLIR to be of interest to many groups, including:
- Compiler researchers and implementers looking to optimize performance and memory consumption of machine learning models
- Hardware makers looking for a way to connect their hardware to TensorFlow, such as TPUs, portable neural hardware in phones, and other custom ASICs
- People writing language bindings that want to take advantage of optimizing compilers and hardware acceleration
MLIR is, at its heart, a flexible infrastructure for modern optimizing compilers. This means it consists of a specification for intermediate representations (IR) and a code toolkit to perform transformations on that representation. (In compiler parlance, as you move from higher-level representations to lower-level representations, these transformations can be called “lowerings”, and we’ll use that term ahead.)
MLIR is highly influenced by LLVM and unabashedly reuses many great ideas from it. It has a flexible type system, and allows representing, analyzing and transforming graphs combining multiple levels of abstraction in the same compilation unit. These abstractions include TensorFlow operations, nested polyhedral loop regions, and even LLVM instructions and fixed hardware operations and types.
To separate different hardware and software targets, MLIR has “dialects”, including:
- TensorFlow IR, which represents all things possible in TensorFlow graphs
- XLA HLO IR, which is designed to take advantage of XLA’s compilation abilities (with output to, among other things, TPUs)
- An experimental affine dialect, which focuses on polyhedral representations and optimizations
- LLVM IR, which has a 1:1 mapping between it and LLVM’s own representation, allowing MLIR to emit GPU and CPU code through LLVM
- TensorFlow Lite, which will translate to running code on mobile platforms
Each dialect consists of a set of defined operations which have invariants placed on them, like: “This is a binary operator, and the inputs and outputs have the same types.”
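The "operations with invariants" idea can be sketched with a toy verifier. This is a hypothetical Python illustration of the concept only; `Op` and `verify_same_type_binary` are made-up names, not the real MLIR API (which is C++):

```python
# Toy sketch of a dialect operation carrying an invariant of the form
# "binary operator whose inputs and output share the same type".
# Hypothetical illustration, not the actual MLIR interfaces.
from dataclasses import dataclass
from typing import List


@dataclass
class Op:
    name: str                 # e.g. "tf.AddV2"
    operand_types: List[str]  # type names as strings, e.g. "tensor<f32>"
    result_types: List[str]


def verify_same_type_binary(op: Op) -> bool:
    """Invariant: two operands, one result, all of the same type."""
    return (len(op.operand_types) == 2
            and len(op.result_types) == 1
            and len(set(op.operand_types + op.result_types)) == 1)


ok = Op("tf.AddV2", ["tensor<f32>", "tensor<f32>"], ["tensor<f32>"])
bad = Op("tf.AddV2", ["tensor<f32>", "tensor<i32>"], ["tensor<f32>"])
```

In MLIR proper, such invariants are declared on each operation and checked by the verifier, so every pass can rely on them holding.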
MLIR has no fixed/built-in list of globally known operations (no “intrinsics”). Dialects can define entirely custom types, which is how MLIR can model things like the LLVM IR type system (which has first class aggregates), domain abstractions important for ML-optimized accelerators like quantized types, and even the Swift or Clang type systems (which are built around Swift/Clang declaration nodes) in the future.
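To make "custom types" concrete, here is a toy version of the kind of quantized element type a dialect might define: a low-bit-width storage value plus a scale and zero point. This is a hypothetical sketch, not MLIR's actual quantized type system:

```python
# Toy quantized type: stores real values as clamped 8-bit integers via an
# affine mapping (scale, zero_point). Hypothetical illustration only.
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantizedType:
    bits: int
    scale: float
    zero_point: int

    def quantize(self, x: float) -> int:
        """Map a real value to the nearest representable integer, clamped."""
        q = round(x / self.scale) + self.zero_point
        return max(0, min(2 ** self.bits - 1, q))

    def dequantize(self, q: int) -> float:
        """Recover the (approximate) real value."""
        return (q - self.zero_point) * self.scale


qt = QuantizedType(bits=8, scale=0.05, zero_point=128)
```

Because the type carries its own semantics (scale, zero point, clamping range), optimization passes can reason about quantized arithmetic directly instead of treating it as opaque integer math.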
If you want to connect a new low-level compiler, you would create a new dialect and the lowerings between the TensorFlow Graph dialect and your dialect. This smooths the path for hardware and compiler makers. You can even target dialects at different levels in the same model; the higher-level optimizers will leave the unfamiliar parts of the IR untouched and wait for a lower level to handle them.
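Conceptually, a lowering is a set of rewrite rules that translate operations from one dialect into equivalent operations in another, leaving anything it does not recognize for a later stage. A minimal sketch with made-up op names ("mydialect" is hypothetical; real MLIR lowerings are pattern-based rewrites written in C++):

```python
# Toy progressive lowering: rewrite high-level "tf" ops into a made-up
# lower-level "mydialect", passing unknown ops through untouched so a
# later, lower-level pass can handle them. Hypothetical illustration.
LOWERINGS = {
    "tf.MatMul": ["mydialect.load_tiles", "mydialect.mac_array"],
    "tf.Relu":   ["mydialect.max0"],
}


def lower(ops):
    out = []
    for op in ops:
        # Unfamiliar ops are kept as-is rather than rejected -- this is
        # the "respect the unfamiliar parts of the IR" behavior.
        out.extend(LOWERINGS.get(op, [op]))
    return out


program = ["tf.MatMul", "tf.Relu", "tf.Print"]
```

Here `tf.Print` survives the pass unchanged, so ops from multiple abstraction levels can coexist in one compilation unit.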
For compiler researchers and framework makers, MLIR allows you to compose transformations at every level, and you can even define your own operations and abstractions in the IR — allowing you to best model the domain of problems you are trying to solve. In this way, MLIR is more of a pure compiler infrastructure than LLVM.
While MLIR acts as a compiler for ML, we also see it enabling the use of machine learning techniques within compilers! This is particularly important because the pool of engineers developing numerical libraries cannot grow at the same rate that ML models and hardware diversify. The extensibility of MLIR facilitates the exploration of code lowering strategies and performing progressive lowering across abstractions.
We have opened the GitHub repository and welcome your interest (check out our tutorial!). We will be releasing more of this toolkit — including the specifications for both the TensorFlow and TF Lite dialects — in the coming months. We look forward to telling you more about this; for details see Chris Lattner’s talk from c4ml and our README on GitHub.
If you want to keep up on all things related to MLIR, please join our new mailing list which will be focused in the short term on announcements as we release more of this project. Stay tuned!
In TensorFlow 2.0, graphs can be implicit: eager execution can run ops individually, in groups, or as full graphs (such as a Keras Sequential model). Either way, these graphs or graph fragments must be optimized and executed.