WebAssembly as Meta-Language


WebAssembly as Meta-Language, Java-As-WebAssembly

Ben L. Titzer

2020-08-03

ABSTRACT

        WebAssembly offers a compact intermediate code format that allows applications to achieve portable performance competitive with native code. Despite success as a low-level target for unsafe languages, support for higher-level languages is still nascent. To support many languages, several low-level Wasm object systems have been discussed, the most prominent referred to as simply “the Wasm GC proposal.” This proposal offers statically typed structs, arrays, and first class functions, constructs out of which each language is meant to build its concepts, such as classes. This is a fundamentally reductive approach where a higher-level concept is reduced to a simpler lower-level set of mechanisms. The reductive approach works exceedingly well in most situations, yet has proved challenging for this particular problem. As decades of type theory show, it is difficult to design any low-level object type system that is efficient (little to no runtime overhead), universal (expressive enough to encode many high-level constructs), statically type-safe, and approachable (does not place too high a complexity cost on producers).

        This paper argues that because a purely reductive universal object model will always be fundamentally incomplete, failing to support the next language’s features well, the only long-term solution is proper layering. Language runtimes on top of Wasm must handle language-specific constructs without exposing implementation details. Layering solves a class of problems beyond and orthogonal to what is addressed in the Wasm GC proposal, indeed what could be addressed by any GC proposal, including the late binding inherent in many language semantics, such as class inheritance across module boundaries. We propose to achieve layering through parameterization, allowing Wasm to be extended with new values, types, and operators. We show how to add the missing parameterization mechanisms to Wasm’s existing import system, allowing Wasm to serve the role of a meta-language, i.e. a language which is used to encode other languages. We motivate and explain the WebAssembly-as-meta-language concept by way of a detailed walkthrough of encoding the JVM bytecode set into Wasm. The encoding shows that the meta-language is enough to express all Java semantics in a way that is completely independent of implementation details.

        We show several implementation strategies for a simplified JVM, the Jawa Virtual Machine, on top of a Wasm engine extended with our proposed meta-language facilities. Jawa explores several completely novel points in the virtual machine design space, where each strategy represents different patterns of cooperation and trust with the Wasm engine. Jawa is extreme in its reuse of the Wasm engine’s mechanisms, which keeps it incredibly simple. Surprisingly, it has none of the classical components of a virtual machine--no interpreter, no JIT compiler, no classloader, no garbage collector. It is little more than a runtime stub generator, and yet it can implement an entire JVM. This, we believe, is significant progress towards the holy grail of language implementation by reusing an existing, high performance VM with minimal changes. We show five different implementation strategies, including one that targets the Wasm GC proposal, one that does not, and one that needs absolutely no support from the Wasm engine at all. Thus we believe the meta-language concept is fully independent of implementation choices.

INTRODUCTION

        At its core, Wasm is an architecture-independent virtual machine with primitives such as integers, floating point numbers, and large, byte-addressable memories. Bytecode instructions map fairly directly to hardware instructions. These low-level constructs serve as a minimal abstraction over hardware. For structuring programs, Wasm offers functions and modules but no higher-level concepts such as objects or closures that are typical implementation mechanisms for higher-level languages. Instead, to target Wasm today, higher-level languages must lower their constructs as when they target native hardware architectures, laying out their data structures as raw bytes in memory and expressing all accesses to them as pointer loads and stores. For languages that have existing native implementations, retargeting to Wasm is relatively easy and offers a portability and platform advantage. However, implementations aren’t much simpler; they must still make detailed representation decisions in lowering to Wasm. Languages without existing native implementations face the entire lowering problem with no help at all.

        Garbage collection support is a necessary platform service that must be added to Wasm. Many higher-level languages employ automatic memory management with garbage collection, and though many Wasm engines are part of JavaScript virtual machines that do have sophisticated garbage collectors, those GCs are not exposed in Wasm itself. Instead, languages must bring their own garbage collectors and treat Wasm memory as if it were machine memory, manually finding roots on the stack and tracing through the linear memory. If every one of dozens or hundreds of languages were required to bring its own garbage collector, total system complexity would be large and much effort would be duplicated. Further, many lifetime problems ensue where correct and efficient reclamation of memory depends on cooperative memory management in systems not designed to cooperate. Circular references between a managed and unmanaged memory space or between two managed spaces that use handles can be particularly troublesome. The complex interactions inevitably lead to memory leaks, as typified by the long story of the interplay between V8 and Chrome’s memory management[1].

        We must raise the level of abstraction with automatically managed data structures. The current Wasm GC proposal is designed to offer a new set of medium-level mechanisms for higher-level languages to target Wasm. Instead of laying out data structures as raw bytes and pointers, language implementations can use statically-typed structs, arrays, and closures to build higher-level concepts such as classes, objects, and method dispatch. This slightly higher level of abstraction reduces (but does not eliminate) the amount of lowering work. The rest of the lowering problem is solved by the Wasm engine which maps medium-level concepts to hardware, reducing duplication of effort and allowing for hardware-specific optimizations that only the Wasm engine understands how to perform. But better, it completely eliminates the need for a language implementation to provide a memory management solution. In much the same way that Wasm benefits from the sophisticated JITs in Web VMs, gaining excellent performance yet marginal added complexity, we expect to leverage their sophisticated garbage collectors now too.

        The Web Platform Value-add of WebAssembly GC is an efficient statically-typed object model. Being born of and still mutually benefiting from the Web Platform, Wasm should strive to continue to provide benefit to that platform. While in theory a particular embedding should rarely provide hard design requirements on core Wasm, it is a stark reality that Wasm’s GC design must provide a value-add to the Web Platform. In particular, the Web Platform already contains a complex, dynamically-optimized object system in the form of JavaScript. As such, many languages have completely adequate compile-to-JavaScript strategies that Wasm should not seek to displace for the sole reason of increasing its market share. It must offer more value. Rather, there are many languages with completely inadequate compile-to-JavaScript strategies, or no strategy at all, that should be served by new Wasm mechanisms. Most of the shortcomings from compiling to JavaScript come from inefficient object representation because few things can be statically typed. Thus we believe the Wasm GC proposal needs to provide object representations that are as efficient or nearly as efficient as most languages’ native implementations.

THE LATE BINDING PROBLEM

        Many higher-level languages target virtual machines and compile code to a platform-independent format such as JVM class files, DEX files, or .NET assemblies which contain source-level constructs such as classes, methods and bytecode. Bytecode references source-level concepts such as fields and methods by name rather than by memory address or field offset, enabling separate compilation. Class inheritance, method overriding, method overloading, and other details are resolved when code is combined at link time, which for most of these languages is during execution. For Java, class loading and initialization is observable because it is mixed with the dynamic execution of the program. Despite attempts at transparent on-disk caching and many tries at offline optimization, the JVM platform is fundamentally based on late binding, a fact made stark by the existence of a dynamic code generation facility and set of reflection mechanisms. The Android platform, .NET platform, and others all have the same or similar mechanisms.

        Late binding of lowered code creates ABI problems where there were none before. For languages like Java and C# where separate compilation allows different parts of the program to be compiled independently, lowering too early breaks apart and even discards the very source-level information necessary to make late binding work properly. For example, consider two Java classes A and B. Suppose A is defined in a library, B is defined in a user program, and B extends A. B might inherit fields defined by A and add its own fields. When lowered to either a machine or even a medium-level object model, the language implementation must decide, separately, the placement of fields within instances of B and A. The typical, most efficient implementation strategy is to append the fields of B after the fields of A so that instances of B are memory-layout-compatible with instances of A. Field accesses are then translated to static offsets into objects. There is then no execution overhead for using an instance of B in place of A. However, now there is a dependency because B’s layout cannot be computed until A’s is known. This is the first and most obvious ABI problem created by lowering too early. Because the field offsets are static immediates to instructions (i.e. they are not computed at runtime), it would necessitate a mechanism to import them across a Wasm module boundary. Of course, that is not the only solution. Another approach would be to allow B to declare a nested, inline A as its first field. Yet here again, the size of A is not known across the module boundary, nor what fields it contains, so the import mechanism needs to be able to declare constraints on A’s definition. There are many other variations on these basic schemes, trading off performance for flexibility. They are all trying to work around ABI problems that are created by the wrong approach of lowering too early.
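        The layout dependency described above can be made concrete with a minimal sketch. It assumes every field occupies four bytes and that field names are distinct across the hierarchy; the names ClassDef, layout, and FIELD_SIZE are illustrative and not part of any proposal. The sketch shows that the offset of a field declared in B changes whenever the library's definition of A changes, so the offset cannot be baked into B's module ahead of time.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

FIELD_SIZE = 4  # assumption: every field is a 4-byte i32


@dataclass
class ClassDef:
    name: str
    fields: List[str]  # fields declared by this class only
    superclass: Optional["ClassDef"] = None


def layout(cls: ClassDef) -> Dict[str, int]:
    """Map field name -> byte offset, placing superclass fields as a prefix."""
    offsets = dict(layout(cls.superclass)) if cls.superclass else {}
    base = len(offsets) * FIELD_SIZE
    for i, name in enumerate(cls.fields):
        offsets[name] = base + i * FIELD_SIZE
    return offsets


# Library version 1: A has two fields. The user program defines B extends A.
a_v1 = ClassDef("A", ["x", "y"])
b_v1 = ClassDef("B", ["z"], superclass=a_v1)

# Library version 2: A gains a field. B's source is unchanged.
a_v2 = ClassDef("A", ["x", "y", "w"])
b_v2 = ClassDef("B", ["z"], superclass=a_v2)

offset_z_v1 = layout(b_v1)["z"]
offset_z_v2 = layout(b_v2)["z"]
```

        Under these assumptions the offset of z differs between the two library versions even though B itself did not change, which is the dependency that a static immediate cannot express.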

        Late binding of lowered code breaks semantic guarantees of source languages. It is not just field prefixing and object sizes that need to be resolved. Simply consider a Java class A that declares two different fields f and g, both of type int, which maps to Wasm’s i32. Across a module boundary, it is not enough to know that A has two fields of type i32. The user module also depends on the mapping of source-level names to implementation-level offsets, i.e., whether f comes first or g comes first. That of course creates a worse ABI problem, since the order could change for a number of reasons. To solve that, we would need to adopt a further ABI convention, e.g. sorting fields according to some criteria. But no ABI convention can resolve the problem using a scheme of offsets and indirection alone. To see why, consider that a user module needs to know that these two fields are actually f and g and that nothing unexpected has happened, such as g having been deleted and a new int/i32 field h added in its place. Such a modification is an IncompatibleClassChangeError in Java, but might be a dynamic error in another language. In either case, if it goes unnoticed, the user module will silently misbehave instead of crashing, believing that field h is field g. The general problem demonstrated here is that user modules can become confused about definitions in other modules in ways that pass the Wasm type system but are completely wrong according to their language semantics. Solving this particular problem requires that all fields’ source names be inspected at link time. The obvious conclusion is that, fundamentally, correct late binding requires source-level information.
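        The f/g/h scenario can be sketched as follows. This is an illustration under stated assumptions, not a proposed mechanism: a class is modeled as an ordered list of (source name, Wasm type) pairs, and the exception class IncompatibleClassChange is a hypothetical stand-in for Java's IncompatibleClassChangeError. Binding by offset only sees the Wasm type and silently accepts h in place of g, whereas binding by name detects the change at link time.

```python
from typing import List, Tuple

Field = Tuple[str, str]  # (source-level name, Wasm-level type)


class IncompatibleClassChange(Exception):
    """Hypothetical stand-in for Java's IncompatibleClassChangeError."""


def resolve_by_offset(provider: List[Field], slot: int, wasm_type: str) -> str:
    """Offset-only binding: checks the Wasm type but never the source name."""
    name, actual = provider[slot]
    if actual != wasm_type:
        raise IncompatibleClassChange(f"slot {slot} has type {actual}")
    return name  # the field the user module will actually touch


def resolve_by_name(provider: List[Field], name: str, wasm_type: str) -> int:
    """Name-based binding: finds the slot for a source-level field at link time."""
    for slot, (field_name, actual) in enumerate(provider):
        if field_name == name:
            if actual != wasm_type:
                raise IncompatibleClassChange(f"{name} has type {actual}")
            return slot
    raise IncompatibleClassChange(f"no field named {name}")


a_old = [("f", "i32"), ("g", "i32")]  # what the user module was compiled against
a_new = [("f", "i32"), ("h", "i32")]  # g deleted, h added in its place

# The user module believes slot 1 is g.
touched = resolve_by_offset(a_new, 1, "i32")
```

        Here touched is h, not g, and nothing at the Wasm level objects. Calling resolve_by_name(a_new, "g", "i32") raises instead, which is only possible because the source name was available to the linker.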

        No, really, late binding of lowered code breaks semantic guarantees of source languages. It’s tempting to design specific mechanisms to combat individual late-binding problems one by one. For example, there have been Wasm proposals to add a mechanism for importing struct definitions in other modules that allows their fields to be properly linked, essentially late-binding their offsets. It has been further suggested that maybe a limited, specific kind of inheritance mechanism could be added to express this pattern more succinctly. Most such suggestions already start to break down when we consider naming issues like the one already pointed out. But these approaches are doomed because inheritance in many languages is not just based on names, but often types as well.

Consider Java single-inheritance between classes. This is typically implemented by each object pointing to a vtable, which is essentially a struct containing function pointers. Like object field prefixing, vtables have the same set of problems with late binding offsets, except the offsets are not for object fields, but the function pointers in the vtable. It’s worse now because which method in a subclass overrides its superclasses’ methods (i.e. which slot in the vtable it should map to) is not based just on names, but the entire method signature. No name mangling or scheme of offsets can express that in Wasm.
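        A minimal sketch of vtable slot assignment illustrates why names alone are insufficient. It assumes methods are identified by a (name, descriptor) pair in the style of JVM descriptors and ignores access modifiers, static methods, and interfaces; build_vtable and the implementation labels are illustrative names only.

```python
from typing import Dict, List, Optional, Tuple

MethodKey = Tuple[str, str]   # (name, JVM-style descriptor such as "(I)V")
Slot = Tuple[MethodKey, str]  # (method key, label of the implementation)


def build_vtable(parent: Optional[List[Slot]], owner: str,
                 methods: List[MethodKey]) -> List[Slot]:
    """Copy the parent's vtable, overriding slots whose full key matches."""
    vtable: List[Slot] = list(parent) if parent else []
    slots: Dict[MethodKey, int] = {key: i for i, (key, _) in enumerate(vtable)}
    for key in methods:
        impl = f"{owner}.{key[0]}{key[1]}"
        if key in slots:
            vtable[slots[key]] = (key, impl)  # same name AND signature: override
        else:
            slots[key] = len(vtable)
            vtable.append((key, impl))        # new overload: fresh slot
    return vtable


# A declares two overloads of m; B overrides one and adds a third overload.
vtable_a = build_vtable(None, "A", [("m", "(I)V"), ("m", "(J)V")])
vtable_b = build_vtable(vtable_a, "B", [("m", "(J)V"), ("m", "(D)V")])
```

        All four methods share the name m, yet B's m(J)V must land in slot 1 and its m(D)V in a new slot 2. The slot for a subclass method therefore depends on the superclass's full set of signatures, which is unknown across a module boundary.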

Worse problems are waiting after that. A malicious or buggy program P can misbehave in new ways. For example, P could intentionally skip a required virtual dispatch and instead just directly call a super class’s implementation of a method. That call will type-check at the Wasm level and will even run without dynamic type errors, but it will instead silently misbehave, perhaps in a way that is a security vulnerability. P has violations of source-level guarantees that cannot be checked by Wasm, because it has access to implementation details that were impossible to use or misuse before. If P’s problems are innocent and arise because of a buggy source compiler, programmers are completely mystified as they cannot see bugs in P’s source code, the compiled P typechecks and runs without crashing, but nevertheless, it behaves oddly. Essentially, language-level semantics, and thus security mechanisms, cannot be trusted; not the ones in the language itself, nor the ones application and library programmers try to build. Again, only correct late binding can maintain language invariants, and that requires source-level information.

        This problem is not specific to any proposed lowering constructs. It is tempting to consider solving the problems created by lowering a specific language to a specific set of target constructs by increasing the expressive power of that target. For example, we have seen examples of lowering strategies that lose source-level semantic information, but perhaps we need to just keep adding binding mechanisms until those languages are done? Or perhaps we need to throw away our current set of lowering constructs and find another? Is there eventually a set of constructs that do not lose source information? Well, that is clearly not the case, since each time we consider a new language with a construct we haven’t seen before, we are presented with a new kind of binding problem. In general, the “full abstraction” problem will always remain: new programming languages will continue to invent crazier and more bizarre constructs that don’t map onto the constructs that Wasm provides. While they can and will lower to a universal set in order to simply execute, they will need late binding constructs that cross module boundaries and that contain source information.

ENTER WASM-AS-META-LANGUAGE

        This paper argues that solving the late-binding problem requires expressing source-level constructs from arbitrary programming languages in Wasm itself. To serve that purpose, Wasm must have a mechanism to encode source-level constructs without losing any of their semantic information. Such a mechanism is by definition a language for describing languages, i.e. a meta-language. While that might sound abstract, bizarre, inefficient, or like a big change, it is none of those. In fact the very concrete, straightforward, and efficient Wasm we have today needs only small additions. It won’t require many new constructs, nor will it forgo existing efficient representations, static or dynamic optimization strategies, or code caching. We simply need to generalize a little.

        Some Wasm constructs for manipulating unknown values already exist. First, Wasm already has a mechanism for describing unknown computations: imported functions. Such functions are given names in a two-level namespace and a signature of input types to output types, allowing code within the module to be typechecked according to those signatures. Imported functions can come from another module or from the host environment, such as an IO API or Web API. They are part of the same index space of functions as those defined in the module and are thus interchangeable with “native” Wasm functions. Second, Wasm already has a mechanism for describing unknown values from its host environment: importing of immutable global variables. It also has a mechanism for describing a (single) unknown type: the externref type. These constructs allow code within a module to traffic in opaque values from the host environment. With function imports, code in a module can execute operations on the opaque values obtained through global imports (or the return value of other imported functions) by simply calling the imported functions with those values as arguments. In effect, these mechanisms allow programs to drive a host API as if it were a language extension to Wasm itself.

        We need a more powerful type import mechanism. While the addition of the Wasm externref type allows Wasm modules to talk about values from the host environment, there is no way to distinguish between different types from the host environment. In particular, host APIs often have lots of different classes or interfaces, and externref maps them all onto a single type. Without distinguishing host types statically, host APIs are forced to dynamically check the types of arguments they receive from Wasm programs. That is not only an efficiency issue, but comes with all of the normal disadvantages of any dynamically typed API, not the least of which is the lack of checkable documentation. Another disadvantage of the externref type is that it exposes the representation of a host environment’s values, because all externref values are assumed to be pointers to on-heap data structures, and variables of type externref are assignable from the (one) null externref value. In general, there is no reason to expect that every host environment that exposes values to Wasm will expose only references. Many programming languages already have constructs for exactly this scenario in the form of type parameters. Programming languages that allow any type to be a type argument demonstrate the power of this abstraction: code within the scope of the type parameter has no dependency on the representation of values of that type. Wasm needs a type parameterization mechanism that is similarly abstract, where the module has absolutely no knowledge of nor dependence on the representation of values of an imported type. While it would seem that the type imports proposal would fit the bill, currently that proposal only allows type definitions (i.e. struct, array, or function signatures) to be imported. Type definitions are not types; they can only be used by the ref and ref null type constructors to create types. Moreover, type definitions are Wasm-level constructs, not source-level constructs.
As before, that is not enough. We need a mechanism to describe arbitrary, unknown value types from an unknown programming language, with arbitrary, unknown value representation and hereby propose to add that to Wasm. Of course, there will be implementation issues because an engine needs to eventually know the machine representation of type parameters in order to generate machine code, but we will get to that.

        To complete the picture, we need to parameterize imports themselves. While it might not be immediately clear from the outset, we will see in our case study of Jawa, detailed in the next section, that imports of values, types, and operators need to be parameterized over Wasm types and functions. The need is not specific to Jawa. In general, a programming language might have type parameters (i.e. types that can themselves be parameterized over other types), or it might have classes (whose definition consists of a number of function bodies which are defined as Wasm functions), or it might need to provide default values of a particular type (i.e. an import of a global of a particular, imported type).

        Values, types, and operators are the only key extensibility dimensions. Programming languages are extremely diverse and an essentially infinite design space. How can we design a meta-language that we are sure captures all dimensions of extensibility? After all, this paper has spent multiple pages arguing that a reductive approach cannot solve the late binding problem because that inherently requires the preservation of semantic information from source languages. Do values, types, and operators cover the entire space of extensibility for programming languages? Well, no. Instead, this paper argues that these dimensions are the only key extensibility dimensions remaining if we can get a complete set of control and data flow abstractions. In particular, since Wasm has modules, functions, local variables, global variables, tables, branches, loops, call stacks, exceptions, and possibly stack switching, we argue that the only remaining dimensions of extensibility are the ability to add new values to express new kinds of data structures, new types to organize those values into statically checkable sets, and new operators on those values. We believe all three of these can be encoded into Wasm’s import and export mechanisms. Later, we will see how and where the reductive approach is applied, since these new constructs will need to be lowered to run on Wasm.

        Is the import/export system really the right place to encode the meta-language? As argued up to this point, we must represent source-level concepts in Wasm in order to achieve late binding. But where, specifically, and how specifically, should we encode it? The Wasm binary format offers custom named sections, where any arbitrary data can be inserted into a module. Tools, and particularly engines, generally ignore these extra sections. It would seem a natural place for additional data that isn’t meaningful to engines. However, we observe that although this source-level semantic information may not be useful to the execution engine itself, it is useful to the relations between modules. In fact, it is vital, as it is part and parcel of the process of late binding. The import mechanism is the natural place to encode this semantic information because it is how a module describes its requirements of the outside world. The import mechanism is not just a tempting place to add the meta-language; the meta-language also requires parameterization over Wasm module elements like types and functions. That poses a problem for custom sections, because references to module concepts cannot be encoded into raw bytes without becoming opaque to normal module transformations such as dead code elimination or combining modules. By encoding the meta-language into the import/export mechanism, it becomes possible for general module transforms to compose without needing to inspect custom sections looking for opaque references to Wasm module elements.

JAWA (Java-as-WebAssembly)

        It is time to get ruthlessly concrete.

        Wasm must be able to implement large, complex, high-level programming languages well. To motivate that endeavor, we chose the Java virtual machine as a sufficiently important and challenging candidate. The Java virtual machine is the primary compilation target not only for the Java source language, but for a whole ecosystem of JVM languages. Thus, focusing on implementing the JVM as opposed to just the Java source language brings the entire ecosystem. As can be seen from the prior examples in this paper, Java is a class-based object-oriented language that requires late binding of constructs. These binding problems are typical of languages with single implementation inheritance and multiple interface inheritance. Java also has a very large class library, the Java Development Kit (JDK), that includes tens of thousands of classes. Because the JDK is so large, it saves space and build time to compile it separately from applications and have the runtime environment provide it, rather than having each application package the JDK separately. Further, there is tight coupling between the JDK and the JVM, as there are many classes that are integral to the source semantics of Java. Thus the JDK is effectively inseparable from the JVM, and cannot be bundled with an application. The late binding problem is basically inherent to the JVM model.

        Java source code is compiled to class files, each of which contains a binary representation of a single declared source-level class or interface. References to all other classes, including the superclass of a class, are encoded into a table where UTF-8 strings contain the original, unmangled source names. Fields and methods are encoded similarly, which include not only their names but their full source-level declared types, in order to determine method overloading and overriding. Bytecode methods correspond nearly one-to-one to source methods. Within a method, control flow is expressed as jumps and switches, and dataflow is expressed with statically-indexed, statically-typed local variables and an implicitly but statically-typed operand stack.

        The JVM bytecode set constitutes the complete set of Java operations available to a program. It offers primitive types such as fixed-sized integers and floating point numbers and a set of familiar arithmetic and comparison bytecodes. All of these map easily onto hardware instructions—and even onto Wasm’s arithmetic bytecodes. Thus for our purposes we can consider the JVM primitive arithmetic to be uninteresting. The remaining set of bytecodes that deal with Java objects, arrays, and method calls are the interesting ones and require the use of a meta-language to describe.

        Exceptions on the JVM are handled by way of a side-table attached to the bytecode of methods. The side-table is a list of bytecode offset ranges and associated exception handler offsets. The JVM has a one-phase exception handling strategy; i.e. throwing an exception unwinds the call stack until a handler whose range covers the throw or call site is found. Each side-table entry may also name a catch type that the JVM checks against the thrown exception, but that check need not be part of the lowered matching criteria. Instead, user code in an exception handler may perform dynamic instanceof tests to determine whether to handle an exception, rethrowing it if not matched. We believe this mechanism can be mapped onto Wasm’s proposed exception handling mechanism just as normal local control flow is mapped. Thus for the purposes of this paper, throwing and catching Java exceptions is uninteresting.

Jawa Imports: Types, Globals, Functions, and...Commands?

        With the stage set, we can now begin describing how we encode Java into Wasm’s upgraded import/export meta-language. We will refer to this encoding as Jawa (i.e. Java-as-WebAssembly) and to the runtime system necessary to link and execute it as the Jawa Virtual Machine. The Jawa VM is not a traditional virtual machine: it contains no interpreter, bytecode loader, verifier, or JIT compiler. The Wasm engine already contains all of those things! Instead, the Jawa VM only needs to maintain an internal dictionary of classes and process Jawa imports one at a time.

        The Jawa Virtual Machine is an import processor.  All imports in Wasm have two names: a “module” string and a “field” string. Despite the second name being called “field”, imports can be of a type, a function, a global, a table, or a memory, and the kind is specified as part of the import. To use Jawa, a Wasm module simply uses the literal string “jawa” as the module name of the import. The Jawa VM is installed in the Wasm engine so that all imports from the “jawa” module are serviced through it. Other imports in a Wasm module are ignored by Jawa. Unlike other Wasm modules, the Jawa module does not have a finite set of field names, but instead parses and interprets a meta-language encoded in the field name. For type imports, Jawa must parse the name and provide a type to a module, either by looking up the Wasm-level type for a class in the internal dictionary or by creating a new Wasm-level type. For function imports, which we will see correspond to the “interesting” JVM bytecodes, it must provide a callable function. For global imports, it must supply a global variable initialized to a constant (either a class object or string). Finally, we introduce a new kind of “side-effect only” import that is used for finishing class definitions.
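        The routing described above can be sketched in a few lines. This is an illustration only: the Import record, the service_imports function, and the string results returned by the handlers are hypothetical, and a real engine would hand back types, functions, and globals rather than strings. Only the module name “jawa” and the idea of dispatching on import kind come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

JAWA_MODULE = "jawa"  # the literal module string named in the text


@dataclass(frozen=True)
class Import:
    module: str   # first-level import name
    field: bytes  # second-level name; for Jawa it carries the encoded request
    kind: str     # "type", "func", "global", ...


Handler = Callable[[bytes], str]


def service_imports(imports: List[Import],
                    handlers: Dict[str, Handler]) -> List[Optional[str]]:
    """Service imports from the jawa module; leave all others untouched (None)."""
    results: List[Optional[str]] = []
    for imp in imports:
        if imp.module != JAWA_MODULE:
            results.append(None)  # not ours: some other resolver handles it
            continue
        results.append(handlers[imp.kind](imp.field))
    return results


# Placeholder handlers that just describe what they were asked for.
handlers: Dict[str, Handler] = {
    "type": lambda field: "type:" + field.hex(),
    "func": lambda field: "func:" + field.hex(),
    "global": lambda field: "global:" + field.hex(),
}

imports = [
    Import("jawa", bytes([0x44]), "type"),
    Import("env", b"print", "func"),
]
results = service_imports(imports, handlers)
```

        The point of the sketch is that the set of field names is open-ended: the handler interprets the bytes of the field name rather than looking it up in a fixed export list.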

        Abstract Type imports. As we have proposed here in this paper, type imports are a new mechanism by which a Wasm module can import an abstract type from the outside world, either another module or the host environment. Abstract type imports can then have a list of subtype bounds. The bounds describe subtype relations that must hold for any type that would ever be supplied as the argument to an instance of this module. With type bounds, the code within a module can be type-checked independently of any binding of actual types to type imports.
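        As a sketch of how bounds might be checked when a module is instantiated, the following assumes a nominal hierarchy given as a map from each type to its direct supertypes. The functions is_subtype and check_binding and the example type names are illustrative assumptions; the proposal itself does not prescribe this representation.

```python
from typing import Dict, List, Set


def is_subtype(sub: str, sup: str, supertypes: Dict[str, Set[str]]) -> bool:
    """Reflexive, transitive subtype test over declared direct supertypes."""
    if sub == sup:
        return True
    seen: Set[str] = set()
    stack = [sub]
    while stack:
        current = stack.pop()
        for parent in supertypes.get(current, set()):
            if parent == sup:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


def check_binding(actual: str, bounds: List[str],
                  supertypes: Dict[str, Set[str]]) -> bool:
    """A type may be bound to a type import only if it satisfies every bound."""
    return all(is_subtype(actual, bound, supertypes) for bound in bounds)


# Hypothetical hierarchy supplied by the outside world at instantiation time.
hierarchy: Dict[str, Set[str]] = {
    "Shape": {"Object"},
    "Circle": {"Shape", "Comparable"},
}

ok = check_binding("Circle", ["Shape", "Comparable"], hierarchy)
rejected = check_binding("Shape", ["Comparable"], hierarchy)
```

        Code inside the module is checked once against the bounds alone; the check above is the only work left for link time.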

        Jawa Type Imports. Type imports in Jawa can name built-in types like primitive arrays, classes by their name, interface types by their name, or arrays of a reference element type. The shape of the type must be encoded into the name of the import. We chose a binary encoding where the first byte indicates the shape of the type, and immediates may follow. But encoding immediates into the string is not enough. In Jawa type imports we are not just importing types; we can actually apply type constructors according to the rules of a small type language. For example, in Java, we can build the type T[] from the type T, so in Jawa we can pass a reference type argument to a type import which applies the array type constructor and produces a new type. But we must also be able to do more than just look up types and apply type constructors; we must be able to declare new types for the classes a Wasm module would like to define. For example, to forward-declare that a module intends to define a new class, we (perhaps paradoxically) must import that type from the Jawa system. This effectively requests that the system create a new Jawa class and type with our chosen name, superclass, and implemented interfaces. Because Java classes can refer to their own type in their body (in fact, class bodies can be mutually recursive), we must forward-declare classes that will be defined later by a module so that we can use the forward-declared type in defining their bodies. We list all of the type import syntax below. In the table, we list the Type Import (hex code), which is the first byte of the field name; the Imm, or immediates, which are binary data following the first byte; the Import Parameters, which are Wasm module elements like previously imported types; the Type Constraints, which are Wasm-level subtype constraints; and a description.

| Type Import (hex code) | Imm | Import Parameters | Typical Type Constraint(s) | Description |
|---|---|---|---|---|
| BYTE_ARRAY (40) | | | <: $jawa/lang/Object | Import the Jawa byte array type, aka [B |
| BOOL_ARRAY (41) | | | <: $jawa/lang/Object | Import the Jawa bool array type, aka [Z |
| SHORT_ARRAY (42) | | | <: $jawa/lang/Object | Import the Jawa short array type, aka [S |
| CHAR_ARRAY (43) | | | <: $jawa/lang/Object | Import the Jawa char array type, aka [C |
| INT_ARRAY (44) | | | <: $jawa/lang/Object | Import the Jawa int array type, aka [I |
| LONG_ARRAY (45) | | | <: $jawa/lang/Object | Import the Jawa long array type, aka [J |
| FLOAT_ARRAY (46) | | | <: $jawa/lang/Object | Import the Jawa float array type, aka [F |
| DOUBLE_ARRAY (47) | | | <: $jawa/lang/Object | Import the Jawa double array type, aka [D |
| REF_ARRAY (4C) | | `et: jawa ref type | <: $jawa/lang/Object | Import the Jawa array type which has `et as its element type, aka [`et |
| EXT_CLASS (48) | name: string | | <: `sc; <: `its | Import the external Jawa class named name (with assumed superclass `sc and interfaces `its) |
| EXT_INTERFACE (49) | name: string | | <: `its | Import the external Jawa interface named name (with assumed interfaces `its) |
| DECL_CLASS (4A) | name: string | `sc: jawa class type; `its: jawa interface type* | <: `sc; <: `its | Forward-declare and import a new class named name with the given superclass `sc and implemented interfaces `its |
| DECL_INTERFACE (4B) | name: string | `its: jawa interface type* | <: `its | Forward-declare and import a new interface named name with the given implemented interfaces `its |
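        As a rough illustration of how an import processor might read these field names, the Python sketch below maps the leading byte to a type import kind using the hex codes from the table. The layout of the immediates is an assumption made only for this example: the sketch treats all bytes after the first as the UTF-8 name, which the text above does not specify. decode_type_import is a hypothetical helper, not part of Jawa.

```python
# Hex codes taken from the type import table.
TYPE_IMPORT_KINDS = {
    0x40: "BYTE_ARRAY", 0x41: "BOOL_ARRAY", 0x42: "SHORT_ARRAY",
    0x43: "CHAR_ARRAY", 0x44: "INT_ARRAY", 0x45: "LONG_ARRAY",
    0x46: "FLOAT_ARRAY", 0x47: "DOUBLE_ARRAY", 0x48: "EXT_CLASS",
    0x49: "EXT_INTERFACE", 0x4A: "DECL_CLASS", 0x4B: "DECL_INTERFACE",
    0x4C: "REF_ARRAY",
}

# Kinds whose immediate is a name string.
NAMED_KINDS = {"EXT_CLASS", "EXT_INTERFACE", "DECL_CLASS", "DECL_INTERFACE"}


def decode_type_import(field_name: bytes):
    """Return (kind, name) for a type import field name; name is None if absent."""
    if not field_name:
        raise ValueError("empty import field name")
    kind = TYPE_IMPORT_KINDS.get(field_name[0])
    if kind is None:
        raise ValueError(f"unknown type import code: {field_name[0]:#04x}")
    rest = field_name[1:]
    if kind in NAMED_KINDS:
        # Assumption for this sketch: the rest of the field is the UTF-8 name.
        return kind, rest.decode("utf-8")
    if rest:
        raise ValueError(f"{kind} takes no immediates")
    return kind, None
```

        Superclass, interface, and element type arguments are not part of the name at all; they arrive as import parameters.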

        Jawa Global Imports.  Java class files can contain string constants and class constants. String constants in a class file are canonicalized so that equivalent string constants all refer to the same object. We meet this requirement in Jawa by adding the ability to import immutable global variables and encoding the characters of the string constant in the import name. Class constants are handled similarly.

| Global Import (hex code) | Imm | Import Parameters | Wasm Type | Description |
|---|---|---|---|---|
| CLASS_CONST (50) | name: string | | $jawa/lang/Class | Import a constant that refers to the class object for the given name |
| INTERFACE_CONST (51) | name: string | | $jawa/lang/Class | Import a constant that refers to the class object for the given name |
| STRING_CONST (52) | value: string | | $jawa/lang/String | Import a canonicalized string object representing the given string constant |
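        Canonicalization falls out naturally if the runtime keys string objects by the value carried in the import name. The Python sketch below shows the idea under that assumption; StringTable and JawaString are hypothetical names, not the Jawa runtime's own.

```python
class JawaString:
    """Stand-in for a Jawa string object (hypothetical)."""

    def __init__(self, chars: str):
        self.chars = chars


class StringTable:
    """Hands out one shared object per distinct string constant."""

    def __init__(self):
        self._by_value = {}

    def import_string_const(self, value: str) -> JawaString:
        # Equivalent STRING_CONST imports resolve to the same object.
        existing = self._by_value.get(value)
        if existing is None:
            existing = JawaString(value)
            self._by_value[value] = existing
        return existing


table = StringTable()
a = table.import_string_const("hello")
b = table.import_string_const("hello")
c = table.import_string_const("world")
```

        Here a and b are the same object, so a reference comparison such as ACMPEQ on the two constants would report equality, while c is distinct.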

        Jawa Function Imports. Jawa function imports are essentially all the “interesting” bytecodes from JVM class files that are left over after trivially translating primitive arithmetic, control flow, and exception handling. In fact, the mapping between JVM bytecodes and Jawa function imports is intentionally as one-to-one as possible. In much the same fashion as we encoded immediates to type imports into names, and in exactly the same fashion as how immediates are encoded into JVM bytecodes, we encode all of the semantic information about an operation directly into the import field name. The Function Signature supplied is a Wasm-level constraint on the signature of the function that should be returned by Jawa, i.e. the expected signature, and is no different than signatures for other non-Jawa imported functions. Just like type imports, many bytecodes are parameterized by a type, such as AALOAD, whose type argument is the type of the array. As before, the type argument must be explicitly supplied to the import because it cannot be encoded into the bytes of an immediate. Several bytecodes reference classes, fields, and methods. To support this, we encode the names of those classes, fields, or methods in basically the same way as is encoded in JVM bytecode, except there isn’t a central list of them like the JVM’s constant pool.

        For every function import, the Jawa VM must produce a callable function from the information in the Jawa VM’s class environment and the information contained in the import name and arguments. There are various strategies for implementing these functions which we will cover in a later section. All strategies end with the same result: each function import is bound to either a Wasm function or a host function that implements the execution of a single bytecode. There is no need for Jawa to reason about any other granularity of code except a single bytecode at a time. Thus, Jawa can be extremely simple. Of course, Jawa can and in some cases must provide a function customized to the immediates and type arguments of a given bytecode, but it never needs to reason about methods or control flow.

| Function Import (hex code) | Imm | Import Parameters | Function Signature | Description |
|---|---|---|---|---|
| AALOAD (10) | | `at: jawa refarray type | [`at i32] -> [`at.elem] | Null-checked, bounds-checked load from Jawa array object with reference element type |
| AASTORE (11) | | `at: jawa refarray type | [`at i32 `at.elem] -> [] | Null-checked, bounds-checked, and type-checked store to Jawa array object with reference element type |
| ACMPEQ (12) | | | [$jawa/lang/Object $jawa/lang/Object] -> [i32] | Compare Jawa objects for reference equality |
| ANEWARRAY (13) | | `at: jawa refarray type | [i32] -> [`at] | Allocate a new Jawa array object of reference element type |
| ARRAYLENGTH (14) | | `at: jawa array type | [`at] -> [i32] | Null-checked load of the length of a Jawa array object |
| ATHROW (15) | | | [$jawa/lang/Throwable] -> [] | Throw a Jawa exception object |
| BALOAD (16) | | | [$[B i32] -> [i32] | Null-checked, bounds-checked load from a Jawa byte array object |
| BASTORE (17) | | | [$[B i32 i32] -> [] | Null-checked, bounds-checked store to a Jawa byte array object |
| CALOAD (18) | | | [$[C i32] -> [i32] | Null-checked, bounds-checked load from a Jawa char array object |
| CASTORE (19) | | | [$[C i32 i32] -> [] | Null-checked, bounds-checked store to a Jawa char array object |
| CHECKCAST (1A) | | `t: jawa ref type | [$jawa/lang/Object] -> [`t] | Dynamic type-check of Jawa object |
| DALOAD (1B) | | | [$[D i32] -> [f64] | Null-checked, bounds-checked load from a Jawa double array object |
| DASTORE (1C) | | | [$[D i32 f64] -> [] | Null-checked, bounds-checked store to a Jawa double array object |
| DCMPG (1D) | | | [f64 f64] -> [i32] | Compare double values according to Jawa rules |
| DCMPL (1E) | | | [f64 f64] -> [i32] | Compare double values according to Jawa rules |
| DREM (1F) | | | [f64 f64] -> [f64] | Perform floating-point remainder |
| FALOAD (20) | | | [$[F i32] -> [f32] | Null-checked, bounds-checked load from a Jawa float array object |
| FASTORE (21) | | | [$[F i32 f32] -> [] | Null-checked, bounds-checked store to a Jawa float array object |
| FCMPG (22) | | | [f32 f32] -> [i32] | Compare float values according to Jawa rules |
| FCMPL (23) | | | [f32 f32] -> [i32] | Compare float values according to Jawa rules |
| FREM (24) | | | [f32 f32] -> [f32] | Perform floating-point remainder |
| GETFIELD (25) | name: string | `ct: jawa class type | [`ct] -> [`ct.fields[name].type] | Null-checked load from a Jawa object field |
| GETSTATIC (26) | name: string | `ct: jawa class type | [] -> [`ct.fields[name].type] | Load from a Jawa static field |
| IALOAD (27) | | | [$[I i32] -> [i32] | Null-checked, bounds-checked load from a Jawa int array object |
| IASTORE (28) | | | [$[I i32 i32] -> [] | Null-checked, bounds-checked store to a Jawa int array object |
| INSTANCEOF (29) | | `t: jawa ref type | [$jawa/lang/Object] -> [i32] | Dynamic type-query of Jawa object |
| INVOKEINTERFACE (2B) | name: string; sig: jawa sig | `it: jawa interface type | [`it sig.params] -> [sig.return] | Dynamic interface method dispatch and invocation including arguments |
| INVOKESPECIAL (2C) | name: string; sig: jawa sig | `ct: jawa class type | [`ct sig.params] -> [sig.return] | Invocation of instance method including receiver object and arguments |
| INVOKESTATIC (2D) | name: string; sig: jawa sig | `ct: jawa class type | [sig.params] -> [sig.return] | Invocation of static method including arguments |
| INVOKEVIRTUAL (2E) | name: string; sig: jawa sig | `ct: jawa class type | [`ct sig.params] -> [sig.return] | Dynamic virtual dispatch and invocation including arguments |
| LALOAD (2F) | | | [$[J i32] -> [i64] | Null-checked, bounds-checked load from a Jawa long array object |
| LASTORE (30) | | | [$[J i32 i64] -> [] | Null-checked, bounds-checked store to a Jawa long array object |
| MONITORENTER (31) | | | [$jawa/lang/Object] -> [] | Acquire Jawa object monitor |
| MONITOREXIT (32) | | | [$jawa/lang/Object] -> [] | Release Jawa object monitor |
| MULTIANEWARRAY (33) | dims: int | `at: jawa array type | [i32 x dims] -> [`at] | Allocate a new multi-dimensional Jawa array object |
| NEW (34) | | `ct: jawa class type | [] -> [`ct] | Allocate a new Jawa object of the given class type |
| NEWARRAY (35) | | `at: jawa primitive array type | [i32] -> [`at] | Allocate a new Jawa primitive array object |
| ISNULL (36) | | | [$jawa/lang/Object] -> [i32] | Compare a Jawa object reference against the null Jawa reference value |
| PUTFIELD (37) | name: string | `ct: jawa class type | [`ct `ct.fields[name].type] -> [] | Null-checked store to a Jawa object field |
| PUTSTATIC (38) | name: string | `ct: jawa class type | [`ct.fields[name].type] -> [] | Store to a Jawa static field |
| SALOAD (39) | | | [$[S i32] -> [i32] | Null-checked, bounds-checked load from a Jawa short array object |
| SASTORE (3A) | | | [$[S i32 i32] -> [] | Null-checked, bounds-checked store to a Jawa short array object |
| SYSCALL (3B) | | | | Various system calls that fall outside the Java bytecode set, such as I/O, which are normally done with native methods |
| ZALOAD (3C) | | | [$[Z i32] -> [i32] | Null-checked, bounds-checked load from a Jawa bool array object |
| ZASTORE (3D) | | | [$[Z i32 i32] -> [] | Null-checked, bounds-checked store to a Jawa bool array object |
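        The "one import, one bytecode" rule can be sketched as a table of binders, each of which returns a callable specialized to the import's immediates. The Python below is only an illustration under simplifying assumptions: objects are plain Python values, traps are a Python exception, and all names (bind_function_import, JawaTrap, JawaObject) are hypothetical. The hex codes for AALOAD and GETFIELD are those in the table.

```python
class JawaTrap(Exception):
    """Stand-in for a Jawa-level trap or exception (hypothetical)."""


class JawaObject:
    def __init__(self, fields):
        self.fields = dict(fields)


def bind_aaload(_imm):
    def aaload(array, index):
        if array is None:
            raise JawaTrap("NullPointerException")
        if not 0 <= index < len(array):
            raise JawaTrap("ArrayIndexOutOfBoundsException")
        return array[index]
    return aaload


def bind_getfield(field_name):
    # The returned function is customized to the field name immediate.
    def getfield(obj):
        if obj is None:
            raise JawaTrap("NullPointerException")
        return obj.fields[field_name]
    return getfield


BINDERS = {0x10: bind_aaload, 0x25: bind_getfield}


def bind_function_import(opcode, imm=None):
    """Produce the callable that implements exactly one bytecode."""
    return BINDERS[opcode](imm)


get_x = bind_function_import(0x25, "x")
aaload = bind_function_import(0x10)
```

        Nothing here inspects methods or control flow; each binder only needs the import's own immediates and arguments.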

        Jawa Command Imports. With type imports, global imports, and function imports, a Wasm module can do almost everything a Java program can, except that it cannot yet define new classes. For this we need a final mechanism that provides the body of a class to the Jawa system in order to complete a previously forward-declared class or interface. This operation is unique in two ways. First, its encoding can be quite large and complex, since a class has lists of members, each of which has a name and type or signature. Second, this is the only Jawa import kind that is parameterized by Wasm functions, i.e. it accepts functions from the Wasm module as arguments to the import itself. Those functions are the bodies of the methods for the new class[2].

        Note that this new kind of “command” import wasn’t strictly necessary. Technically we could accomplish this with a type import, but we chose to avoid the confusion of importing the same type twice in two different ways, first by forward declaration and second as a definition. With this design choice, a command import is a side-effecting-only operation and no type, global, or function is returned as a result of processing this import. A command import is effectively a command to the Jawa VM to complete a class or interface.

| Command Import (hex code) | Imm | Import Parameters | Description |
|---|---|---|---|
| DEF_CLASS (4D) | instanceFields: (string, jawa type)*; instanceMethods: (string, jawa sig)*; staticFields: (string, jawa type)*; staticMethods: (string, jawa sig)* | `ct: jawa class type; `jts: jawa type*; `mbs: wasm function* | Finish defining the previously forward-declared class `ct. The class’s fields and methods are encoded into the (quite large) immediate. The bodies of methods are provided as Wasm functions `mbs. Both fields and methods may refer to imported types in `jts. |
| DEF_INTERFACE (4E) | instanceMethods: (string, jawa sig)*; staticFields: (string, jawa type)*; staticMethods: (string, jawa sig)* | `it: jawa interface type; `jts: jawa type* | Finish defining the previously forward-declared interface `it. The interface’s methods are encoded into the (quite large) immediate. Both fields and methods may refer to imported types in `jts. |
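        The split between forward declaration (a type import) and definition (a command import) can be sketched as follows. This Python is a simplified illustration, not the Jawa runtime: JawaRuntime, decl_class, and def_class are hypothetical names, signatures are left as opaque strings, and method bodies are ordinary Python callables standing in for Wasm functions.

```python
class JawaClass:
    def __init__(self, name, superclass):
        self.name = name
        self.superclass = superclass
        self.instance_fields = None  # stays None until the class is defined
        self.methods = None


class JawaRuntime:
    def __init__(self):
        self.classes = {}

    def decl_class(self, name, superclass=None):
        """DECL_CLASS: create an incomplete class that can already be referenced."""
        if name in self.classes:
            raise ValueError(f"class already declared: {name}")
        cls = JawaClass(name, superclass)
        self.classes[name] = cls
        return cls

    def def_class(self, cls, instance_fields, method_sigs, bodies):
        """DEF_CLASS: side effect only; completes a forward-declared class."""
        if cls.instance_fields is not None:
            raise ValueError(f"class already defined: {cls.name}")
        if len(method_sigs) != len(bodies):
            raise ValueError("one body is required per declared method")
        cls.instance_fields = dict(instance_fields)
        cls.methods = {name: body for (name, _sig), body in zip(method_sigs, bodies)}


rt = JawaRuntime()
a = rt.decl_class("A")
b = rt.decl_class("B")
# A and B refer to each other, which works because both were declared first.
rt.def_class(a, [("other", b)], [("id", "()I")], [lambda self: 1])
rt.def_class(b, [("other", a)], [], [])
```

        Note that def_class returns nothing, mirroring the point that a command import yields no type, global, or function.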

Targeting Jawa with Offline Classfile Translation

        The JVM’s code format is the class file; each class file contains a binary encoding of a single Java class or interface declaration. We built a tool based on JWebAssembly[3] that loads Java class files and performs a nearly one-to-one translation to Wasm binaries. The result of running the tool on a class file is a Wasm binary module that encodes all JVM operations as Jawa imports.

Implementing the Jawa Virtual Machine

        Jawa requires the import meta-language extensions to Wasm described in previous sections, namely abstract type imports, import commands, and parameterized imports. While in theory these modifications could be made to any existing Wasm engine, including fast production engines, we instead found it easier to prototype them in a research engine. The Wizard Engine [link] is a new Wasm engine designed specifically for virtual machine research. Purposefully easy to extend and modify, it is designed from the start to support all of the Wasm MVP features as well as the multi-value [link], reference-types [link], bulk-memory [link], and tail-call [link] proposals. Wizard is written in a statically-typed language, Virgil, which itself can compile to the JVM, Wasm, and native x86 Linux or MacOS. Virgil’s portability allows the Wizard Engine to run on multiple host platforms[4].

Five Strategies.

        1. Host language emulation. Since Wasm imported functions can be implemented in the host environment, the Jawa VM can be implemented completely outside of the Wasm engine itself. In particular, since Wizard is implemented in Virgil, we can supply Virgil functions to the Wasm program that emulate the operation of Jawa bytecodes. In this approach, Jawa objects are represented as Virgil objects and small Virgil functions do the work of each bytecode. Such functions are parameterized, but do not need to be dynamically generated. For example, there is a single Virgil function that performs GETFIELD, which is parameterized by the field’s offset. We don’t need to generate code because the host code manipulates meta-values (i.e. boxed values), not values directly. For abstract type imports, there is no need to provide a reified type representation to the engine for execution. The Jawa VM only needs to create enough type information for the Wasm engine’s instantiation-time check that type imports match their declared constraints.

        2. Translation to instructions that use Wasm linear memory. Since a Wasm engine by definition has a Turing-complete, machine-level bytecode set and large, fast memory, the Jawa VM can dynamically generate new Wasm code that implements the semantics of Jawa bytecodes and represents objects as chunks within a (private) linear memory. With this approach, the Jawa VM makes all representation and lowering choices and produces low-level Wasm bytecode that can be treated as normal code by any Wasm engine, even ones with no GC support. Of course, this requires the Jawa VM to implement a garbage collector that operates in linear memory. As we will see later, finding stack roots in this scenario is more difficult, and we cannot use the typical shadow stack strategy the way offline translation can. We sketch a somewhat unappealing solution that utilizes an unsafe contract with the underlying Wasm engine but is able to find precise roots and implement a moving collector.

        3. Translation to a Wasm GC extension. In much the same way that we can translate Jawa bytecodes and objects into linear memory concepts, another strategy would be to dynamically translate Jawa bytecodes into lowered sequences of bytecodes from the Wasm GC extension. As a Wasm module builds up its Jawa classes and methods through the import mechanism, the Jawa VM creates new Wasm GC types that correspond to the structure of Jawa objects and produces new Wasm functions that have bytecodes that access those objects. The Wasm engine can then treat these dynamically generated types and functions as if they had been defined by the module itself. In this scenario the meta-language import mechanism serves as a packaging mechanism to defer lowering of Java until link time—i.e. it is exactly a late binding mechanism. We consider this implementation strategy to be the best long-term hope of achieving proper layering of virtual machines.

        4. Native support in the Wasm engine. Even though we have taken great pains to design an import meta-language mechanism that completely separates the Wasm engine from the details of any source language, there is no reason that a Wasm engine cannot inspect a module’s imports and attempt to understand them. In this strategy, the Wasm engine has built-in support for the Jawa import meta-language and by “deep inspection” of import names understands the Jawa classes, objects, fields, and all operations that are defined and used by the module. It can then simply skip any translation or emulation step and directly support Jawa operations in its type-checking algorithm and its execution tiers. In this scenario, Jawa can be thought of as simply a different class file format for the JVM. The advantage of this approach is that the Jawa VM has no requirement to lower Java into either Wasm concepts or host language concepts. The notoriously-difficult-to-type method dispatch sequence doesn’t need to be typed at all. In essence, the Jawa VM can do exactly what a native JVM would do and compile directly to the underlying machine. That removes any fundamental roadblocks to achieving exactly the same performance as a native JVM. However, of the strategies, this is the least well-layered, since it is completely built into the Wasm engine, and it is difficult to argue that anything is fundamentally simplified or encapsulated better than before.

        5. Pass-through emulation when running the Wasm engine on top of a JVM. The last implementation strategy is to make use of an underlying virtual machine that just so happens to implement the exact language that we are attempting to emulate. In the case of the Wizard Engine, since it can be compiled to and run on an actual (native) JVM, it would be possible to dynamically translate Jawa concepts into JVM concepts for the underlying native JVM. This can be done by way of Java reflection and use of the JVM’s inherent dynamic code generation facility, dynamic class loading. Of course, this strategy only works in the narrowest of circumstances, when the Wasm engine itself is hosted on a JVM (which raises questions about the point of the whole endeavor—like why are we doing this at all!?). Nevertheless, we could expect that in terms of object representations and dispatch sequences, such dynamically-generated classes would be implemented as efficiently as if they had been loaded directly into the underlying JVM from their original class file format. Note carefully that to achieve complete pass-through emulation, the Jawa VM would need to dynamically translate Wasm bytecode to Java bytecode, since the bodies of Jawa methods are expressed as Wasm. This of course would require access to the meta-representation of Wasm methods (i.e. their bytecode) to perform a reverse translation. It’s exactly the inverse problem of the other translation strategies. We could perhaps combine this strategy with strategy 4 (native Jawa support in the Wasm engine) and avoid the meta-representation problem, but it’s not clear that any of this is a particularly productive exercise.

PROPOSED WEBASSEMBLY EXTENSIONS

        In order to support the WebAssembly-as-MetaLanguage concept, we propose the following extensions:

  • Import/export kind 0x06: Abstract type

        A new import kind allows a module to import a type from another module (or language, via an import processor) without depending on its representation or any implementation details. In Jawa, this import kind is used for the Jawa class, interface, and array types. A module can also export any value type, including ones it has defined or imported, as an abstract type. An abstract type import is like other imports in having a name, but it also has:

      • flags: a flags byte that encodes boolean properties and allows for binary format expansion
          • externref: whether this type is an externref type
          • has-default-value: whether this type has a default value
      • constraints: a vector of constraints on the type, including
          • subtype constraint: i.e. this abstract type is assignable to the given value type

  • Import kind 0x07: Command

        A side-effecting operation, an import command encodes its semantic meaning entirely in the import module and field name. It has no additional bytes following it. In Jawa, this is used to finish class and interface definitions.

  • Import kind 0x08: Arguments to next import

        This import kind encodes a list of anonymous exports to the next import following it. The anonymous exports are encoded just as normal exports (i.e. an import/export kind code with an appropriate index), but with no name. This allows a module to supply one of its internal declarations as an argument to another import. In Jawa, this is used to supply abstract type arguments and function arguments to the import parameters listed in the tables, e.g. bytecodes and class definition commands. This import kind is a vector of anonymous exports, i.e. it has:

      • count: a count of the number of arguments to follow
      • exports: export kind + index pairs of the same encoding as normal Wasm exports

  • Value type 0x5A[index]: Abstract type

        This value type allows a module to refer to an imported abstract type anywhere a value type can occur in a module. That can be in a signature, a table, a struct or array definition, or as a local variable. The index is into a new index space of abstract types: i.e. it is separate from the type definition index space (signatures, structs, and arrays). To appear as a local variable declaration in a function, the abstract type must be imported with the has-default-value flag. The flag also requires an import processor to provide this default value when binding an abstract type.
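        To illustrate the "arguments to next import" payload (import kind 0x08), the Python sketch below encodes and decodes a vector of anonymous exports. It assumes the count and each index are unsigned LEB128 integers, as in standard Wasm vectors and export descriptors; the text above does not fix this detail for the count. The function names are hypothetical, and kind 0x06 is the abstract type kind proposed above.

```python
KIND_FUNC = 0x00           # standard Wasm export kind for functions
KIND_ABSTRACT_TYPE = 0x06  # abstract type kind proposed in this paper


def encode_u32_leb128(n: int) -> bytes:
    if not 0 <= n < 2 ** 32:
        raise ValueError("value out of u32 range")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_u32_leb128(data: bytes, pos: int):
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_import_args(args) -> bytes:
    """args: list of (export kind, index) pairs for the next import."""
    out = bytearray(encode_u32_leb128(len(args)))
    for kind, index in args:
        out.append(kind)
        out += encode_u32_leb128(index)
    return bytes(out)


def decode_import_args(data: bytes):
    count, pos = decode_u32_leb128(data, 0)
    args = []
    for _ in range(count):
        kind = data[pos]
        index, pos = decode_u32_leb128(data, pos + 1)
        args.append((kind, index))
    return args


# One abstract type argument (index 3) and one function argument (index 300).
payload = encode_import_args([(KIND_ABSTRACT_TYPE, 3), (KIND_FUNC, 300)])
```

        A DEF_CLASS command would be preceded by such a payload listing the class type, the referenced types, and the method body functions.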

IMPLEMENTATION DISCUSSION

Host Language Emulation

We chose to build the first working prototype of Jawa on top of the Wizard Engine, a new WebAssembly engine written in a multi-paradigm, safe, garbage-collected language, Virgil. This allowed us rapid iteration and did not require the engine to support the GC proposal, only the reference-types proposal. The Wizard Engine is designed first for simplicity rather than speed, so we were concerned only with avoiding obvious, egregious performance bottlenecks. Wizard so far has only an in-place, semi-fast interpreter as its main execution tier. However, the host emulation technique for Jawa is completely independent of the execution tiers of the engine, since all stubs become function calls.

        In this implementation, all Jawa function imports become host functions. Host functions are implemented in Virgil and receive their arguments as boxed Wasm values. They must dynamically cast those boxed Wasm values to host objects (i.e. objects implemented in Virgil). Those objects represent Jawa objects and have an array of Wasm values to represent class fields. Arrays are more efficiently packed; primitive arrays are special-cased and are thus implemented by Virgil primitive arrays. Because Virgil is garbage-collected and the entire Wizard engine is written in pure Virgil (Wasm ref values are boxed into a Virgil ADT), it is not necessary to write a garbage collector for Jawa. Instead, we simply rely on Virgil’s precise garbage collector.

        The Jawa runtime must build up internal representations of Jawa classes, including the names and types of fields, methods, interfaces, etc. This representation is fully in the host language, and allocates nothing in the “Wasm” memory spaces or as Wasm values. These data structures are used to directly implement casts and vtables. User wasm code that uses Jawa only manipulates Jawa objects and arrays; metadata is only indirectly referred to.

        Jawa types are represented in the Jawa runtime as Virgil data structures. To typecheck the user-level code, the Wizard engine allows type imports to be bound to host types, a generalization of externref and an extensibility dimension known only to and agreed upon between the Wizard engine and import processors like Jawa. Though Wasm modules using Jawa are typechecked against the abstract type imports, the Wizard engine still checks the bindings of type imports against their declared constraints. Technically, the Jawa runtime could map all Jawa types onto externref, which trivially satisfies type constraints. We found that using host types helped catch bugs in our implementation, which would manifest (because Virgil is safe) as failing Virgil-level casts when host functions were called with the wrong Wasm values. In another engine, this could result in memory safety violations and thus vulnerabilities.

        The host language emulation does not need to make use of runtime (Virgil) code generation. Instead, generic implementations are used, and they are typically closed over specific metadata values. For example, the evaluation of the ANEWARRAY bytecode is implemented as a Virgil function, and another Virgil function wraps that evaluation function with a signature into a HostFunction for the Wasm engine.

        // Creates a new HostFunction that allocates a host-array of type {at}
        def ANEWARRAY(at: JawaArrayType) -> HostFunction {
                var sig = SigDecl.new(SigCache.arr_i, [ValueType.Host(at)]);
                return HostFunction1.new("ANEWARRAY", sig, eval_ANEWARRAY(at, _));
        }

        // The host function that is actually called by the program.
        def eval_ANEWARRAY(at: JawaArrayType, arg: Value) -> HostResult {
                var length = Values.v_i(arg);
                if (length < 0) return JawaTraps.NegativeArraySizeException;
                var elems = Array<JawaObject>.new(length);
                return HostResult.Value1(Value.Ref(JawaRefArrayObject.new(at, elems)));
        }

Here, JawaArrayType is the Jawa runtime’s host type for Jawa arrays, JawaObject is the Jawa runtime’s implementation of an object, and Value is the Wizard engine’s representation of boxed Wasm values. Array is the built-in Virgil array type, i.e. a host language concept not from Wizard or Jawa.

Translation to Wasm GC Proposal

        We also implemented Jawa by way of translation to the Wasm GC proposal. In this implementation, the Jawa runtime processes all Jawa imports and materializes new Wasm types, functions, and globals that implement all of the imports. Thus this represents the most likely production scenario for a layered VM on top of a Wasm engine.

        We discovered that no matter how we define Jawa imports for defining classes, Jawa types are inherently mutually recursive, and thus require more than one processing pass before being able to materialize struct and array types for the Wasm engine. This is in contrast to the host language emulation, which creates incomplete, nominal inheritance types that can be returned through the import processor API to the engine before they are completely defined. Since importing Jawa classes first requires a forward declaration (i.e. all definitions are topologically sorted), the host language emulation can process all types and class definitions in one pass and provide both host types and host functions. For the GC proposal, we required a prepass to first gather class definitions in order to create struct types that are supplied in the final pass.
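        The need for a prepass can be seen in a small Python sketch. It assumes a simplified model in which each class becomes one struct type and a field of class type becomes a reference to that struct's index; materialize_struct_types and the ("ref", index) encoding are hypothetical and much simpler than the actual translation.

```python
PRIMITIVES = {"i32", "i64", "f32", "f64"}


def materialize_struct_types(class_defs):
    """Turn {class name: [(field name, field type name)]} into indexed struct types."""
    # Prepass: gather every class definition and reserve a struct index for it.
    index_of = {name: i for i, name in enumerate(class_defs)}

    def lower(type_name):
        if type_name in PRIMITIVES:
            return type_name
        if type_name in index_of:
            return ("ref", index_of[type_name])
        raise KeyError(f"undeclared type: {type_name}")

    # Final pass: fields may now refer to any class, including later ones.
    return [[(fname, lower(ftype)) for fname, ftype in fields]
            for fields in class_defs.values()]


defs = {
    "A": [("b", "B"), ("n", "i32")],  # A refers forward to B
    "B": [("a", "A")],                # B refers back to A
}
structs = materialize_struct_types(defs)
```

        A single pass would fail on the field A.b, because no struct index for B exists when A is processed.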

        The amount of bookkeeping necessary for translation to the GC proposal, i.e. the implementation effort in terms of Virgil code, is much more than in the host language emulation. In general, everything is a bit more complex, since instead of writing our representation of types, or code which evaluates a stub, we are writing code that generates the representation of a type or generates the code that implements a stub. Nevertheless, the GC translation will allow the engine to manipulate only Wasm code in the future, and inline generated runtime stubs.

For comparison, the same ANEWARRAY bytecode in GC translation:

        def ANEWARRAY(at: JawaArrayType) -> Function {
                var gcrep = makeArrayRep(at);
                makeArrayMetaObject(at, gcrep);
                start0()
                        .global_get(gcrep.metaObjectGlobal.global_index)
                        .local_get(0)
                        .global_get(gcrep.elemsRttGlobal.global_index)
                        .array_new_default_with_rtt(gcrep.array)
                        .global_get(gcrep.rttGlobal.global_index)
                        .struct_new_with_rtt(gcrep.struct);
                return makeFuncP(SigCache.arr_i, [gcrep.valueType]);
        }

Notice here that this translation is making use of an assembler which is dynamically creating new Wasm bytecode to create a new Wasm function. Also note that in this translation the RTT necessary to tag a Wasm struct object so that it can be dynamically cast is being added to the object by loading it from an internal global variable. Thus the RTTs for arrays, objects, interfaces, etc, only appear in the stubs, and user modules never see them.

        Where do these generated Wasm functions and generated Wasm types live? After all, the GC proposal bytecodes are encoded in such a way that they refer to their types and globals by index. Our translation of Jawa makes use of a Jawa comodule, i.e. a Wasm module that contains globals, functions, and type declarations that are incrementally added by translating the Jawa imports, including class definitions, imported bytecodes, etc. Incrementally building this Jawa comodule requires the Jawa runtime to have intimate knowledge of the Wizard Engine’s implementation details. In the future, this interaction needs to be specified in a more modular way, e.g. by an API that allows a runtime to dynamically JIT new globals, functions, and type declarations into a module. This is trickier than it sounds, because import arguments mean that a module and its comodule are actually dependent on each other.

TRICKY BITS

Here we list some of the more difficult, surprising, or tricky parts to implementing Jawa with the GC proposal.

  • Interfaces. INVOKEINTERFACE, CHECKCAST, and INSTANCEOF are implemented with a linked list of interface dispatch tables hanging off the metaobject (class object). Each of these operations searches this linked list, using RTT to check whether one of the links in the chain matches. This list must be duplicated per subclass, so it is not as space efficient as HotSpot. These operations are 13, 24, and 22 bytecodes, respectively, when expressed as a Wasm stub.
            
  • Covariant reference arrays. This is fairly inefficient. First, without any type of variance, Jawa must basically dynamically check both loads and stores to arrays. Second, to fit the other header fields into the object, the array is an indirection away from the object.
            
  • Null type. Null being a bottom type in Java, it is not possible to just use (an unannotated) ref.null wherever another type is expected. Solutions are to either allow a hack (ref.null externref) or to explicitly import a Jawa null type that is a subtype of all reference types (requires engine hack to make subtype test cheaper).

  • Null casts. In the GC proposal lowering strategy, casts trap on null, so avoiding the trap requires additional, otherwise redundant branches.

  • Mutually recursive types. Java classes can refer to each other through fields, so it requires two passes on Jawa imports to first resolve class definitions before GC proposal structural types can be created. Trickiest part of the whole exercise.
            
  • Primitive types in the source language. Encoding the difference between Java-level primitives and Java reference types into the import system.
            
  • UTF-8 bottleneck. Required creative escaping of integers.
            
  • Indirections in the object model. Especially with Wizard/Virgil overhead. Realistically, to compete with the JVM, we need to be able to smash together header fields/metaobjects.
            
  • Extra header fields. Java requires header space for:
      • monitor lock/owning thread
      • identity hash code
      • vtable/metaobject
      • array length

In addition, the Wizard engine needs to add a header to store the RTT for a struct, and Virgil adds a header for its own purposes. To be competitive with the JVM, we need to find a way to combine all these headers across the stack.
        

  • Catching traps. The Jawa runtime needs to intervene when traps (null dereference, array index out of bounds, divide by zero) occur, and instead throw exceptions, which requires an additional engine API.
  • The Java platform is huge. Not surprising, but the JDK has thousands of classes and the language is intertwined with the class library. In order to make progress, the Jawa runtime defines a few primitive language classes and native methods but mostly punts on bringing the entire JDK along, which would be a huge amount of work.
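The interface dispatch scheme from the first item in this list can be modeled in a few lines. This is a sketch of the search only, not the Wasm stubs themselves: the class and function names are hypothetical, string comparison stands in for the RTT check, Python's TypeError stands in for the Java-level exceptions, and null receivers are ignored.

```python
class InterfaceTable:
    """One link in the chain: the dispatch table for a single interface."""

    def __init__(self, interface, methods, next_table=None):
        self.interface = interface    # stands in for the RTT being compared
        self.methods = methods        # method name -> implementation
        self.next_table = next_table  # next link, or None at the end


class MetaObject:
    """Class object; the chain of interface tables hangs off it."""

    def __init__(self, name, itables=None):
        self.name = name
        self.itables = itables        # head of the chain, or None


def find_itable(meta, interface):
    """Linear search of the chain; returns the matching link or None."""
    link = meta.itables
    while link is not None:
        if link.interface == interface:  # models the RTT match
            return link
        link = link.next_table
    return None


def instance_of(meta, interface):
    return find_itable(meta, interface) is not None


def check_cast(meta, interface):
    if find_itable(meta, interface) is None:
        raise TypeError(f"{meta.name} does not implement {interface}")


def invoke_interface(meta, interface, method, receiver, *args):
    link = find_itable(meta, interface)
    if link is None:
        raise TypeError(f"{meta.name} does not implement {interface}")
    return link.methods[method](receiver, *args)


# A class "Task" implementing two interfaces, chained Comparable -> Runnable.
runnable = InterfaceTable("Runnable", {"run": lambda self: "ran " + self})
comparable = InterfaceTable(
    "Comparable",
    {"compareTo": lambda self, other: (self > other) - (self < other)},
    next_table=runnable,
)
task = MetaObject("Task", comparable)
```

All three operations share the same linear search, so their cost grows with the number of interfaces a class implements. In this model a subclass that overrides an interface method would need its own copy of the chain, which corresponds to the per-subclass duplication noted above.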

[1] https://dl.acm.org/doi/10.1145/3276521

[2] Technically we could have encoded the entire body of a function, including its code, into the import. We could have even encoded JVM bytecode there! But by allowing Wasm functions to be combined together into Jawa classes, we don’t need another bytecode format or another way to declare functions just for Jawa; we just reuse Wasm functions, which the Wasm engine has already typechecked and will compile and make fast!

[3] https://github.com/i-net-software/JWebAssembly

[4] Interestingly, because Wizard provides a platform that it can also be compiled to, it can self-host, executing a copy of itself as a user Wasm program. Of course, that copy can in turn execute a copy of itself as a user program, and so on, as many levels deep as there is available memory.