After creating Wake, which compiled ahead-of-time (AOT) to JS, I wanted to go from AOT to JIT: from the "assembly" of the web to, well, real assembly!
I imagined a sort of ncurses-as-a-language TUI (Text User Interface) framework, and that gave me an excuse to create a new language. I think it will work better as a language than as a library, for reasons detailed in my original blog post about why I wanted to make it. This is not entirely crazy, as Elm sort of took the same approach. The TUI part is still completely untouched, because it's really something outside of my expertise. Still, it's possible that a good, tailored JIT like the one I have started with Vaiven will make such a TUI language possible.

I was also a few months into newly joining the Dart team. Dart was created by the minds behind V8, and as a perk of my job I was able to ask some brilliant VM engineers questions. I wanted to understand the roots of my project language as well as I could.

The current state of Vaiven has some gaping holes: it has no object literal syntax, no load/store optimization, and no stack traces, and the GC is a pile of hacks, for reasons I wish I had space to fit into this post. Nevertheless, what I was able to do taught me a lot about JITs in general: how to best use them, when to trust or distrust them, and how a JIT affects language design. And what I didn't do has arguably taught me even more! I simply needed to write up my thoughts about it.

Vaiven has a very keyword-based syntax, so defining a function looks like this:
```
fn foo is 0 end
```

Typing that into the debug version of the JIT gives you a bunch of assembly output, shown below. At the moment, functions are eagerly compiled at the first optimization level (which is almost no optimization at all). This could become lazy in the future.
```
push rbp
mov rbp, rsp
mov eax, dword [+94810657183568] ; mov v0, dword [+94810657183568]
add eax, 1 ; add v0, 1
mov dword [+94810657183568], eax ; mov dword [+94810657183568], v0
cmp eax, 10 ; cmp v0, 10
je L2 ; je L2
L10:
mov rax, 281474976710656 ; mov v3, 281474976710656
jmp L1
L2:
mov rdi, 140730010364832 ; mov v4, 140730010364832
mov rsi, 94810657183664 ; mov v5, 94810657183664
call 94810634424347
call rax
L1:
mov rsp, rbp
pop rbp
ret
```

Most of this code is going crazy with giant integers; these are magical addresses specific to the compilation of this function. Each function gets, in addition to assembly that can run it, an object that tracks the data types used and waits for the function to get "hot" to optimize it.
The first value, 94810657183568, is the memory address of the call counter. You can see we get the value at that address in dword form (a 32-bit int). We add 1 to it to track the current function call, and we then store that 32-bit result back in memory. Next comes cmp eax, 10: on the 10th invocation, we je (jump on equal) to L2, which has more fancy big numbers! These are the addresses of the source code and the profiling-info object of that function. We call the optimizer function by address, with those two pointers as arguments. Lastly, we call rax to execute the newly optimized assembly that we just produced!
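Stripped of the magic numbers, the hot-path mechanism is simple. Here is a minimal C++ sketch of what this prologue does; every name in it is invented for illustration, not taken from Vaiven's actual source:

```cpp
#include <cstdint>

// Sketch of the hot-counter check above. In reality this is emitted
// machine code; the names and types here are invented for illustration.
using CompiledCode = uint64_t (*)();

struct FunctionInfo {
  int32_t callCount = 0;           // the dword at the first magic address
  const void* source = nullptr;    // what the optimizer recompiles from
  void* typeProfile = nullptr;     // the profiling-info object
};

uint64_t runUnoptimizedBody() { return 0; }  // stand-in for the L10 block
CompiledCode optimize(const void*, void*) {  // stand-in for the optimizer call
  return [] { return uint64_t(0); };
}

uint64_t enter(FunctionInfo& fn) {
  if (++fn.callCount == 10) {                // cmp eax, 10 / je L2
    CompiledCode hot = optimize(fn.source, fn.typeProfile);
    return hot();                            // call rax
  }
  return runUnoptimizedBody();               // fall through to L10
}
```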
In the "cold" case, we do not je (jump on equal), so we execute the L10 block. There we load a special 64-bit value, which represents the number 0, into the return register, and then jump to L1, the exit of the function (it performs the common stack cleanup).
That 0 is represented by a special 64-bit boxed value, using a method described in this paper. This allows us to store and differentiate 64-bit pointers, 64-bit floats, booleans, null, 32-bit integers, and more, all inside a single 64-bit register.
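As a concrete (if simplified) illustration of that idea, here is a tagging scheme in the same spirit. The bit layout below is my assumption, chosen to match the constants visible in the listings, not necessarily Vaiven's exact encoding:

```cpp
#include <cassert>
#include <cstdint>

// Simplified value tagging: payloads live in the low bits, and a type tag
// lives in high bits that x86-64 pointers never use. Tag assignments here
// are assumptions for illustration.
constexpr uint64_t kIntTag = 1ULL << 48;  // boxed Int 0 == 281474976710656

uint64_t boxInt(int32_t i)     { return kIntTag | uint32_t(i); }
int32_t  unboxInt(uint64_t v)  { return int32_t(uint32_t(v)); }
bool     isInt(uint64_t v)     { return (v >> 48) == 1; }
bool     isPointer(uint64_t v) { return v < (1ULL << 48); }  // untagged

int main() {
  assert(boxInt(0) == 281474976710656ULL);  // the constant in the listing
  assert(isInt(boxInt(42)) && unboxInt(boxInt(42)) == 42);
  return 0;
}
```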
We can call this function from the interpreter by using the REPL:

```
foo()
Int: 0
```

And after doing so 10 times, we trigger the optimizer:
```
var x = 0
Int: 0
for x < 10 do foo(); x += 1; end

L0:
push rbp
mov rbp, rsp
L3:
mov rax, 281474976710656 ; mov v0, 281474976710656
jmp L1
L2:
L1:
mov rsp, rbp
pop rbp
ret
```

You can see that the assembly we generate is not perfect; the jmp L1 is useless, but the code isn't aware of that, because it is trying to avoid executing the (empty) L2 block. Otherwise, all we do is save and restore the base pointer (for GC purposes; in this case it could be optimized out by a smarter JIT) and move our special 0 value into the return register.
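In other words, the optimizer has folded foo down to a single constant load. As a C++ sketch:

```cpp
#include <cstdint>

// What the optimized foo() amounts to: return the boxed Int 0.
uint64_t fooOptimized() { return 281474976710656ULL; }
```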
If we then define a slightly bigger function:

```
fn bar of x is
  if x == 0
    print(foo())
  end
end
```

we get a much more significant amount of asm on the first pass:
```
L0:
push rbp
mov rbp, rsp
push rbx
sub rsp, 8
mov eax, dword [+94775382205392] ; mov v1, dword [+94775382205392]
add eax, 1 ; add v1, 1
mov dword [+94775382205392], eax ; mov dword [+94775382205392], v1
cmp eax, 10 ; cmp v1, 10
je L2 ; je L2
mov rax, 140737488355327 ; mov v2, 140737488355327
cmp rdi, rax ; cmp v0, v2
jae L4 ; jae L4
mov rcx, [rdi] ; mov v3, [v0]
shl rcx, 4 ; shl v3, 4
jmp L3 ; jmp L3
L4:
mov rax, 2251799813685247 ; mov v2, 2251799813685247
mov rcx, 8 ; mov v3, 8
cmp rdi, rax ; cmp v0, v2
jae L3 ; jae L3
mov rcx, rdi ; mov v3, v0
shr rcx, 48 ; shr v3, 48
L3:
mov ax, word [+94775382205396] ; mov v2, word [+94775382205396]
or ax, cx ; or v2, v3
mov word [+94775382205396], ax ; mov word [+94775382205396], v2
L12:
L15:
mov rsi, 281474976710656 ; mov v4, 281474976710656
mov qword [rsp], rdi ; [Save]
call 94775360292932
test eax, eax ; test v5, v5
jz L14 ; jz L14
L13:
mov rbx, 94775382202000 ; mov v11, 94775382202000
call [rbx]
mov rdi, rax ; [Move]
call 94775360285928
L14:
mov rax, 562949953421312 ; mov v9, 562949953421312
jmp L1
L2:
mov qword [rsp], rdi ; [Spill]
mov rdi, 140730134516032 ; mov v12, 140730134516032
mov rsi, 94775382205680 ; mov v13, 94775382205680
call 94775360237595
mov rdi, qword [rsp] ; [Alloc]
call rax
L1:
lea rsp, [rbp-8]
pop rbx
pop rbp
ret
```

I won't break all of this down, but I will say that this code tracks bits for the data type of x in its hardcoded metadata, it calls a function to compare x to 0 (because comparison is special-cased for Strings), and it looks up foo()'s current compiled address to invoke it. The rest is the same.
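That type tracking boils down to OR-ing one bit per observed type into a small word of per-argument metadata (the or ax, cx above). A sketch, with invented names and bit assignments:

```cpp
#include <cstdint>

// Sketch of the per-argument type profile updated by the listing above.
// The specific bit assignments are invented; only the shape matters.
enum TypeBit : uint16_t {
  kSawInt    = 1 << 3,
  kSawDouble = 1 << 4,
  kSawString = 1 << 5,
  // ...one bit per runtime type the argument has been seen with
};

struct ArgProfile {
  uint16_t seenTypes = 0;  // the 16-bit word at the second magic address
};

void recordType(ArgProfile& profile, uint16_t typeBit) {
  profile.seenTypes |= typeBit;  // cheap enough to run on every call
}
```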
Call this function 10 times, passing an int each time, and you get the following:

```
for x < 10 do bar(x) ; x += 1 end

L0:
push rbp
mov rbp, rsp
sub rsp, 16
L3:
mov rax, rdi ; mov v5, v0
shr rax, 32 ; shr v5, 32
cmp rax, 65536 ; cmp v5, 65536
jne L2 ; jne L2
L13:
cmp edi, 0 ; cmp v0, 0
jne L14 ; jne L5
L4:
mov qword [rsp], rdi ; [Spill]
mov rdi, 281474976710656 ; mov v2, 281474976710656
call 93978509257960
L5:
mov rax, 562949953421312 ; mov v4, 562949953421312
jmp L1
L2:
call 140676951605440
L1:
mov rsp, rbp
pop rbp
ret
L14:
mov qword [rsp], rdi ; [Spill]
short jmp L5
```

Obviously this code is shorter, but why?
First, it checks that the value passed in for x is actually an integer, like the ones it has seen before. If it isn't, it jumps to L2 to call the deoptimized code, which can handle that case. Otherwise, it continues uninterrupted to L13. That block then knows (thanks to the earlier invariant check) that it can simply and safely run a 32-bit comparison against 0, and jumps if not equal (jne) to the function exit. And when x does equal 0, it continues to L4, which is the code for print(foo()). Note that the call to foo() is inlined (foo always returns 0, remember?), and the special tagged 0 value is passed directly into the print() function, saving call overhead. Both cases (x == 0 or not) load the return register with the special value 562949953421312 before exiting.
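Rendered as C++-like pseudocode (invented names again; the helpers are stand-ins, and the tag check mirrors the shr/cmp pair above), the optimized bar amounts to this:

```cpp
#include <cstdint>

// What the optimized bar() boils down to; all names here are invented.
constexpr uint64_t kBoxedIntZero  = 281474976710656ULL;  // tagged Int 0
constexpr uint64_t kSpecialReturn = 562949953421312ULL;  // the exit value above

uint64_t barUnoptimized(uint64_t) { return kSpecialReturn; }  // stand-in for L2
void print(uint64_t) {}                                       // stand-in for print()

uint64_t barOptimized(uint64_t x) {
  if ((x >> 32) != 65536)        // shr rax, 32 / cmp rax, 65536: int tag check
    return barUnoptimized(x);    // jne L2: bail to code that handles any type
  if (int32_t(uint32_t(x)) == 0) // cmp edi, 0: safe, x is known to be an int
    print(kBoxedIntZero);        // foo() inlined to its constant, tagged 0
  return kSpecialReturn;         // both paths exit with the same special value
}
```

The guard is a single shift and compare, which is why the optimized version can afford to run it on every call.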
Lastly, if we run a simple fib.js comparison benchmark, Vaiven currently beats V8 by a hair and loses to Dart by a step:
```
[fibvvn] 12.676s
[fibjs] 13.400s
[fibdart] 10.956s
:: fibvvn 0.86X faster than fibdart (10.956 / 12.676)
:: fibvvn 1.06X faster than fibjs (13.400 / 12.676)
```

You should not be impressed by Vaiven beating V8 here, as it is over 5x slower in my other benchmark (lcf). Still, it shows that there are certain, very limited places where the Vaiven VM can act like a real JIT.
Before creating Vaiven, I thought of JITs like V8 as optimization engines. And of course that's true, but there's a missing piece to that statement. JITs are actually phenomenal deoptimization engines. One thing that I have not yet implemented in Vaiven is the ability to profile types across call sites. That means that this:
```
fn addOne of x is x + 1 end
```

can be optimized into native machine addition, and this:
```
fn addOneToResult is getNumber() + 1 end
```

cannot. Why? Because a JIT VM has to go through a very crazy dance when getNumber() does the wrong thing and doesn't return an integer like it did before. The optimized code has some set of registers that mean certain things at different times during the function's execution, and it uses those values efficiently. The unoptimized code does too, but its registers and stack values mean completely different things. When getNumber() returns a string, you are in the middle of one piece of assembly code and need to hop across that barrier into a different piece of assembly code, dynamically fixing up your registers into the right state in order to do it. This is a slow, expensive process, and one that I couldn't implement because my assembler's register allocator simply doesn't track enough data (that aside, my assembler, asmjit, was extremely nice to use).
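Conceptually, supporting that hop means recording, at every point where a guard can fail, where each live value sits in the optimized frame and where the unoptimized code expects it. A sketch of that bookkeeping, with invented names:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Invented sketch of deoptimization metadata. At each guard that can fail,
// the JIT records how to rebuild the unoptimized frame from the optimized
// one; the deoptimizer replays these moves before resuming execution.
struct MachineLocation {
  enum Kind { Register, StackSlot } kind;
  int index;  // register number, or stack offset in slots
};

struct DeoptPoint {
  uint32_t resumeOffset;  // where to continue in the unoptimized code
  // For each live value: its home in the optimized frame, and the home the
  // unoptimized code expects it in.
  std::vector<std::pair<MachineLocation, MachineLocation>> moves;
};
```

Building that table requires the register allocator to report where every live value lives at each guard, which is exactly the data mine didn't track.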
One practical consequence of trusting the JIT: you should write this code:

```
var x = 2 * PI;
```

and not this code:

```
var x = 6.283... // 2 * pi
```

The compiler can do this for you, even in a JIT.

The biggest example of this lesson is that Vaiven can beat V8 on certain benchmarks. Does this make Vaiven a faster language than JS? Absolutely not. It's faster because I picked the benchmark, and I picked it because I knew Vaiven would be able to run it quickly.
Different versions of V8 will have different bugs, and any optimization pass can get confused by simple things. This is especially true in a JIT, where optimization passes need to run quickly.

Related to lesson two, you never really know what you're testing. You may think that a loop full of multiplications is benchmarking multiplication, but the JIT may well fold or hoist those multiplications away, leaving you measuring little more than loop overhead.
Switching to AOT compilation can in fact make patterns that are common in dynamic programming suddenly very slow. See this blog post about what Ruby developers need to adapt to if they want to get the high performance promised by Crystal.

Sure, there are things that can be "poison pills" for performance, even in JITs. For example, another reason why my fib benchmark beats V8 is that V8 has to continually check whether the fib function was redefined, as a deoptimization check. I don't allow that in Vaiven, so I can produce faster code. Dart, for instance, was designed to have fewer of these poison pills.

There will always be applications for bare-metal programming, and C/C++/Rust/ASM will always have a niche to fill for those cases. Specifically, the ability to optimize data structures themselves, and to offer the programmer ways to improve CPU cache performance by increasing memory locality, are the types of things that may never be optimizable by a JIT. I hope that one day JITs get us 99% of the way there, by improving our JITs or the languages we're JITting. We'll see, of course, as both of those things evolve over time.

Making a JIT turned out to be much more similar to writing a C compiler than I had thought it would be. The amount of code dedicated to normal SSA optimization is much greater than the amount of code dedicated to logging type info, communicating between compilation strategies/phases, and handling code changing out from under you. It was absolutely a fun experience, and I am hoping to continue making Vaiven a useful JIT so I can plug it into other things, or just leave it on a shelf in a pretty frame.

I encourage all readers, whether or not they use JITed languages, to trust them, even when it seems impossible that they could really be that good. They really are (or can be). Throw out your micro-optimizations, and choose a language that will make your code easy to work on and maintain. Focus on algorithmic performance instead of line counts. In the end, this is just another blog post backing the age-old adage, coined by Donald Knuth, that premature optimization is the root of all evil.
Editing Credits: Dulce Flores Cruz, Wm Leler