[Note: please be advised this transcript contains strong language]
I'm Peter Alvaro. I'm an amateur language designer and I'm also a professor of computer science at UC, Santa Cruz. This talk is called "Three Things I Wish I Knew When I Started Designing Languages." Before I get into the talk, I want to go through an obligatory slide or two about me. This is the picture of myself that I like and that I like to give to the world. That's me, that's your boy. He's a sort of quixotic, cocksure, paladin, trying to change the way that systems programming people think about how they do what they do using - you guessed it declarative languages. Obviously, poor guy, he'll never succeed, but he's doing God's work, right? And we want to root for him anyway.
But of course, on the inside, the picture is significantly more nuanced and significantly less romantic. Let me tell you some facts about me that I've never told anybody publicly before. I don't belong here. I don't have any mathematical maturity whatsoever. I was such a bad student in high school, I didn't make it to calc, I didn't even make it to physics. I didn't get serious about my studies until college, and when I was in college I wasn't studying computer science, I was studying literature, just reading books. And then, I was unemployable after that. Obviously. I had to claw and scratch my way into a job that could pay me enough money to live out here when I moved out here, and I did that by getting a job as an editor at a search engine, and laterally moving my way into an ops role, a DBA role, a software engineering role. But truth be told, I only did that software engineering role for about five years before I went to grad school.
Then when I went to grad school, I didn't go to a grad school in programming languages. I am not, nor was I ever, despite what you may have heard earlier today, a programming language researcher. So who the hell do I think I am getting up here and talking about design? One of these is a lie by the way. See if you can figure out what it is. So you may have heard this expression that “data is the plural of anecdote”, but I'm afraid all I have to offer you today is some anecdotes. I hope you enjoy them.
The meat of this talk is really going to be about a few things that I got right somehow, a few strong opinions that I took early on, planted my feet, and that I used to guide my design process. I strongly think that in the future as we design new languages for new demands, it's important to lead with strong opinions. You can always attract the opinion later, but if you don't plant stakes in the ground, you'll never be able to start exploring the area in a meaningful way. Everything will be too amorphous. As a really thin bread around that meat, I want to lead in by talking about some of the ingenious arguments I came up with to convince myself not to bother designing a language, to convince myself that it was a bad idea, arguments that I hope I can convince you you can stop making for yourself. And then at the end, obviously, I'll talk about a few things that were surprising that I learned along the way.
Prelude: Misapprehensions and Misgivings
By way of prelude, I want to talk about FUD. You've all heard the term FUD. Fear, Uncertainty, and Doubt, it's something that typically companies sow in their potential customers about other companies. I think IBM coined the term. Your salesmen telling, "Oh, you could use our competitor, yes, hmm. I hope that works out for you." I think FUD is also something that we do to ourselves when we ask whether it's a good idea to do this thankless and laborious effort of language design. And I want to identify three killer bad arguments that we use to convince ourselves not to do it.
The first is the really superficial one, and it has to do with the look of a language. Amateur language designers like me tend to believe that a new language, a revolutionary language, should look unlike any language you've ever seen. It should look like the domain it's modeling, it should look cool, and it should look different enough from every other language that's every existed. You may be surprised to hear it, because it seems so superficial, who cares about syntax anyway? But I probably lost more time in this time sync in the early days of language design than any of the others. Because I would sit there on the beach with a notebook, or I'd sit there in bed with my Google docs, and I had this process where I had a number, I was trying to write a new language for protocols. And I had a bunch of protocols, Paxos and so on and so forth that I thought were really hard. And the exercise went like, "Can I invent a new syntax that makes it easier for me to express these things?" And I would try some two-way syntax, sometimes just an English language syntax.
But invariably, 9 out of 10 times, after I was done, “Here is my attempt to implement two-phase commit in some new language that you've never heard of,” I sort of came down. “Oh my God, fuck man. That's just Erlang.” And that's okay. But that would be enough for me to abandon the attempt. This language is not sufficiently new-looking. I wasted a lot of time in that way. I didn't start making progress until I gave up thinking about what a language is supposed to look like and said, "I'll figure out the look last. I want to figure out what the domain of discourse should be of the language. What should I be talking about? Not how should it look.”
Moving from the superficial to the slightly more fundamental, it's very natural for us to want languages to address a real external need. We shouldn't design a language unless the world needs it. Think about this as, the world doesn't need any new languages. The world doesn't need my language. The fact that the world needs a language is a bad reason to design a language. You stymie yourself all the time. I start to make progress on a language and then I'll be like "Whoa, wait a minute, wait a minute, wait a minute. Does this need to be a language? Could it be a library? Does anybody care? Will anybody care about my language?" And the issue of need is even worse when you open it up to external opinions. If you're thinking about designing a language to solve a problem, the last people you should tell about it are your friends and your broader community, because they're going to piss on you from a great height.
During the first year of my grad studies, I went to my first conference and I told a luminary in my field that I was working on designing a programming language for distributor community, and he literally laughed me out of the room, and I was a student and this was a bunch of senior people. I got laughed out of the room, I had to leave the room. Another year later I was at Line.Next at Microsoft presenting on this work, and a luminary in our community and a couple of other people just stormed out of my talk in the middle. Yes, so, they didn't like it.
And when I presented at my job talk at Harvard, a systems researcher who I admire very much, said something along the lines of, "Yes, this kind of reminds me of a Racket, and in Racket everything is a parenthesis. So, in your language, what is the thing that is everything that I don't buy?" That was nice. And don't even get me started on the Orange website. Don't ask the Orange website if they think you should fucking design a language. “How is this different than elixir?”, or they might just talk about something that they know that has nothing to do with what you're talking about. Isn't our community amazing?
Okay, let me get serious for a minute though. This one is really hard for me. Because I would love to get up here and say, “A, any language that we all design is almost certain to have no impact, in the sense that it's likely after 10 years to have one user, you.” And I want to say, “That's okay, do it anyway. It's worth doing anyway.” I believe that to be true. But I feel weird saying it, because people like us, the sorts of people who do this perverse and thankless activity of engaging in language design care enough about the world that we would like to see the world change, even if only a little bit under our influence. We would like to change the discourse a little bit, change the way that people think. It doesn't necessarily have to be measured in users, but we'll be heartbroken if we don't have some kind of impact. So how do we resolve this?
I think we accept the fact that we won't have impact. We find reasons to do language design anyway. If only because if we target impact, if we design our language with the desire to have a lot of users in mind, we're going to design the wrong language. We're certainly not going to design a radical language. We're probably going to design something that looks like Go. I didn't say I hated it. I said it was the wrong language though, you're right, I shouldn't have said that. I take it back. I take it back. I will take off all my other Go stuff throughout this talk.
Lucky Guesses: Things I Got Right
Enough about the doom and gloom. Let me tell you about some of the strong opinions that I just sort of asserted by Fiat and I think I got lucky, and they guided my design. The meta-theme of this talk, by the way, I'm just going to say it explicitly, is the primacy of context. I'm a bottom-up thinker, and bottom-up is the only way I know how to do design. You pick your context first and then you proceed from. You can always back up and change your context, but you plant stakes, and you establish context, and you use the context to frame the discussion. Otherwise, it's just an amoeba, it's just a blob. And as a first example of context, I would like to make the following indefensible statement: "Every language is a domain-specific language." Okay, now I have your attention. Every language obviously isn't a domain-specific language except in the vacuous sense that there is a domain called general purpose languages. But I think I can get away with a slightly toned-down version of this, which goes like this: "In the future, all radical new languages will be domain-specific languages."
Those of us who are designing those radical languages had better damn well pick a domain first. Because once you've picked a domain, you could circle it. In fact, maybe in some sense, language design is kind of like a drawing compass. It's just this object lying on my desk until I pick a domain, but once I've fixed the compass' foot, I can draw a circle. I can try to frame a domain of discourse around that domain. Maybe languages are like tools.
Now, in my case, choosing an actual domain, I got a little lucky. There was a domain that I was already kind of care mad about, a domain that I've been struggling to program for a number of years, and feeling like the tools and the languages were inadequate. I'm not going to lie, in choosing this domain I was thinking about impact a little bit. I was applying a piece of POP math that I believe is due to Eric Meyer. Anybody know Eric Meijer? He's another quixotic, cocksure, paladin of language design. And he said that the likelihood of adoption of a real paradigm changing language, a revolutionary language, is likely to be proportional to the magnitude of the crisis that the language might address, divided by the perceived pain of adoption of the language. And in the domain that I care about, systems programming, the denominator is sure to just be an astronomically high constant. Unless your language kind of looks like C, in which case maybe not.
The way you win this is you choose a domain that everybody agrees we're in deep trouble, and distribute systems seems like a good one for that. There's a great many others, and some emerging ones. Once you've framed the domain, you can begin to ask a series of questions that would have had no structure otherwise. What is the right language? It doesn't mean anything. But if you say, what is the right language for domain X? You can begin to frame a reasonable question. You go, "Okay, well, what's the right language for a domain?" Well, maybe it's a language that hides a bunch of details that are distracting, or not germane to that domain in order to, in so doing, expose a couple of needles. What are the needles? The needs, well, I guess the needles are equivalently the fundamental things about the domain that you want to be able to talk about explicitly, or maybe they're the hard things about the domain. More than likely they're the same. The things that are fundamental and unique to a domain are the things that are most challenging. They're the things that should be elevated and exposed and talked about explicitly in the language.
So you ask yourself, "Okay, what are the needles for my domain?" Oh, maybe then language design isn't so much using a drawing compass, maybe it's more like operating a pitchfork to clear away the distracting things, or duly operating a needle to extract the fundamental things. Languages are tools, huh. Zooming in a little further, I asked myself, what is it that's so devilishly hard about distributed systems, programming them, reasoning about them, etc.? I think I can make a list, and I'm inclined to make an ordered list, and the number one thing in that ordered list has to be program correctness. The hardest thing about distributed systems is, you've written a program, it gives rise to this huge space of possible behaviors due to things like non-determinism, and the execution machines failing, routes flapping, messaging being reordered. And you have to convince yourself that it produces the correct outcomes under all of those contingencies. You have to reason from a program through a maze of behaviors to a space of outcomes. That's super hard, and not well served, I would argue, by existing tools.
You know what else is super hard? It's the dual of program correctness, debugging. On the left side we're asking, does this program, under all these contingencies, always do the right thing? In debugging, we've run the program, it's done one of these contingencies, and it has done the wrong thing, and we want to know why. We'd like an explanation in terms of the program, its input, and the things that happened during the execution why I got X when I expected Y. That's hard for all the same reasons that program correctness is hard, and another, which is that the observed effect of some bad thing in the distrusted execution is very often remote in both space and time from its causes. So it's not just an intellectual challenge to obtain that explanation, it's also a technical challenge to combine all those logs, and do a big Hadoop job, It's a big pain in the ass to do distributed debugging in addition to hurting our brain.
I would argue that our community, the current state of the art, current programming languages and current tooling, aren't really trying to solve these problems. Some people are doing TLA, and that's great, and more and more people are instrumenting their code to do distributed debugging, and that's great, but we don't have an end-to-end solution to either of these problems and we haven't really made a lot of progress over the last 30 something years that we've been studying distributed systems. And to be sure, there are other things that are hard about distributed systems. But I'm inclined to say they're not top shelf concerns, in large part because distribution isn't what's fundamentally hard about them. It's just incidentally hard, if you like.
So distributed systems are large software systems that are difficult to maintain, difficult to do bug fixes, difficult to release new features, minimizing dependencies on other running code. But our community has done a lot of work, a lot of good work over the last decade, thinking both about organizational processes that make it at least easier, architectures that are ideally suited to being able to do incremental release with minimal dependencies. Our community has made a lot of progress in this area. That's why I'm not going to talk about it anymore. There's hope there. But that's not what's really hard.
The other thing that our community has done a lot of great work on, and something that is certainly hard, is the problem of keeping code portable and manageable in heterogeneous environments. There's been a ton of great work in virtualization. It's easier than ever to think about our code running in an extremely heterogeneous environment, in a multi-tenneted way that maximizes utilization, and that's great. But you could see why I'm inclined to argue that focusing on the things that we can fix rather than the things that are going to sync us is akin to rearranging the deck chairs on the Titanic. If you are a rearranger of deck chairs, and not a melter of icebergs, it's fine to rearrange the deck chairs. You should put in the work wherever it can go, but the ship is still going to sink.
Why? Why is it so hard to write correct programs, and why is it so hard to debug them? Well, the more I thought about it, the more it seemed to me that it had to do with the paradigms, the languages we were using to write these programs. As I argued a moment ago, in some sense, from some really high level of abstraction, all a program is, distributed or otherwise, a description of a mapping from inputs to outputs. In all correctness, there's constraints on that mapping or constraints over the outputs. So, program correctness is really a matter of data. It has to do with whether the transformation from input to output data is constraint persevering or not. The behaviors that are exhibited in the distributed execution are arguably sort of a distraction, that data is finding its way through this maze, but data is finding its way through this maze if we've written our program in a traditional language in a very sort of implicit way.
Let me explain what I mean. All current programming languages for distributed systems- correct me if I'm wrong- irrespective of the paradigm, whether it's imperative, or functional, or whatever, focus the programmer's attention explicitly on control flow. And only implicitly on details like data flow and time, okay? What do I mean by that? Well, if I've written my program in an imperative language, I've structured my computation around current structs like loops and branches. If I'm using a functional paradigm, I've structured it around reccursion and functional composition, but at some level, I'm asking the programmer to live in this intermediate space between the program and the behaviors that it happens to induce. The reasoning about data, following the data through the system, is something that's very challenging to do, because data is relegated to a second class role in traditional languages. Data is stuff that is marshaled into function arguments and flows out of functions returns, and finds its way into exception handlers and things like this. Data is the thing that is stored in variables that persist their values across loop iterations and things like this.
Similarly, ordering and timing are implicit in these languages. In an imperative language, if I want to combine two chunks of code, A and B, the straightforward and obvious way to do it is to say A;B, do A and then do B. Which makes tons of sense if B has a data dependency on A, but which makes no sense whatsoever if they're independent. Similarly, in functional language, all our data structures are ordered. We can post functions in a particular order. Order is sort of everywhere, but you don't talk about it, you just sort of get it implicitly, and that seems upside down to me, completely upside down. A language that made it easy for me to reason about data flowing from inputs to outputs, so I could reason about correction, or falling in the reverse direction from outputs to inputs so I could reason about debugging, would instead make data explicit, allowing me to reason about how data is flowing through the system, so that I could focus on the things that could cause my program to be incorrect, inadequately constrained change of that data.
So, what I want, I have no idea what it looks like, but I want a language that allows me to talk about data and time as first-class citizens, even if that means talking about control flow as a second class citizen. What if control flow was the thing that was implicit, it just happened? A programmer doesn't talk about behaviors, a programmer talks about data, and says, it can change in this way. Or maybe, sometimes, I can't control how it changes over here, so I need to move the control around the system. And maybe everything else is a distraction. Maybe the needle is really data, and time is a needle that provides context for the data. Maybe control flow is the haystack, and another haystack then, in order to make all this work, is details about the representation of the data.
It's really hard for me to track dataflow through a system if sometimes the data is scalars and variables, sometimes it's in tree data structures, sometimes it's marshaled into a message, sometimes it's in a register, sometimes it's in my memory, sometimes it's laid out in blocks on my disc. Wouldn't it be really nice if I had like a uniform representation so I could see data for data and I could follow the data through the system? This is kind of where I was thinking. Then it occurred to me, a compass can draw a circle by fixing a point, but if we were to fix another point, like the things we want to talk about in our language, everybody knows with two points we can draw a more precise shape, two points in a piece of string we can draw an elipse, and try to sort of get even closer to capturing the demand that we want to talk about. Yes, maybe languages are pins and string.
I knew what I wanted my language to talk about, but I had no idea what it looked like, or how you could use it, or what its semantics were. I had six or seven years to screw around with this stuff, so I did. A lot of time passed. I spent my time on Google docs thinking about what the right state primitive should be. How can we talk about data changing? Whatever language I come up with there should be a way to efficiently evaluate it, that's pretty important if you're a systems person. I screwed around more with syntax. I got really upset about this thing that Jeff Ullman said in one of his books, in which he argued that encapsulation, this thing that you really want, you hide an implementation so that you can abstract above it, and declarativity, allowing a programmer to give a precise description of what should happen in a computation but not necessarily how to do it, appear to be very much at odds with each other.
So, declarativity tends to work really well for very small programs, but when we want to build big programs out of small programs, stuff gets fussy. I'm not going to talk much more about this in this talk, but, this is a problem. So anybody in the audience who is interested in declarative programming, I'd love to hear if you have any ideas about how we can do a better job of hiding encapsulation and reuse in query languages. And then, of course, the number one thing that occupied all my attention and kept me up at night, which was how do we, in a reasonable way, talk about uncertainty? How does a language talk about what it can't talk about? That's weird.
But I was a grad student. So I wasn't just writing, I was reading too. I want to take a little bit of a detour, which I hopefully have time for, into a theory that provided a lot of both inspiration for me and validation of my approach. It's a piece of theory called descriptive complexity. Has anybody ever heard of descriptive complexity? I didn't think so. It's a subfield of finite model theories. Has anybody ever heard of finite model theory? It doesn't matter. So, there was this finite model theorist named Ron Fagin at IBM. And a year before I was born in 1974, he made this profound discovery. It's the coolest thing ever to me. And I'm not a complexity wonk. I just think it's really cool by analogy.
He observed that the complexity class MP, the complexity class, the class of problems that can be solved by a non-deterministic automaton in polynomial time, and whose solutions can be checked by a deterministic automaton in polynomial time, is precisely the same as the class of problems that can be expressed in a somewhat arcane logic known as second-order existential logic. Anybody ever heard of second-order logic? Cool. As we all know, first-order logic allows us to talk about members of domains, and their relations to each other. It allows us to quantify over those members of domains; there exists in X such that it has some property, or for all Y there's some property. Second-order logic allows us to, in addition to all that, quantify over sets and relations themselves. It allows you to say, there exists a set with some properties, or for all sets, some properties. Second-order existential is the fragment of second order that allows you to say, there is a set with some properties but not to quantify universally over sets.
Ron noticed that this complexity class, which we usually think of in the context of computational machines- we have these measures of complexity, space, and time, that would appear on the surface to be very closely tied to our model of computation, Turing machines. This infinite tape and a head that writes ones and zeroes on the tape. It makes sense in that context, because you can ask yourselves, "To solve a particular problem, how much tape do I need, and how many times do I need to write something on it?" But that's not really satisfying from a mathematical point of view. How do we know that that's fundamental, that NP really means something outside the context of the universal computing machine?
Fagin's result suggested that there may be a rich correspondence between languages and the exact class of complexity that they characterize. So, I'm not going to spend a lot of time of time on this, but this is really cool stuff. You’ve got to take my word for it. You're familiar with the three coloring problem. You have a map, and the question is, can I, just using three crayons, color in all the countries in the map such that no two countries that are adjacent are colored the same color. It’s easy to see that we can express this in second-order logic. We just say there exists a relation, red, yellow, and blue with some assignment of values to them, such that for all countries, it has a color. It's either red, blue, or yellow, and for all other countries, if there's a border, then they're not both red and they're not both yellow and they're not both blue.
On the one hand that's cool because we can see that an NP-complete problem can be written down in second-order logic, but many of you are saying, “Who cares? I don't know how to actually execute this.” But, if you look at it from the other side, it's really interesting too, because let's say that I had an assignment of values to those three relations, red, yellow, and blue. Then it's easy to convince myself that given that assignment, this first order fragment of the same, it could be very efficiently checked. I just look at the map after I've colored it and I see if there are any two countries next to each other that are of the same color. So there's an efficient polynomial time check for a solution, but it's by no means obvious how we assign those values. To a first approximation, you’d have to try every color. That sure sounds like an NP-complete problem to me.
Similarly, SAT, there exist a Set, S, of the true variables, such that for every clause there exists a variable, such that if the variable is set to true in that clause, then we set it to true in S, and if it's set to false in that clause then we haven't put it in S. Again, if we had an S, it would be easy to check if this held. If we had an assignment it would be easy to check. But it would appear that we have to try all the possible assignments. That sure sounds like a problem that's in NP. Everybody who ever studied descriptive complexity, even if they only got to page two of the book, spent a couple two, three months freaking out about this picture, because it's on the inside cover of the book.
For many people, this is as far as you get, but this is still super dope. What you see, I have it here on the right is, these are complexity classes. So this is the class of bounded circuits, and this is the log time computations, this is the polynomial time computations. This is MP and CoMP. And then way up here, oh, hey, look, it's recursively enumerable complete, there's a halting problem up there. So all of our friends from the zoo of complexity are over here on the right. Like I said I'm not a complexity wonk, so I only know a couple of two, three of them.
What's highly crazy is that over here on the left, these are fragments of logic. So, the bounded circuits are exactly the things that you can write down in first-order logic. The log time computations, such as two-color, which is log time complete, are exactly the things you can write down in first-order logic if you've augmented it with a deterministic transitive closure operation. If you augment it with a least fixed point operator, you can express the polynomial time computations. If you take second-order logic and add least fixed point, you can express that. This is just so amazing. There's a one-to-one between the words and constructors that we use to talk about a mapping in the most abstract sense, and how hard it could be to evaluate that, regardless of how you evaluated it. Well, that's cosmic. And it was, as I said, both inspirational to me, because it got me thinking about carving up languages into different fragments and trying to assemble different languages out of the pieces. But it was also validating for me, because every man like me does view programs from this really high level as a precise description of a mapping between inputs and outputs.
This was validating for me, because I've been looking at the world in this way for some time. Those of you who saw my Strangewood talk in 2015, I went on for a while about why wouldn't you just write down a web server in SQL, because SQL concisely and completely describes the behavior of the web server. All you have to do is squint your eyes and say, "I can imagine that a stream of requests can be modeled as records, and I can imagine the pages on disc can be modeled as records, and it's really just a natural joint between them. They have to be on the same server, and they have to be for the same file. Why wouldn't you write it down that way? Especially if it provided guarantee about how efficient the computation could be"
All this descriptive complexity stuff got me thinking, "Man, maybe what languages really are is that they're lenses that we put between ourselves in a domain in order to better study. And what does a lens do? It frames things. It puts some things in the frame, some things out, and it magnifies things, and hell, maybe it burns holes in things too. There's no reason why we should stop with a single lens. We should permit ourselves to grab a couple lenses and play amateur optician. "Hey, is that better? Number one or number two, what do you think?" But you certainly can do things, like what would happen if I took SQL and it wasn't quite big enough to hold my problem? Is there a couple of things from Erlang I could just add to it to get the language I wanted? Maybe languages are just lenses we can try out, we can throw out.
Then I started thinking about these fragments. Not the same fragments. I'm not a second order person. But I was a query person, I was a database person. I started thinking about SQL and the relational algebra operators out of which SQL is built. Any of these symbols mean anything to anybody? Just like an eye test. So, the symbol is selection. The basic idea is you have some rows, throw some of them away. This is projection. You have a row, throw some of the columns away. This is joined. You have a couple of rows, combine them and make and a new row, and then maybe throw it away, or throw away some of the columns. And you can combine those in any ways you want. The language that you get from combining those in any way you want is a very small language, a very minimally expressive language known as the conjunctive queries. But it's not that unexpressive. It was expressive enough to implement that web server. So you can write things in the conjunctive queries. And in exchange for having written them in such a small language, you get some really amazing properties.
In the conjunctive queries, you can decide containment. Containment is decidable. That is to say, I can write a procedure that given a query P and a query Q, I can say, is P contained in Q? Meaning, for every record that P produces, does Q also produce it? That's a huge thing for query optimization that's used all the time in query optimization when we can determine in a database that a pair of queries are conjunctive. But obviously, if you can do containment, you can do containment in both directions, and if P is contained in Q and Q is contained P, P and Q are equivalent, which means this is a language that you can write some interesting things, but not all of the things you might want to write, for which equivalence is decidable. That's huge, right?
I can have these two programs that look completely different. One could be a million lines and one could be two. You could put it into the decision procedure and say, they're the same program so you probably want to work with the one with two lines. These two programs that look totally different, one is fast, one is slow. Well, guess what, they're equivalent, so you should run the faster one. That's amazing. Unfortunately, if we extend this language in the tiniest bit with anything, we lose decidability for containment. If we add, for example, negation, we arrive at the aggregation-free subset of SQL, cool language that you can write some cool stuff, richer things in the web server. Unfortunately, decidability and containment goes out the window.
It's also interesting to ask, well, what if we shift this thing over? Well, if we shift this thing over and say, “Forget about negation, I don't want it, but I want the ability to do least fixed points. I want to defne things recursively. I want to be able to define the transitive closure over relation, or paths, and graphs, or things like this.” We get a language called Datalog. Containment isn't decidable in Datalog either. But Datalog has some really nice properties that you get in exchange for having a small language. In Datalog, all computations terminate. All computations fall. In Datalog, all computations produce a unique result irrespective of the order in which data arrived and the order in which computation was scheduled. Determinism despite non-determinism. This sounds really compelling, maybe this is just the language that I want. I want a language that I can run in distributed systems where all the bad things that happen in distributed systems don't happen to me.
Most of the bad things that happen in distributed systems have to do with non-determinacy and timing. If I can write down programs in a language that doesn't give a rat's ass about timing, I have a language that can capture the eventually consistent programs, that's awesome. Let's just write our systems in Datalog. Then I'm like, "Oh, wait a minute, I had it wrong.” Maybe what I learn from descriptive complexity is not that languages are lenses that frame a subjective. Really what they are is lassos that you capture a subject and you cinch it down. You want the smallest language that captures the programs that you want to write, but it's small enough that you can't write anything else, because in exchange for that smallness you get provability, and you potentially get efficiency. Context, context, context.
Let's talk about context for a minute. I promised I wasn't going to show you, or did I say this at the beginning of the talk? Maybe I forgot to say this. I forced myself, in this talk, to not show any syntax, because I didn't want it to be about syntax, I wanted it to be about the process and stuff. Turned out that was super hard to give a talk about a language without showing any syntax. I think I can get away with this, because this is just a record. This isn't the syntax in my language. So, in Datalog, I can talk about knowledge. I can talk about a particular piece of knowledge, a string called details that is in a predicate called knowledge. So you ask yourself, "What if I wanted to use Datalog to capture distributed computing, where at the very least I would need to also talk about space, the location of data?" Because it might be the case that somebody knows the details and somebody else doesn't, and if we want everybody to know we have to invent a protocol to move the data around.
Well, that's easy, because in some sense- let me say this slowly- space has no semantics. Space is just a name of a thing, so we can just add another column to a datalog relation and model the fact that host one knows this but host two doesn't. That's interesting. I didn't need to change anything to model space. Space doesn't mean anything, it's just a name. Sure we have to write and figure out how to route to the host, but that's not a languages problem. What we've done is we've added context and set a richer thing.
Now, two things that Datalog can't do that we need to do if we want to be able to model the sorts of systems that we write is, it can't model time-varying state. It can talk about knowledge but it can't talk about that knowledge coming into being, and then being lost or changing or something like that. Certainly, these things can happen in real life. Maybe more fundamentally, it can't capture all that uncertainty that makes these systems so hard. I can't talk about what it means to be unsure about when you receive a message that I sent. It occurred to me in a dream that maybe you could kind of kill both of these birds with one stone by slotting in one more piece of context. It's one thing for something to be known, it's one thing for it to be known by an agent, and it's another thing for it to be know by an agent at a particular time.
Now, we can start to model a time before 27 where it wasn't known, or a time after 27 when it changes to some other knowledge. Then you can keep turning this crank. Because this thing kind of looks like a register, some host has a current value that could be a different value at another time, and if we wanted to generalize from a register to something like a key-value store, we would just have to add one more column that contextualized the value in the context of the key that it belonged to. So this adding context game works really well when we're modeling the state of our universe uniformly as records.
Unfortunately, Datalog language didn't give us any way to talk about time changing. So we'll need at the very least to be able to model state change to add two things to Datalog, a successor function so we can talk about time marching forward, which changes the expressivity of our language, of course. And we have to be able to talk about negation too. Because if we want to model mutable stain, we have to talk about what changes, and what doesn't change as the clock moves. So we need to be able to say not, and we need to be able to say plus one. But if we add not and plus one to Datalog we get a language that I thought I had invented in 2010, but it turned out somebody had already invented it. This guy named Bertram Ludascher created a language called Statelog in which he said, "Hey, if you add negation and plus one to Datalog, you get a Datalog that can model state machines." Which is pretty cool. And then the one last thing to add was something akin to please advance the clock by one that was more like, please admit that the clock changes in a way that I can't control, and when I send you a message and my clock is 27, what is your clock when you get the message? Can I even say that it's greater than 27? I don't think so.
I think I have to say, I can advance the clock by one, and I can make the clock jump to some random place. If I can capture both of those things in language, my language can talk about everything that can happen to distributed execution, that's pretty cool. Unfortunately, it yields a very big language. A Turing complete language in which I can express any program. This isn't surprising. Because there were all these protocols that I was trying to write, and I added to Datalog until I had a language rich enough to write them. This is syntax, but you can't read it, so it doesn't matter. So, I went ahead and, damn it, implemented two-phase commit and I was like, "Yes, that was cool, that felt right." And I implemented three-phase commit, "Oh, cool, that felt right." And I implemented Paxos. And in every case I needed all of those things. I needed least fixed point, I needed successor, I needed negation. And I began to ask myself why.
Then I took a step being further back, and I was like, "What is it about these protocols that they seem to require negation to express?” They seem to require, broadly speaking, non-monotonicity to express, and they also are the things that I deploy in a system when I'm concerned about non-determinism in ordering. There's some weird thing happening where negation is required to say these things, and it's also the reason why I might want them. My advisor, Joe Hellerstein, characterized this sort of dualism in this way. He said, "If you want to write a protocol that waits, typically you do it by counting, you count votes” or something like that. But if you want to count something, you have to wait until the thing you're counting has stopped changing. So you need a barrier. Sometimes there's this weird correspondence between waiting and counting. You need to count to wait, you need to wait to count.
Another way of saying that is there's something about not. We need not to write a protocol that expresses synchronization, and if our program has not in it, it might require synchronization as we saw in the case of Datalog. Which means there was another fragment that I should probably explore that I somehow neglected to explore before, which is the negation-free fragment of Datalist. I can talk about state evolving over time. I can combine data in a variety of interesting ways including performing looping. I can be honest about non-determinacy and message delivery order, and non-determinacy and what hosts I can expect to be up at one time or another.
And yet it yields a language which produces a deterministic outcome for all message delivery orders. That's pretty cool. I like to imagine that somebody much smarter than me, one day, will draw a chart that looks like Amerman's chart, that is trying to explore the relationship not between fragments of logic language and complexity classes, but between fragments of logic languages and fundamental requirements for synchronization. We can imagine that there are programs that we can write which require no synchronization whatsoever, and we can imagine that there are programs that we can write that are somewhere in this region up here that don't require huge synchronization, that we could find an appropriate synchronization regime for.
Discoveries: Stuff I Learned along the Way
By way of closing. This is weird, right? What is all this stuff about? What do all these things have in common? What does a lasso have in common with a lens, with a compass, with a magnet? One thing that they have in common is that none of them are the thing. None of them are the artifact. They're all tools that mediate between an object of study and a subject of study. Whether they're framing things, or shrink-wrapping things, it has to do with intimacy between a domain that we care about and a poor person who really cares about the domain. So I think what I took away from this experience of working with Datalist is that there's really only one good reason to design a language. It has nothing to do with external need, it has nothing to do with expectations of impact, it certainly has nothing to do with the look and feel of the language itself. It has everything to do with deciding that you want to understand the domain so badly, and because you've tried everything else that you want to get skin to skin with it, and there's no way to get closer to a domain to fit it so closely than to design a language around it.
I hope I have convinced you that it's not about the look. It's about the fit. That we shouldn't seek an external need to fill when we engage in language design. If we engage in language design, we're doing it to fulfill a really serious need of our own. The impact, well, I don't know where to do this, because I think there are a lot of different ways to measure impact. I will admit that after about, let's see how long has it been, it's been almost 10 years since I designed Datalist. How many users do you think it has? Three? Are you one of them? Because I only know of the one. So, yes, it only had one user. So it's a failure. No, I don't think so.
I think it was the biggest success of my life. I think that Datalist is my boy. Datalist is this thing that I use to frame my understanding of distributed systems problems. In fact, in meetings it's often the case that somebody is explaining something, I'm saying, "Wait, explain it again, I don't get it. Wait, explain it, hold on, let's do it Datalist on the whiteboard." "I see what's going on there." Datalist is how I understand distributed systems now, it's my internal language, I dream in it. Pretty much all the good research that I've done since then has been built on it. Sure, nobody read this paper, I couldn't get it, because I have to do a good workshop. Nobody read the paper, and that's not true. I read it a hundred times, and I used it as the basis for Lineage-driven Fault Injection, which is the main research project at my lab. I've used it in applications like protocol repair, consistency analysis, a variety of kinds of distributed debugging. It frames the way that I think about the world, and it's my greatest success. Poor lucky me. Can anybody guess what the lies was?
Participant 1: You are a researcher.
Alvaro: Yes. I was talking about this talk to a PL researcher friend of mine, and she's like, “It sounds like all these surprising things that you learned, A, aren't surprising, and B, were because you weren't doing language design, you were doing PL research.” So, in some sense, that was the lie. I am a PL researcher, and so can you.
See more presentations with transcripts