1 Google Research, 2 Stanford University
This document provides an overview of the material covered in a course taught at Stanford in the spring quarter of 2018. The course draws upon insight from cognitive and systems neuroscience to implement hybrid connectionist and symbolic reasoning systems that leverage and extend the state of the art in machine learning by integrating human and machine intelligence. As a concrete example we focus on digital assistants that learn from continuous dialog with an expert software engineer while providing initial value as powerful analytical, computational and mathematical savants. Over time these savants learn cognitive strategies (domain-relevant problem solving skills) and develop intuitions (heuristics and the experience necessary for applying them) by learning from their expert associates. By doing so these savants elevate their innate analytical skills allowing them to partner on an equal footing as versatile collaborators — effectively serving as cognitive extensions and digital prostheses, thereby amplifying and emulating their human partner's conceptually-flexible thinking patterns and enabling improved access to and control over powerful computing resources.
Suppose you could merely imagine a computation, and a digital prostheses, an extension of your biological brain, would turn it into code that instantly realizes what you had in mind. Imagine looking at an image, dataset or set of equations and wanting to analyze and explore its meaning as an artistic whim or part of a scientific investigation. I don't mean you would use an existing software suite to produce a standard visualization, but rather you would make use of an extensive repository of existing code to assemble a new program analogous to how a composer draws upon a repertoire of musical motifs, themes and styles to construct new works, and tantamount to having a talented musical amanuensis who, in addition to copying your scores, takes liberties with your prior work, making small alterations here and there and occasionally adding new works of its own invention, novel but consistent with your taste and sensibilities.
Perhaps the interaction would be wordless and you would express your objective by simply focusing your attention and guiding your imagination, the prostheses operating directly on patterns of activation arising in your primary sensory, proprioceptive and associative cortex that have become part of an extensive vocabulary that you now share with your personal digital amanuensis. Or perhaps it would involve a conversation conducted in subvocal, unarticulated speech in which you specify what it is you want to compute and your assistant asks questions to clarify your intention and the two of you share examples of input and output to ground your internal conversation in concrete terms.
More than thirty years ago, Charles Rich and Richard Waters published an MIT AI Lab technical report  entitled The Programmer's Apprentice: A Research Overview. Whether they intended it or not, it would have been easy in those days for someone to misremember the title and inadvertently refer to it as "The Sorcerer's Apprentice" since computer programmers at the time were often characterized as wizards and most children were familiar with the Walt Disney movie Fantasia, featuring music written by Paul Dukas inspired by Goethe's poem of the same name1. The Rich and Waters conception of an apprentice was certainly more prosaic than the idea described above2, but they might have had trouble imagining the amount of code available in open-source repositories and the considerable computational power we carry about on our persons or can access through the cloud.
In any case, you might find it easier to imagine describing programs in natural language and supplementing your descriptions with input-output pairs. The programs could be as simple as regular expressions or SQL queries or as complicated as designing powerful simulators and visualization algorithms. The point is that there is a set of use cases that are within our reach now and that set will grow as we improve our natural language understanding and machine learning tools. I simply maintain that the scope of applications within reach today is probably larger than you think and that our growing understanding of human cognition is helping to substantially broaden that scope and significantly improve the means by which we interact with computers in general and a new generation of digital prostheses in particular. Here are just a few of the implications that might follow from pursuing a very practical and actionable modern version of The Programmer's Apprentice:
Develop systems that enable human-machine collaboration on challenging design problems including software engineering:
The objective of this effort is to develop digital assistants that learn from continuous dialog with an expert software engineer while providing initial value as powerful analytical, computational and mathematical savants. Over time these savants learn cognitive strategies (domain-relevant problem solving skills) and develop intuitions (heuristics and the experience necessary for applying them) by learning from their expert associates. By doing so these savants elevate their innate analytical skills allowing them to partner on an equal footing as versatile collaborators — effectively serving as cognitive extensions and digital prostheses, thereby amplifying and emulating their human partner's conceptually-flexible thinking patterns and enabling improved access to and control over powerful computing resources.
Leverage and extend the current state of the art in machine learning by integrating human and machine intelligence:
Current methods for training neural networks typically require substantial amounts of carefully labeled and curated data. Moreover the environments in which many learning systems are expected to perform are partially observable and non-stationary. The distributions that govern the presentation of examples change over time requiring constant effort to collect new data and retrain. The ability to solicit and incorporate knowledge gleaned from new experience to modify subsequent expectations and adapt behavior is particularly important for systems such as digital assistants with whom we interact and routinely share experience. Effective planning and decision making rely on counterfactual reasoning in which we imagine future states in which propositions not currently true are accommodated or steps taken to make them true . The ability for digital assistants to construct predictive models of other agents — so-called theory-of-mind modeling — is critically important for collaboration .
Draw insight from cognitive and systems neuroscience to implement hybrid connectionist and symbolic reasoning systems:
Many state-of-the-art machine learning systems now combine differentiable and non-differentiable computational models. The former consist of fully-differentiable connectionist artificial neural networks. They achieve their competence by leveraging a combination of distributed representations facilitating context-sensitive, noise-tolerant pattern-recognition and end-to-end training via backpropagation. The latter, non-differentiable models, excel at manipulating representations that exhibit combinatorial syntax and semantics, are said to be full systematic and compositional, and can directly and efficiently exploit the advantages of traditional von Neumann computing architectures. The differences between the two models are at the heart of the connectionist versus symbolic systems debate that dominated cognitive science in 80's and continues to this day [60, 29]. Rather than simulate symbolic reasoning within connectionist models or vice a versa, we simply acknowledge their strengths and build systems that enable efficient integration of both types of reasoning.
Take advantage of advances in natural language processing to implement systems capable of continuous focused dialog:
Language is arguably the most important technical innovation in the history of humanity. Not only does it make possible our advanced social skills, but it allows us to pass knowledge from one generation to the next and provides the foundation for mathematical and logical reasoning. Natural language is our native programming language. It is the way we communicate plans and coordinate their execution. In terms of expressiveness, it surpasses modern computer programming languages, but its capability for communicating imprecisely and even incoherently, and our tendency for utilizing that capability makes it a poor tool for programming conventional computers. That said it serves us well in training scientists and engineers to develop and apply more precise languages, and its expressiveness along with our facility using it make it an ideal means for humans and AI systems to collaborate. The consolidation and subsequent recall and management of episodic memory is a key part of what makes us human and enables our diverse social behaviors. Episodic memory makes it possible to create and maintain long-term relationships and collaborations [64, 53, 59].
Think seriously about how such technology might ultimately be employed to build brain-computer-interfaced prostheses:
This exercise primarily relies on the use of natural language to facilitate communication between the expert programmer and apprentice AI system. The AI system learns to use natural language in much the same way as a human apprentice would — as a flexible and expressive tool to capture and convey understanding and recognize and resolve misunderstanding and ambiguity. The AI system interacts with computing hardware through a highly instrumented integrated development environment. Essentially, the AI system can read, write, execute and debug code by simply thinking — reading and writing to a differentiable neural computing interface . It can also directly sense code running by reading from STDERR and STDIO, parsing output from the debugger and collecting and analyzing program traces. The same principles could be applied to develop digital prostheses employed for a wide range of intelligence-enhancing human-computer interfaces.
This document attempts to optimize for the student or software engineer knowledgeable about neural networks and interested primarily in understanding how one might go about building a system along the lines of the programmer's apprentice. It is my experience that this audience has relatively little appetite for details about relevant work in cognitive and systems neuroscience that has informed the design sketched in these pages. The course website for the class I taught at Stanford in the 2018 Spring quarter serves as an extensive resource for those reading this document. It includes all of the lectures and discussion notes for the class and I refer to it often in this document as a source of supplementary information.
This document is not intended to provide the reader with a short course in cognitive science, artificial intelligence, natural language processing, machine learning, artificial neural networks, or automated code synthesis / automatic inductive programming, and is certainly not intended to cover all these disciplines in any but the most cursory of detail. The primary goal is to explore the possibility of building digital assistants that considerably extend our ability to solve complex engineering problems with a emphasis here on software engineering. A secondary goal is to explain how the field of neuroscience is helping to achieve our primary goal3.
The fields of cognitive and systems neuroscience are playing an important role in directing and accelerating research on artificial neural network systems. Much of this work predates and helped give rise to the especially exciting work on connectionist models in the 1980s. However, in the nearly 40 intervening years, a great deal of progress has been made, much of it due to improved methods for studying the behavior of awake behaving animal subjects and human beings in particular. Indeed, this work is undergoing a renaissance fueled by even more powerful methods for observing brain activity in human beings in the midst of solving complex cognitive tasks.
The field of automatic programming, after decades of steady, often quite practical research on using symbolic methods — much of it originating in labs outside the United States, is seeing a renewed interest in artificial neural networks. It remains to be seen whether artificial neural networks will have a significant impact on code synthesis, however there appear to be opportunities to leverage what we know about both natural and artificial neural networks to make progress, and hybrid systems that combine both connectionist and traditional symbolic methods may have the best chance of pushing the state-of-the-art significantly beyond its present level.
We begin with the problem of how to represent information in memory. In the case of the programmer's apprentice, relevant information includes the type of items that software engineers routinely think about in plying their trade such as algorithms, data structures, interfaces, programs, subroutines and tools such as assemblers, compilers, debuggers, interpreters, parsers and syntax checkers. Then there are the things that programmers generally do not think about explicitly but that concern how they solve problems and organize their thoughts, including, for example, the design strategies we learn in computer science courses such as divide-and-conquer, dynamic-programming and recursion. Finally, there is strategic organizational information of a sort that plays a role in any complex individual or collaborative effort including plans, tasks, subtasks, specifications and requirements.
All of this information has to be encoded in memory and made accessible when required to perform cognitive tasks. Information, whether in a computer or a brain, tends to move around depending on what is to be done with it, and, at least in biological brains, it is constantly changing6. In biological brains, it is difficult if not impossible to think about something without changing it. In building systems inspired by biological brains we have somewhat more control over such changes, but control comes at a cost. We make no distinction between concrete and abstract thoughts — all thoughts are abstract whether they represent atoms or bits. We will on occasion refer to memories as being short- or long-term but the distinction doesn't begin to address real issues. When we talk about episodic memory, it may seem that we are referring to some sort of permanent or archival memory, but that's not the case.
Since this document is more condensed precis than unabridged thesis, we need some way of navigating the huge space of ideas relating to biological and artificial brains as they pertain to building digital assistants and automatic programming. I'll begin by pointing out that language, programs and plans are all usefully thought of as having hierarchical, recursive structure. It also makes sense to think of brains as being organized as such [4, 47, 21, 32, 20, 42], and the human brain apparently employs hierarchical models to make sense of the world in which it evolved.
To the untutored mind, the world is essentially flat. We impose hierarchical structure to make understanding it more tractable. We ingest sequences of observations as input and execute sequences of actions as output. What goes on between is complicated. Rather than immediately focusing on how biological and artificial brains learn and apply hierarchical models, we start by considering the simpler problem of how we might represent a subroutine, the smallest fungible unit of activity for our purposes. Subroutines can be used to kick a soccer ball or implement simple program transformations in a neural-network architecture.
The following assumes familiarity with artificial neural networks7. We begin with the simplifying assumption that subroutines can be represented as tuples consisting of a set of operands represented as high-dimensional embedding vectors, a weight matrix representing the transformation and a product vector space in which to embed the result. In applying this idea to program transformations, assume that each operand corresponds to the embedding of an abstract-syntax-tree representation of a code fragment, w.l.o.g., any non-terminal node in the AST of a syntactically well-formed program. In the remainder of this section and the next, we use the following abstractions and abbreviations:
The second introductory lecture (HTML) for the Stanford course associated with this document provides a high-level overview of the relevant research in cognitive and systems neuroscience. Annotations of the form "Slide #" link to relevant slides in the introductory lecture. For example Slide 2 covers the key anatomical landmarks in the human brain mentioned in this document. While the material in the remainder of this section refers to work in cognitive and systems neuroscience, the discussion here emphasizes applications of what we've learned from neuroscience, and so the interested reader is encouraged to at least skim the above-linked lecture notes.
Referring to the abstractions and abbreviations introduced in the previous section, reading a program from STDIO — the analog of a human programmer reading a program displayed on a monitor — will result in — at least — two different internal representations of the resulting AST: an embedding vector in the SMS and a key-value representation in the DNC. The former allows us manipulate programs and program fragments as fully-differentiable representations within distributed models. The latter allows us to modify, execute and share code in a human-accessible format, fully compatible with our software-development toolchain.
Following , we assume EMS consists of initial-state-action-reward-next-state tuples of the form (st,at,rt,st+1). State representations st have to be detailed enough to reconstruct the context in which the action is performed and yet concise enough to be practical. Suppose the PFC directs the activation of selected circuits in the SMS via the global workspace (GW) — see Slides 3 and 5 — in accord with Dehaene et al [24, 22] assuming a prior that generates low-dimensional thought vectors . The state representation st encodes the attentional state that served to identify representations in SMS relevant to at allowing the EHC to produce the resulting state st+1. Given st we can reproduce the activity recorded in the EMS, and, in principle, incorporate multiple steps and contingencies in a policy constituting a specialized program-synthesis or program-repair subroutine.
Such subroutines would include repairing a program in which a variable is introduced but not initialized, or when it is initialized but ambiguously typed or scoped. As another example, a variable is initialized as VOID and subsequently assigned an integer value in some but not all branches of a conditional statement. Other examples of repair routines include problems with the use of comparison operators, e.g., conditional branches both with ≤, the is operator is used instead of is not, or vice versa, confusion involving A is not None, A not None and A != None, and problems involving class methods, e.g., when self accessor is missing from a variable, e.g., mode = 'manual' instead of self.mode = 'manual' [72, 25, 82].
Attentional machinery in the prefrontal cortex (PFC) populates the (GW) by activating circuits relevant to the current input and internal state, including that of the DNC and any ongoing activity in (SMS) circuits produced by previous top-down attention and bottom-up sensory processing. The PFC in its role as executive arbiter identifies operators in the form of policy subroutines and then enlists the EHC to — using terminology adapted from Von Neumann machines — to load registers in short-term memory and perform operations by using fast weights to transform the contents of the loaded registers into product representations that can either be fed to associative embeddings, temporarily stored in other registers or used to modify the contents of the DNC thereby altering the AST representation of the target code and updating the display to provide feedback to the human programmer.
The primate cortex appears to be tiled with columnar structures referred to as cortical columns. Some neuroscientists believe that all of these columns compute the same basic function. However, there is considerable variation in cell type, thickness of the cortical layers, and the size of the dendritic arbors to question this hypothesis. The prefrontal cortex is populated with a type of neuron, called a spindle neuron, similar in some respects to the pyramidal cells found throughout the cortex, that allow rapid communication across the large brains of great apes, elephants, and cetaceans. Although rare in comparison to other neurons, spindle neurons are abundant and quite large in humans and apparently play an important role in consciousness and attentional networks — see Slide 4.
The corresponding artificial neural network architecture for the programmer's apprentice application consists of a hierarchy of specialized networks with a relatively dense collection of feedforward and feedback connections that enable recurrent state, attentional focus and the management of specialized memory systems that persists across different temporal scales — see Slide 6. Individual networks are specialized to serve different types of representation, employing convolutional networks, gated-feedback recurrent networks and specialized embedding models. All of these networks are distributed representations that encode information in high-dimensional vector spaces such that different dimensions can be trained to represent different features allowing attentional mechanisms to emphasize or modify encodings so as to alter their meaning.
These attentional networks are connected to regions throughout the cortex and are trained via reinforcement learning to recognize events worth attending to according to the learned value function. Using extensive networks of connections — both incoming and outgoing, attentional networks are able to create a composite representation of the current situation that can serve a wide range of executive cognitive functions, including decision making and imagining possible futures. The basic idea of a neural network trained to attend to relevant parts of the input is key to a number of the systems that we'll be looking at.
To understand attentional networks, think about an encoder-decoder network for machine translation. As the encoder digests each word in the sequence of words that constitute the input sentence, it produces a representation — Geoff Hinton refers to these as thought clouds in analogy to the iconic clouds that you see in comic strips — of the sentence fragment or prefix that it has seen so far. Because the sentence is ingested one word at a time — generally proceeding from left to right — the resulting thought cloud will tend to emphasize the meaning of the most recently ingested words in each prefix. You could encode the entire input sentence and then pass the resulting representation on to the decoder, but earlier words in the sentence will receive less attention that later words. Alternatively, you could introduce a new network layer that takes as input encodings of all the sentence prefixes seen so far and trains the new layer — thereby taking advantage of the power of gradient descent — to produce a composite representation that emphasizes those parts of the input that are most relevant in decoding / generating the next word in the output.
The programmer's apprentice is implemented as an instance of an hierarchical neural network architecture. It has a variety of conventional inputs that include speech and vision, as well as output modalities including speech and text. In these respects, it operates like most existing commercial personal assistants — see Slide 7. It differs substantially, however, in terms of the way in which the apprentice interacts with the programmer. It is useful to think of the programmer and apprentice as pair programming, with the caveat that the programmer is in charge, knows more than the apprentice does — at least initially, and is invested in training the apprentice to become a competent software engineer. One aspect of their joint attention is manifest in the fact that they share a browser window. The programmer interacts with the browser in a conventional manner while the apprentice interacts with it as though it is part of its body directly reading and manipulating the HTML using the browser API. The browser serves both programmer and apprentice as an encyclopedic source of useful knowledge as well as another mode of interaction and teaching.
The spatial relationships among the ganglion cells in the retina are preserved in the activity of neurons found in the primary visual — or striate — cortex. Most sensory and motor areas maintain similar modality-specific topographic relationships. Shown here, for example, are Wilder Penfield's famous motor and somatosensory homunculi depicting the areas and proportions of the human brain dedicated to processing motor and sensory functions. Scientists have observed that the area devoted to the hands tend to be larger among pianists, while the relevant areas in the brains of amputees typically become significantly smaller — see Slide 10.
We imagine the programmer's apprentice with a body part consisting of an instrumented integrated development environment (IDE). Alternatively you might think of it as a prosthetic device. It is not, however, something that you can simply remove or replace with an alternative device outfitted with a different interface or supporting different functions and expect it to immediately respond to your attempts to control it — it is not a plug-and-play device. Like the legs you were born with or the prosthesis replacing an amputee's severed arm, you have to learn how to use these devices. Architecturally, the apprentice's prosthetic IDE is an instance of a differentiable neural computer (DNC) introduced by Alex Graves and his colleagues at DeepMind. The assistant combined with its prosthetic IDE is neural network that can read from and write to an external memory matrix, combining the characteristics of a random-access memory and set of memory-mapped device drivers and programmable interrupt controllers. The interface supports a fixed number of commands and channels that provide feedback. You can think of it as roughly similar to an Atari game console — see Slide 11.
What is left out of this account so far includes how we might take advantage of semantics in the form of executing code and examining traces in order to better understand the consequences of the changes just made. Presumably, wrapping a code fragment in a function and executing the function with different input to examine changes in the state variables could be used as a distal reinforcement signal providing intermediate rewards useful in debugging subroutines. As pointed out earlier, subroutines designed to modify code are likely to involve many conditional choices and so it is important for subroutine policies to be highly conditioned on the status of specific state variables. Indeed a technique such as model-based parameter adaptation may be perfectly suited to providing such context-sensitive adaptations.
Perhaps this next observation seems obvious, but it is worth keeping in mind that the human brain does a great deal of (parallel) processing that never rises to the level of conscious attention. The executive control systems in the prefrontal cortex don't have micromanage everything. Every thought corresponds to a pattern of activity in one or more neural circuits in the brain or beyond in the peripheral nervous system. One pattern of activity inevitably leads to another in the same or another set of neurons. For example, patterns of activity that begin in the sensory cortex can lead to patterns of activity in the motor cortex and can have consequences elsewhere in the brain, e.g., in the cerebellar cortex resulting in speech, or external to the central nervous system as in the case of neurons that propagate through the peripheral nervous system causing muscles to contract and extend thereby making your limbs and torso move.
Every new observation, every act of creating a new thought or revisiting an old one produces even more activity in the brain resulting in new thoughts some of which are ignored as their reverberations weaken and die and others that spawn new thoughts and proliferate under the influence of reentrant production of activity and the active encouragement of conscious attention in a perpetually self reinforcing, reimagining and self-analyzing cycle of recurrent activity. Meta-reinforcement learning supports the sort of diverse activity one might expect from a system that selects activity to attend to and then makes available in the global workspace for ready access by other systems. Sustaining a collection of such activated circuits would help to provide a context, serve to maintain a stack of policies, guide switching between them, support caching partial results for later use, reconstructing necessary state as needed when restoring policy after a recursive descent.
When you think of building systems that can develop new algorithms it is instructive the think about the simple case of learning to sort lists from input-output pairs. The bubble sort algorithm is generally regarded as the easiest to come up with, but even then it is easier if you start with simple I/O pairs like [A,B] → [A,B], [B,A] → [A,B] and work up to longer lists — referred to as curriculum learning . As Dan Abolafia pointed out in his class presentation, it is relative easy to learn to sort lists of length no more than n, but substantially more difficult to learn an algorithm that works for lists of arbitrary length, without the ability to construct a simple inductive proof of correctness. Logic and basic theorem proving are certainly important in learning to write programs. You might want to look at the Coq proof assistant10 for a glimpse at the future of algorithm development.
|Figure 1: This figure highlights the primary architectural components mentioned in the main text superimposed over an anatomical rendering of the human brain identifying related cortical and sub-cortical landmarks. The triangle and three ovals and match the shape and color conventions employed in O'Reilly  where you will find a substantially more detailed explanation of the underlying biological model. The three gold square shapes denote abstract architectural structures and not anatomical features. The acronyms are expanded and explained here.|
Figure 1 shows a diagram of the human brain overlaid with a simplified architectural drawing. The box shapes represent abstract systems and the oval and triangular shapes represent anatomical features for which we can supply computational models. For example, the box labeled GW represents the global workspace which performs a particular function in the architecture, but actually spans a good portion of the neocortex. Whereas the triangle labeled BG represents a group of subcortical nuclei called the basal ganglia situated at the base of the forebrain.
The box labeled AST represents a form of sensory input corresponding to the ingestion of abstract syntax trees representing code fragments. The oval labeled SMS represents semantic memory and the box labeled DNC corresponds to a differentiable neural computer. When the system ingests a new program fragment the resulting AST is encoded in the SMS as an embedding vector and simultaneously as a set of key-value pairs in the DNC. Here we think of the DNC as a body part or external prosthesis with corresponding maps in the somatosensory and motor cortex that enable reading and writing respectively — see Slides 10 and 11 mentioned earlier.
Our explanation of the architecture proceeds top down, as it were, with a discussion of executive function in the prefrontal cortex. The GW provides two-way connection between structures in the prefrontal cortex and homologous structures of a roughly semantic character throughout the rest of neocortex thereby enabling the PFC to listen in on diverse circuits in the neocortex and select a subset of such circuits for attention. Stanislas Dehaene describes this process as one of the primary functions of consciousness, but we need not commit ourselves to such interpretation here.
Not only does the PFC selectively activate circuits but it can also maintain the activity such circuits indefinitely as constituents of working memory. Since this capability is limited by the capacity of the PFC, the content of working memory is limited and adding new constituents may curtail the activation of existing constituents. In practice, we intend to model this capability using meta-reinforcement learning  (MRL) in which the MRL system relies on the GW network to sample, evaluate and select constitutuent circuits guided by a suitable prior  and past experience and then maintain their activity by a combination of memory networks  and fast weights .
Meta-reinforcement learning serves a second complementary role in the PFC related to executive function. We will refer to the first role as MRL-A for "attention" and the second as MRL-P for "planning". MRL-A is trained to focus attention on relevant new sensory input and new interpretations of and associations among prior perceptions and thoughts. MRL-P is trained to capitalize on and respond to opportunities made available by new and existing constituents in working memory. Essentially MRL-P is responsible for the scheduling and deployment of plans relevant to recognized opportunities to act. These plans are realized as policies trained by reinforcement learning from traces of past experience or constructed on the fly in response to unexpected / unfamiliar contingencies by recovering and reimagining past activities recovered from episodic memory — see Slide 13.
MRL-A and MRL-P could be implemented as a single policy, but it is simpler to think of them as two coupled systems, one responsible for focusing attention by constantly assessing changes in (neural) activity throughout the global workspace, and a second responsible for overseeing the execution of plans in responding to new opportunities to solve problems. MRL-A is as a relatively straightforward reinforcement learning system independently performing its task largely a function of whatever neural activity is going on in the GW, its attentional network and the prior baked into its reward function. MRL-P could be implemented along the lines of the Imagination-Augmented Agent (I2A) architecture  or the related Imagination-Based Optimization  and Imagination-Based Planning  systems.
The remaining parts of the architecture involve the interplay between the PFC and the semantic and episodic memory systems as facilitated by the basal ganglia and hippocampus. If we had a policy pre-trained for every possible contingency, we would be nearly done — let MRL-A draw attention to relevant internal and external activity and then design a simple just-in-time greedy scheduler that picks the policy with the highest reward given the state vector corresponding to the current content of working memory. Unfortunately, the life of an apprentice programmer is not nearly so simple. The apprentice might listen to advice from a human programmer or watch someone solve a novel coding problem or repair a buggy program. Alternatively, it may be relatively simple to adapt an existing policy to work in the present circumstances. However, making progress on harder problems will depend on expert feedback or having an existing reward function that generalizes to the problem at hand.
The hippocampus is perhaps best known for its role in supporting spatial reasoning. A type of pyramidal neuron called a place cell has been shown to become active when an experimental animal enters an area of a maze that it has visited before. However, the hippocampus plays a much larger role in memory by representing not just the "where" of experience but also the "when". The manner in which we employ short- and long-term memory is very different. We might construct a representation of our current situation in short-term memory, drawing upon our long-term memory to provide detail.
The two memory systems are said to be complementary in that they serve different purposes, one provides an archival record of the past while the other serves as a scratchpad for planning purposes. In retrieving a memory there is a danger that we corrupt the long-term memory in the process of subsequent planning. This isn't simply an academic question, it is at the heart of how we learn from the past and employ what we've learned to think about the future. Our subtle memory systems enable us to imagine solutions to problems that humans have never faced, and account for a good deal of our incredible adaptivity. In several lectures, we will explore architectures that support such flexibility — see Slide 14.
The basal ganglia in cognitive models such as the one described by Randall O'Reilly's in his presentation in class, play a central role in action selection. This seems like a good opportunity to review how actions are represented in deep-neural-network implementations of reinforcement learning. Returning to our default representation for the simplest sort of episodic memory, (st,at,rt,st+1), it’s easy to think of a state s as a vector s ∈ ℝn and a reward r as a scalar value, r ∈ ℝ, but how are actions represented?
Most approaches to deep reinforcement learning employ a tabular model of the policy implying a finite — and generally rather small — repertoire of actions. For example, most of the experiments described in Wayne et al  (MERLIN) six-dimensional one-hot binary vector that maps a set of six actions: move forward, move backward, rotate left with rotation rate of 30, rotate right with rotation rate of 30, move forward and turn left, move forward and turn right. The action space for the grid-world problems described in Rabinowitz et al  (ToMnets) consists of four movement actions: up, down, left, right and stay — see Slide 15.
The programmer's apprentice (PA) operates on programs represented as trees, where the set of actions includes basic operations for traversing and editing trees — or more generally directed-graphs with cycles if you assume edges in abstract syntax trees corresponding to loops, recursion and nested procedure calls, i.e., features common to nearly all the programs we actually care about. We still have a finite number of actions since for any given project we can represent the code base as a directed-acyclic graph with annotations to accommodate procedure calls and recursion, and use attention to direct and contextualize a finite set of edit operations11.
Pritzel et al  employ a semi-tabular representation of an agent's experience of the environment possessing features of episodic memory including long-term memory, sequentiality and context-based lookups. The representation called a differential neural dictionary (DND) is related to Graves et al  DNC. The programmer's apprentice is better suited to Vinyals et al  related idea of a pointer-network designed to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence — see related work in natural language processing by Merity et al  on pointer sentinels.
|Figure 2: The sequence-to-sequence encoder-decoder attentional model shown here uses a specialized memory called a pointer network to construct a short summary of a source document by flexibly combining phrases from the source document with words from its existing vocabulary. For each decoder timestep a generation probability Pgen ∈ [0, 1] is calculated, which weights the probability of generating words from the vocabulary, versus copying words from the source text. The vocabulary distribution and the attention distribution are weighted and summed to obtain the final distribution, from which we make our prediction. Note that out-of-vocabulary article words such as "2-0" are included in the final distribution. — adapted from See et al .|
One approach involves representing a program as an abstract syntax tree and performing a series of repairs that involve replacing complete subtrees in the AST. It might be feasible to use some variant of the pointer-network concept, e.g., ,  and  or neural programmer framework , but there are limitations with all of the alternatives I've run across so far, requiring additional innovation to deal with the dynamic character of editing AST representations, but at least the parsing problem is solved for us — all we have to do is make sure that our edits maintain syntactic well-formedness.
Most of the existing pointer-network applications analyze / operate on a fixed structure such as a road map, e.g., examples include the planar graphs that Oriol Vinyals addresses in his paper , recognizing long-range dependencies in code repositories , and annotating text to support summarization . Student projects focusing on program-repair might try ingesting programs using an LSTM, creating a pointer-network / DNC-like representation of the AST and then altering selected programs by using fragments from other programs, but be advised this approach will likely require inventing extensions to existing pointer-network techniques.
One possibility for training data is to use the ETH / SRI Python dataset that was developed by Veselin Raychev as part of his thesis on automated code synthesis12. Possible projects include designing a rewrite system for code synthesis based on NLP work from Chris Manning's lab led by Abigail See focusing on text summarization leveraging pointer networks — see Figure 2 for a schematic description of the model from their paper . Further afield are program synthesis papers that work starting from specifications like Shin et al  out of Dawn Song's lab or recent work from Rishabh Singh and his colleagues .
Another possibility is to use RL to learn repair rules that operate directly on the AST using various strategies. It's not necessary in this case to represent the AST as a pointer network, but, rather, take the expedient of simply creating a new embedding edited AST after each repair. We can generate synthetic data by taking correct programs from the ETH / SRI dataset and introducing bugs and then use these to generate a reward signal, with harder problems requiring two or three separate repairs.
It might also be worth exploring the idea of working with program embedding vectors in a manner similar to performing arithmetic operations on word vectors in order to recover analogies — see the analysis of Levy and Goldberg  in which they demonstrate that analogy recovery is not restricted to simple neural word embeddings. For example, given the AST for a program P with subtree Q and two possible repairs that correspond to replacing Q with either R or R', can we determine which is the better outcome A = P - Q + R or A' = P - Q + R' and might it serve as a distal reward signal to expedite training?
I also recommend Reed and de Freitas  for its application of the idea of using dynamically programmable networks in which the activations of one network become the weights (program) of another network. The authors note that this approach was mentioned in Sigma-Pi units section of Rumelhart et al , appeared in Sutskever and Hinton  in the context of learning higher order symbolic relations and in Donnarumma et al  as the key ingredient of an architecture for prefrontal cognitive control.
Michael Graziano's presentation on machines that incorporate an internal model of what consciousness is and attribute that model to themselves and others to make predictions about human behavior 13.
Randall O'Reilly's presentation on learning mechanisms that rely on a computational model of the prefrontal cortex to control both itself and other brain areas in a strategic, task-appropriate manner 14.
Matt Botvinick's presentation describing a new model of reward-based learning in which a traditional dopamine system trains the prefrontal cortex to operate as its own free-standing learning system 16.
The brain didn't evolve to accommodate language, rather, language evolved to accommodate the brain . Biological and cognitive constraints determine what types of linguistic structure are learned, processed and transmitted from person to person and generation to generation. Language acquisition has comparatively little to do with linguistics and is probably best viewed as a form of skill acquisition. Indeed, we are constantly processing streams of sensory information into successively more abstract representations while simultaneously learning to recode the compressed information into hierarchies of skills that serve our diverse purposes [15, 16].
Contrary to what some textbook authors might think, students learn to code by writing programs, a process that can be considerably accelerated by timely communication with peers and invested collaborators. In the case of unequal skill levels, communication tends to be on the terms of the more capable interlocutor, and the onus of understanding on the less capable partner in the collaboration. To sustain the collaboration, we need to bootstrap the apprentice to achieve a reasonble threshold level of competence in both language and in working with computers so as to compensate for expert programmer's investment in effort. From a value-proposition perspective, the apprentice has to provide net positive benefit to the programmer from day one.
It is important to keep in mind that any program specification whether in the form of input/output pairs or natural language descriptions when communicated between individuals of differing expertise is just the beginning of a conversation. Experts are often careless in specifying computations and assume too much of the student. Students, on the other hand, can surprise experts with their logical thinking while frustrate and disappoint with their difficulty in handling ambiguity and analogy, particularly of the esoteric sort familiar to professional programmers. Input from an expert will typically consist of a stream of facts, suggestions, heuristics, shortcuts, etc., peppered with clarifications, interruptions and other miscellany.
We propose a hybrid system for achieving competence in communicating programming knowledge and collaboratively generating software by creating a non-differentiable conventional dialogue management system that works in tandem with a differentiable neural-network dialogue system (NDS) that will become more competent as it gains experience coding and collaborating with its expert partner. The deployment of these two language systems will be controlled on a sentence-by-sentence basis by a meta-reinforcement learning (MRL) system that will depend less and less on the more conventional system but likely never entirely eclipse its utility. The MRL system subsumes the role of a beam search or softmax layer in an encoder-decoder dialogue model.
The conventional system will be built as a hierarchical planner17 following in the footsteps of the CMU Ravenclaw Dialogue System  and will come equipped with a relatively sophisticated suite of hierarchical dialogue plans for dealing with communication problems arising due to ambiguity and misunderstanding. While clumsy compared to how humans handle ambiguity and misunderstanding, these dialogue plans are designed to resolve the ambiguity and mitigate the consequences of misunderstanding quickly and get the conversation back on track by attempting various repairs involving requests for definition, clarification, repetition and restatement in as inconspicuous manner as possible .
The conventional dialogue system will also include a collection of hierarchical plans for interpreting requests made by the expert programmer to alter programs in the shared editor space, execute programs on specified inputs and perform analyses on the output generated by the program, debugger and other tools provided in the integrated development environment (IDE) accessible through a set of commands implemented as either primitive tasks in the non-differentiable hierarchical planner or through a differentiable neural computer (DNC) interface using the NDS system that will ultimately replace most of the functions of the hierarchical-planner-based dialogue management system.
This dual mode dialogue system and its MRL controller allows the apprentice to practice on its own and rewards it for learning to emulate the less-flexible hierarchical-planner implementation. Indeed there is quite a bit that we can do to preload the apprentice’s basic language competence and facility using the instrumented IDE and related compiler chain. A parallel dialogue system implemented using the same hierarchical planner can be designed to carry out the programmer’s side of the conversation so as to train the NDS system and the meta-reinforcement learning system that controls its deployment utterance by utterance. We can also train domain-specific language models using n-gram corpora gleaned from discussions between pair programmers engaged in writing code for projects requiring the same programming language and working on similar programming tasks.
In a collaboration, figuring out what to say requires planning and a certain degree of imagination. Suppose you are the apprentice and you want to tell the programmer with whom you're collaborating that you don't understand what a particular expression does. You want to understand what role it plays in the program you are jointly working on. How do you convey this message? What do you need to say explicitly and what can be assumed common knowledge? What does the programmer know and what does she need to be told in order to provide you with assistance?
Somehow you need to model what the programmer knows. In planning what to say, you might turn this around and imagine that you're the programmer and ask how you would respond to an apprentice's effort to solicit help, but in imagining this role reversal you have be careful that you don't assume the programmer knows everything that you do. You need a model of what you know as well as a model of what the programmer knows. This is called Theory of Mind (ToM) reasoning and learning how to carry out such reasoning occurs in a critical stage of child development.
Shared knowledge includes general knowledge about programming, knowledge about the current state of a particular program you are working on, as well as specific details concerning what you are attending to at the moment, including program fragments and variable names that have been mentioned recently in the discussion or can be inferred from context. This sort of reasoning can be applied recursively if, for example, the apprentice wants to know what the programmer thinks it knows about what the apprentice knows. To a large extent we can finesse the problem of reasoning about other minds by practicing transparency, redundancy and simplicity so that both parties can depend on not having to work hard to figure out what the other means. However, there are some opportunities in the programmer's apprentice problem for applying ToM reasoning to parts of the problem that cannot be so easily finessed.
Suppose that the apprentice has started a new program using an existing program P following a suggestion by the expert programmer. Realizing that the body of a loop in P is irrelevant to the task at hand, the apprentice replaces the useless body B with a fragment from another program that does more or less what is required and then makes local changes to the fragment to create a new body B′ so that it works with the extant loop variables, e.g., loop counter, termination flag, etc. When the assistant has completed these local changes, the programmer intervenes and changes the name of a variable in B′. What induced the programmer to make this change?
The programmer noticed that the variable in B′ was not initialized or referenced in P but that another variable that was initialized in P and is no longer referenced — it only appeared in the original loop body B, is perfectly suited for the purposes of the new program. Assume for the sake of this discussion, that the programmer does not explain her action. How might the assistant learn from this intervention or, at the very least, understand why it was made? A reasonable theory of mind might assume that agents perform actions for reasons and those reasons often have to do with preconditions for acting in the world, and, moreover, that determining if action-enabling preconditions are true often requires effort. A useful ToM also depends on having a model allowing an agent to infer how preconditions enable actions by working backward from actions to enabling preconditions.
Imagine the following scene, there's a man holding the reins of a donkey harnessed to a two-wheeled cart — often called a dray and its owner referred to as a drayman — carrying a load of rocks. He makes the donkey rear up and by so doing the surface of the cart tilts, dumping the rocks onto the road which was clearly his intention given the appreciative nods from the onlooking pedestrians. This short video illustrates that, while this might seem an unusual way of delivering a load of rocks, most people think they understand exactly how it was done. Not so!
The fact is that, as with so many other perceptual and conceptual tasks, people feel strongly that they perceived or understood much more than in fact they did. For example, most people would be hard-pressed to induce a donkey to rear up and, if you asked them to draw the donkey harnessed to the cart with its load of stone, they would very likely misrepresent the geometric relationships involving the height of the donkey, how the harness is attached, how far off the ground the axle is located, the diameter of the wheels and the level of the cart surface and center of gravity of the load with respect to the axle's frame of reference. In other words, they would not have — and possibly never could have — designed a working version of the system used by the drayman.
Now imagine that the drayman has a new apprentice who was watching the entire scene with some concentration, anticipating that he might want to do the very same thing before the first week of his apprenticeship is complete. Sure enough, the next day the drayman tells the apprentice to take a load of bricks to a building site in town where they are constructing a chimney on a new house. He stacks the bricks in a pile that looks something like how he remembers the rocks were arranged on the dray the day before. Unfortunately the load isn't balanced over the axle and almost lifts the donkey off its feet. After some experimentation he discovers how to balance the weight so the donkey can pull the load of bricks without too much effort.
When he finally gets to the building site, he nearly gets trampled by the donkey in the process of repeatedly trying to induce the distressed animal to rear up on its hind legs. Finally, one of the brick masons intervenes and demonstrates the trick. Unfortunately, the bricks don’t slide neatly off the dray as the rocks did for the experienced drayman the day before, but instead the bricks on the top of the stack tumble to the pavement and several break into pieces. The helpful brick mason suggests that in the future the assistant should prepare the dray by sprinkling a layer of sand on the surface of cart so that the bricks will slide more freely and that he should also dump the bricks on a softer surface to mitigate possible breakage. He then helps the assistant to unload the rest of the bricks but refuses to pay for the broken ones, telling the assistant he will probably have to pay the drayman to make up for the difference.
An interesting challenge is to develop a model based on what is known about the human brain explaining how memories of the events depicted in the video and extended in the above story might be formed, consolidated, and, subsequently, retrieved, altered, applied and finally assigned a value taking into account the possible negative implications of damaged goods and destroyed property. In the story above, the assistant initially uses his innate "physics engine" to convince himself that he understands the lesson from the master drayman, he then uses a combination of his physical intuitions and trial-and-error to load the cart, but runs up against a wall due to his unfamiliarity with handling reluctant beasts of burden. Finally, he gets into trouble with laws of friction and the quite-reasonable expectations of consumers unwilling to pay for damaged goods.
We don't propose to solve the general problems of theory-of-mind and physics-based reasoning in developing a programmer's apprentice, though the application provides an interesting opportunity to address particular special cases. As mentioned earlier, the stream of conversation between the assistant an expert programmer will inevitably relate to many different topics and specialized areas of expertise. It will include specific and general advice, reasons for acting, suggestions for what to attend to and a wide range of comments and criticisms. Several recent approaches for combining planning and prediction, especially in the case of partially observable Markov decision processes, are particularly promising for this application [37, 28, 30, 83, 73].
The apprentice will want to separate this information into different categories to construct solutions to problems that arise at multiple levels of abstraction and complexity during code synthesis. Or will it? We like to think of knowledge neatly packaged into modules that result in textbooks, courses, monographs, tutorials, etc. The apparent order in which activities appear in a stream of activities is largely a consequence of the context in which those activities are carried out. They may seem to arise in accord with some plan, as if assembled and orchestrated with a particular purpose in mind, but, even if there was plan at the outset, we tend to make up things on the fly to accommodate the sort of unpredictable circumstances that characterize most of our evolutionary history.
In some cases that context or purpose is used to assign a name, but that name or contextual handle is seldom used to initiate or identify the activity except in academic circumstances where divisions and boundaries are highly prized and made much of. The point of this is that in a diverse stream of activities — or utterances intended to instigate activities — credit assignment can be difficult. Proximity in terms of the length of time or number of intervening activities between a action and a reward is not necessarily a good measure of its value. We suggest it is possible to build a programer's apprentice or other sort of digital assistant that performs its essential services primarily by learning to predict actions, their consequences and their value from observing such a diverse stream of dialog intermixed with actions and observations.
When the programmer tells the assistant to replace the name of a variable in one location in a program with the name of a variable in another location, the process starts with a contextually rich representation in the programmer’s brain corresponding to the activation of millions or billions of neurons in circuits distributed broadly throughout the cortex. This pattern of activation is compressed into a more compact representation used to generate a sequence of words uttered one at a time as if condensing out of a cloud of commingled thoughts in droplets or phrasal showers uttered in sudden bursts of words that are — ignoring the intervening stages of speech production in the programmer and auditory processing in the assistant — subsequently converted into activations in peripheral subnetworks of the assistant and quickly propagate to other subnetworks throughout the assistant’s neural-network architecture.
The resulting activations insinuate fractal patterns of meaning into broadly distributed subnetworks of the apprentice subtly altering activity in some and substantially altering activity in others, contributing to the formation of another contextually rich representation in the apprentice’s brain. The imperative conveyed by the programmer’s tone of voice produces a quick response. The apprentice performs a sequence of well rehearsed steps that involve activating a sequence of patterns in the non-differentiable interface connecting the assistant to the integrated development environment. This sequence is produced by recurrent networks operating much like the programmer’s speech production circuits. The resulting patterns produce a sequence of unambiguous words — requiring no additional context to interpret, that immediately produce the desired change and are displayed on the screen shared by the programmer and the apprentice.
What does this combination of expert-human biological computing, differentiable connectionist models and non-differentiable symbolic systems buy us? The human expert provides heuristic advice and technical guidance. The connectionist components enable natural language communication between human and machine and enable the system to discover and exploit structure in computer programs to facilitate code synthesis and reduce brute search. Finally, the symbolic components allow the connectionist components — and, by extension, the human expert — to directly engage with computers as prosthetic extensions in which compiling and running code is as natural as playing video games.
Neil Rabinowitz's presentation on learning a machine theory-of-mind model that relies on meta-learning to build mental models of the agents that it encounters from observations of their behaviour alone 18.
Greg Wayne's presentation on MERLIN a method for prediction in environments corresponding to partially observable Markov decision processes in which memory formation is guided by predictive modeling 19.
Oriol Vinyals' presentation on an approach for model-based plan construction, evaluation and execution applied to sequential decision making problems relying on a method of imagination-based forecasting 20.
Devi Parikh's public lectures on learning to conduct meaningful dialog with humans in natural, conversational language by grounding the conversation in shared visual experience, inferring its context from history 21.
Gulwani et al  provide an up-to-date survey of challenges and technologies in automated program synthesis. Machine learning is allocated only 10 of the more than 100 pages in the review with neural network-methods singled out as a separate section of 6 additional pages. My experience is that many current devotees are largely ignorant concerning the relevant history including the many successful applications for handling interesting special cases. I won't attempt to remedy that state of affairs here except to recommend that readers interested in program synthesis spend the time to familiarize themselves with the relevant work including both successes and notable failures.
For those of you primarily interested in neural-network methods, understanding this background is likely to prove useful in developing hybrid systems that combine connectionist models and more conventional symbolic methods. Already, the power of high-dimensional vector-space representations, context-sensitive embedding spaces and fully-differentiable models trained with backpropagation is winning converts among advocates of more traditional methods, even as the latest generation of neural-network experts working on what is popularly called neural program synthesis are rediscovering some of the same special cases that have been exploited in deductive and constraint-based approaches to automated code synthesis.
Given the title, you might have expected that this document would be all about automatic programming, but it's really about human augmentation and human-computer collaboration. Programming is all about representing procedural knowledge in an interpretable form, i.e., executable on some computational substrate. Teaching someone to program or, for that matter, teaching someone to do most anything nontrivial, is also about representing and communicating procedural knowledge in an interpretable form — the clearer and more precise the communication, the simpler and less knowledgeable the required substrate.
In this section, we attempt to identify special cases in which it is relatively easy for a human programmer to instruct an AI system how to facilitate the conversion of thoughts conveyed in natural language into working programs — see Graham Neubig's CS379C calendar page for a sample of his NLP work on code generation and learning dialog systems. We are not anticipating that we will be able to read minds so much as exploit shared knowledge in order to translate succinct descriptions of desired computations — along with an implicit understanding of what constitutes a suitable computational solution — into working software possibly with guarantees of its correctness and performance.
How does this perspective inform the discussion in this section? Prior to any training the apprentice comes equipped with a very basic language facility and an innate ability to work with computer programs, but the former is practically useless since the meta-learning controller hasn't been trained and the latter only produces results in the same way that a baby has virtually no control over its limbs and torso. It learns to communicate by reinforcement when it successfully carries out a command from the expert programmer. A built-in training system expedites this process and relieves some of the burden from the programmer by generating synthetic commands that serve to initialize the meta-learning controller allowing the assistant to achieve a basic facility for directly translating natural language commands exercising the integrated development environment.
The apprentice also comes pre-trained with a (semantic) language model trained on a corpus that includes a large vocabulary of practical and technical terms used by programmers in talking about programs and programming. The apprentice ingests a large corpus of programs and program fragments written in the language that it will use for writing new programs, resulting in an embedding space that encodes these programs and an encoder-decoder translation facility that allows it to read and write programs represented as abstract syntax trees. Were there more computer- / natural-language parallel corpora, supervised training of systems that depend on natural-language specification as input would be tempting. As it is, we have to rely on more subtle strategies.
Perhaps it is not surprising that many current neural-network approaches take advantage of neural-network technology originally designed for machine translation, question answering and related natural-language processing applications. Programs are linguistic objects with relatively simple syntax. Their syntax is well understood since linguistically adept computer scientists designed them. Parsing programs is trivial using standard compiler tools. Their semantics is revealed by running them, and not just their input-output behavior — execution traces can be used to reveal the meaning of every function and fragment.
Programming languages are syntactically unforgiving — a source of aggravation for beginning programmers, but we have sophisticated editors and syntax checkers that avoid most problems and there is no reason not to build them into the integrated development environment used by the programmer's assistant and shared with the programmer. Internally, the assistant can work with equivalent abstract syntax trees making it easy to ingest, embed, manipulate and generate proposals for performing program transformations on such representations. Human readable code can be recovered for the programmer's convenience.
Successful code synthesis can be verified by running the program and comparing with the provided input-output examples. Modulo the intractability of the halting problem, failure for relatively simple programs can be easily determined. Program correctness is generally specified with respect to a specification and hence is more difficult to pin down, though the term generally implies the existence of a mathematical proof and, hence, one formal-methods approach to code synthesis involves generating a constructive proof of correctness.
Semantic embeddings are increasingly common [25, 18, 82, 87, 63] based on execution traces, event logs, or program invariants. See this discussion log entry and Rishabh Singh's CS379C calendar page plus this discussion log entry and Dawn Song's CS379C calendar page for more on semantic embeddings. I'm not going into detail here for the simple reason that it is still early days and your best approach to learn more is to read the papers, check out the slides and watch the videos mentioned in this paragraph and at the end of this section — you might also take a look at Danny Tarlow's list of selected program synthesis papers here.
In the Spring 2018 instance of Stanford CS379C, 30 students, organized in 12 teams proposed and carried out projects relating to the programmer's apprentice. As an example, the team consisting of Marcus Gomez, Nate Gruver, Michelle Lam, Rohun Saxena and Lucy Wang  decided based on their coding habits that the most helpful assistant would be one that could be trained to understand design through the structure of an application programming interface (API) and assist in the completion of the code through a divide-and-conquer approach PDF.
|Figure 3: Summary of the proposed LSTM-based state-embedding architecture for Cf described in Gomez et al . The code representation, Cf, is by default variable in size since the abstract syntax trees that comprise the individual helper functions are variable size. The authors solve this problem by training an autoencoder penalized on reconstruction loss such that the decoded output is the binary tree representation of the original AST.|
The system they envision would be designed to integrate seamlessly into programmer's workflow, minimize interaction and start from a simple header file and dependency graph. They took pains to standardize the format of the input in order to simplify learning a compressed fixed-size state vector as an ordered list of GLoVe vectors  — see Figure 3. They employ the method of generative adversarial imitation learning (GAIL) developed by Ho and Ermon  to extract a policy from data as if it were obtained by reinforcement learning following inverse reinforcement learning. GAIL should enable them to derive a model-free imitation-learning algorithm that obtains significant performance gains over existing model-free methods in imitating complex behaviors in large, high-dimensional environments.
Episodic memory is an important component of the programmer's apprentice. In order to exploit what you've learned through experience and previously consolidated in episodic memory, it is often necessary to reconstruct memories in enough detail that you can compare the past with current experience, determine if activities applied to resolve problems in the past apply in the present, and, if necessary, adapt those early responses to fit the present circumstances. Catherine Wong, in her final project report  for CS379C, developed a new neural-network model for content-based and selectively-reconstructive memory inspired by research on hippocampal indexing theory  and adaptive deconvolutional networks [90, 89] PDF.
In designing her model she built upon the Kavukcuoglu et al  work on convolutional predictive sparse decomposition and Kingma and Welling  on explicit feature disentanglement using variational autoencoders. While previous work has focused on deciding how to remember , Catherine's model emphasizes deciding what to remember, focusing on the problem of leveraging compressive coding to balance the demands of efficient and effective content-based lookup with domain specificity. Her work is also closely related to and consistent with recent research [48, 27, 55] on the role of high-frequency thalamic sleep spindles during sleep-dependent memory consolidation22.
Technology for automatic software repair is becoming an important component of software maintenance and automatic code synthesis . Maurice Chiang, Yousef Hindy, Peter Lu, Sophia Sanchez and Michael Smith  developed a neural network architecture to support error correction in abstract syntax trees that can be used as a code repair module in a more general neural code synthesis system PDF. Their approach is novel in its use of multiple strategies realized as separate submodules, coordinated by a master controller, so that the whole system is trained end-to-end after training each of the submodules independently.
The Lu et al work  takes advantage of a number of recent innovations in neural code synthesis. Specifically they leverage the Cai et al  work on using recursion to generalize programs to handle novel inputs. They employ a version of Pascanu et al  model-based planning (IBP) that accepts partially complete or incorrect ASTs, extending the IBP controller to coordinate the separate submodules. They also propose a strategy for handling intermediate reward signals to compensate for the relative sparsity of ground truth data and a clever approach to comparing programs that relies on normalizing and comparing stack traces computed from input-output pairs [12, 39, 74].
These three projects illustrate how ideas from such seemingly disjoint disciplines as cognitive neuroscience and theoretical computer science can come together to create sophisticated new technologies23. Unfortunately, competition between disciplines has resulted in convenient lapses in memory exacerbated by strategic internal rebranding of ideas. The current period of increased collaboration illustrates the advantages of creating and maintaining cross-disciplinary ties. This generation of students is being exposed to ideas from a broad range of ideas relating to synthetic and biological computing and will hopefully pass on their interests to their students and more narrowly focused colleagues.
These innovative student projects target important problems that we face in designing complex collaborative AI systems such as the programmer's apprentice. Their proposed solutions highlight the variety and scalabilty of the architectural components now available that allow for the creative manipulation of the ungainly objects that comprise the inputs and outputs of software engineering practice, e.g., natural language specifications of arbitrary format, the encoding, analysis and procedural extraction of programming knowledge from dialog, and dealing with computer programs of arbitrary size and complexity and computed artifacts such as execution traces that defy explicit encoding due to their size and format variability.
The Programmer's Apprentice is AI complete in the sense that it is equivalent to that of making computers at least as intelligent as human beings. It isn't necessarily the part relating to code synthesis — though, depending on how you define fully automatic code synthesis starting from a natural language description, that too could be said to be AI complete. Assuming we are satisfied to build a system capable of assisting rather than replacing a programmer, then the hard problem is in the ability to interact easily with humans.
This is not to say that we can't build highly capable assistants in the near term. We simply have to constrain the problem appropriately, and I believe we can do much better than the current breed of personal assistants in designing a programmer's assistant that can considerably increase the productivity of professional software engineers and enable reasonably competent programmers to become much more effective. The speakers contributing to class discussions and students working on related projects demonstrate that many of the crucial features required of a programmer's assistant are within our grasp.
Such features include speaker- and task-dependent management of episodic memory including memory consolidation, procedural abstraction and control-policy / strategy integration, meta-level control of reinforcement learning and policy application integrating multiple sources of knowledge pertaining to programming and maintaining continuous, problem-focused dialog, and basic programming capabilities including program repair and simplification, identifying and repurposing existing program fragments to generate new programs, and translating natural language program specifications into working programs.
If you are serious about wanting to build some variant of digital amanuensis to assist in writing programs, the good news is that there are a lot of tools you can leverage. There is also some excitement and optimism derived from the scattered successes in applying modern connectionist models to the problem of automatic code synthesis. For those with the perspective to see beyond the new advances, there is the distinct possibility of combining the deep learning methods that have recently shown such promise with the considerable body of work on deductive and statistical methods applied to automatic program synthesis .
Daniel Abolafia's presentation on iterative optimization for program synthesis in the presence of a reward function over the output of programs, where the goal is to find programs with maximal rewards 24.
Graham Neubig's presentation on a novel neural architecture for parsing natural language descriptions into source code powered by a grammar model to explicitly capture the target syntax as prior knowledge 25.
Rishabh Singh's presentation on using a strong statistical model for semantic code repair to predict bug locations and exact fixes without access to information about the intended correct behavior of the program. 26.
Dawn Song and Xinyun Chen's presentation on program synthesis from input-output examples, tree-to-tree neural networks for program translation, and attention for program synthesis from natural language descriptions 27.