Complex behaviour emerges from interactions between objects produced by different generating mechanisms. Yet to decode their causal origin(s) from observations remains one of the most fundamental challenges in science. Here we introduce a universal, unsupervised and parameter-free model-oriented approach, based on the seminal concept and the first principles of algorithmic probability, to decompose an observation into its most likely algorithmic generative models. Our approach uses a perturbation-based causal calculus to infer model representations. We demonstrate its ability to deconvolve interacting mechanisms regardless of whether the resultant objects are bit strings, space–time evolution diagrams, images or networks. Although this is mostly a conceptual contribution and an algorithmic framework, we also provide numerical evidence evaluating the ability of our methods to extract models from data produced by discrete dynamical systems such as cellular automata and complex networks. We think that these separating techniques can contribute to tackling the challenge of causation, thus complementing statistically oriented approaches.

A preprint version of the article is available at ArXiv.

We have introduced a suite of algorithms based on mathematical notions acknowledged to fully characterize the concept of randomness1 . These algorithms enable us to study the algorithmic information dynamics of evolving systems, and to devise methods of reducing the dimensions of data2 grounded on the first mathematical principles of randomness. Deconvolution of data by generative sources can be viewed as the ultimate goal of some supervised and unsupervised machine learning algorithms. This is a hard problem. Unsurprisingly, these approaches often lose sight of their goal of causal decomposition and instead seek to identify shared features of data, construed as evidence of their possible common origin. For example, in signal processing, popular methods such as k-means3 or k-medoids4 define heuristics based on the minimization of distance among data points according to a specific metric. Other popular methods, such as support vector clustering5 and some traditional machine learning techniques, draw on probability distributions, regression and correlation techniques to achieve linear separation and produce different groups. For instance, most deep neural networks use gradient-based learning techniques to statistically map elements to a differentiable landscape. Another type of separation method relies on graph-theoretic properties. Methods that separate graphs by indices such as edge betweenness or the frequency of over-representation of certain subgraphs (for example, see ref. 6), or by more sophisticated criteria such as shared graph spectral features, fall into this category7,8,9,10. All these methods make the assumption that, because the objects they study share statistical, topological or algebraic features11, such objects may be generated by the same means or from the same sources.

Despite their wide use and popularity, current approaches to causation are still both fundamentally and pragmatically dependent on linear regression and correlation tests, ranging from Granger’s causality12 to transfer entropy13 and Pearl’s interventionist do-calculus14. Nevertheless, considerable progress has been made in these areas, in particular through the introduction of Pearl’s interventionist do-calculus14, even when they are limited by their high dependency on probability distributions.

Our approach contributes to the discussion of disruptive techniques for introducing symbolic computation and causation into machine learning so as to better deal with hierarchically structured data and inductive inference. To this end, we combine techniques from perturbation analysis as introduced by Pearl14 and algorithmic probability15.

For illustration purposes, let us consider the area of convolutional neural networks, one of the most promising approaches to image classification in machine learning, in which a set of primitive features is extracted from a distribution of images. In contrast, or in addition to convolutional neural networks, algorithmic causal deconvolution requires separating features by their most likely common generative mechanisms rather than by their most discriminative statistical features, with those combinations that have the greatest variance serving to distinguish an image from all others. Synthesizing generative computer programs rather than using traditional pattern recognition amounts to producing a truly algorithmic explanatory generative model based on a deeper understanding of a causal mechanism than is possible through (non)linear regression. By way of analogy, storing angles in a deep convolutional neural network layer from a distribution of images does not (necessarily) mean that these angles put together in some order are responsible for generating the object itself, just as the digits of the Fibonacci sequence do not build the Fibonacci sequence by simply arranging themselves in the right order (something that may be difficult to scale and generalize), but are constructed from an algorithmic process implementing a formula able to produce the digits of the Fibonacci sequence in the natural order according to a generative process.

We complement our work with techniques borrowed from classical information theory when we cannot do better, but at the core of our approach is the seminal concept of Solomonoff induction15.

The notion of deconvolution is similar and related to the notion of decomposition in methods of pattern recognition16, classical information theory17 (a survey can be found in ref. 18) and clustering by lossless compression19 such as those methods based on information distances20 and compression—for instance, the so-called ‘normalized information distance’21, and its related measure, the ‘normalized compression distance’22, and other variations.

Classical information theory23 has provided techniques for capturing and encoding statistical properties from data that affect almost all areas of science. For example, mutual information captures various averages based on associated distributions of statistical properties that are contained in one variable about another; that is, it captures how information can be combined and decomposed in purely statistical terms. Recent developments based on information decomposition have been introduced17,24 with the purpose of separating multivariate signals into their alleged generative sources. A proposal that has gained some traction is the so-called ‘partial information decomposition’17, which falls short24, among other reasons, because it can tell only what a variable can statistically reveal about some other variable.

Our approach can be viewed as replacing the methods used in computational mechanics25,26 (traditionally based on, for example, Markov processes and Bayesian inference) with a measure based on algorithmic probability and an empirical estimation of the so-called universal distribution27,28 (the distribution associated with algorithmic probability), while preserving the spirit of computational mechanics. In another related line of investigation into inductive inference, albeit mostly of a theoretical nature, there are approaches (such as AIXI) that combine algorithmic probability with decision theory, replacing the prior with the universal distribution29, something that we have also proposed in the context of algorithmic cognition research30.

In actual deployments, however, many approaches circumvent uncomputability and intractability by relying heavily on popular lossless compression algorithms such as Lempel–Ziv–Welch (LZW), minimum description length31, Monte Carlo search and Markov processes, thereby effectively adopting weaker models of computation. Another related theoretical approach is Levin’s search32 and variations33, based on a dovetailing algorithm interleaving computer programs one step at a time, from shortest to longest, with each program assigned a fraction of time proportional to its probability during each iteration. Resource-bounded algorithmic (Kolmogorov–Chaitin) complexity34 is also related to our approach. It imposes an upper bound on program length, which effectively adopts a linear finite automaton computational model35, which we circumvent by allowing improvements while increasing the running time (though in practice each calculation is restricted to a resource-bounded calculation). In another category are methods of inductive inference such as computational measures of information gain and reinforcement36,37.

One strength in our approach that Solomonoff himself defended38 is that if machines are going to have a problem-solving capability similar to those of humans, machines should not start from scratch every time for each new problem. Therefore, we precompute and store estimations of the universal distribution from explorations of large sets of small computer programs that are able to explain small pieces of data. Each computer program represents a discrete generative model of its corresponding piece of data, which, when assembled in sequence, represents a full model. The precomputation allows practical applications to be implemented in linear time by querying a look-up table27,28 acting as a working (Turing machine) tape and exchanging memory for computational time. This technique, combined with classical information theory in the way that the sequence model is assembled, provides key hints on the algorithmic content of a piece of data39. The estimations of our approach, denoted ‘BDM’, can always be improved on by running the original uncomputable process, with the advantage that partial results exponentially decrease the multiplicative constant involved in approaches such as Levin’s search. Our BDM approach (see Supplementary Information) has been shown to go beyond statistical capabilities39, capturing features that cannot be grasped by methods such as Shannon entropy, traditional pattern recognition, or lossless compression.

Our approach to deconvolution, based on algorithmic information dynamics1, is a bottom-up approach deeply rooted in Solomonoff’s inductive inference15,40,41; it is motivated by previous approaches, but conceived and designed for scalability without compromising the power of the computational model too early on. Behind the number or sequence of numbers matching observations or data to a complexity value using our methods, we also offer access to the generating rules explaining the data as candidate models that can be used for validation against present and future data, thus enabling predictions.

At the core of algorithmic information dynamics1 is the process of finding algorithmic candidate models in the form of computer programs able to explain pieces of observed data and group them by same/similar source or generating mechanism, using as a guide the length of the set of computer programs that produce each piece of data when a larger piece of data is decomposed in all possible ways by perturbation analysis. The main point is that if a computer program generates the data, different regions of the same data can be explained by the same algorithms.

Deconvolution algorithms

The algorithm in the context of networks is as follows. Let G be a graph and let E = E(G) denote the set of edges. Let G\e denote the graph obtained after deleting an edge e from G. Let C(G) be the estimation of the algorithmic complexity of G (see Supplementary Information). The ‘information contribution’ of e to G is given by I(G, e) := C(G) − C(G\e). A positive information contribution corresponds to information loss and a negative contribution to information gain. Here we wish to find the subset FE such that the removal of the edges in F disconnects G into N components and minimizes the loss of information among all subsets of edges, that is the subset such that I(G, F) ≤ I(G, S) for all SE. Algorithm (1) in Supplementary Information allows us to obtain the subgraph (V, E\F) subject to the above conditions. The desired subset of edges is then given by F = E(G)\E(DECONVOLVE(G, N)).

The only parameter that the algorithm (see algorithm (1) in Supplementary Information) requires is N, the maximum number of components into which an object will be decomposed. However, there is a natural way to find the optimal terminating step, and therefore the number of maximum possible components that minimize the sum of the lengths of the candidate generating mechanisms, making the algorithm truly parameter-free, as it is not required to have a preset number of desired components. We provide the pseudo-code (in algorithm (2) of the Supplementary Information) that determines the optimal number of components; it requires no N parameter.

Clearly, if components s1 and s2 have the same algorithmic complexity, this does not immediately imply that s1 and s2 are generated by exactly the same generating mechanism. However, because of the exponential decay of the algorithmic probability of an increasingly random object, we know that the less random an object is, the exponentially more likely it is that the underlying mechanism will be the same. This is because there are exponentially fewer short (far from randomness) programs than long ones (see Supplementary Fig. 9c). For example, in the extreme case of fully connected graphs, we see that the complete graph denoted by Kn, with n the node count, has the smallest possible algorithmic complexity growth as a function of n, that is, ~c + log[n], where c is the length of the shortest computer program that implements the program that produces the matrix of size n × n with all 1 entries representing the complete graph. If |C(s1) − C(s2)| ≈ log[n], and s1 and s2 are connected graphs, then s1 and s2 are with high probability produced by the same algorithm. Conversely, if |C(s1) − C(s2)| departs from log(n), then the likelihood of being generated by the same algorithm exponentially vanishes. So the information regarding both the algorithmic complexity of the components and their relative size sheds light on the candidate generating mechanisms.

Algorithm unsupervised termination criterion

The algorithm suggests a natural terminating criterion. Let S be the object that has been produced by N mostly independent generative mechanisms. We decompose S into n parts s1, …, sn in such a way that each si, i {1 … n} has an underlying generating mechanism found by running the algorithm iteratively for increasing n. But after each iteration we calculate the minimum of the differences in algorithmic complexity among all components. The algorithm should then stop when the number of components is exactly N and the sum of the lengths—the estimated algorithmic complexity—of each of the programs diverges from the expected log[N]), because the length of the individual causal mechanisms producing each new component will be breaking down a component that could previously be explained by the causal mechanism at an earlier iteration of the algorithm. An implementation of this idea for a graph is shown in algorithm (2) in Supplementary Information).

As a trivial example, let us take the string 1n, where Sn means that the pattern S is repeated n times. After application of the algorithm, the terminating criterion will suggest that 1n cannot be broken down into smaller segments, each with a different causal generating mechanism, the sum of whose total lengths will be shorter than the length of the generating mechanism producing 1n itself. This is because the sum of the length of the shortest programs \({\sum }_{i}\left|{p}_{i}\right|\) running on a universal Turing machine generating segments of 1n of length mi < n each, such that the concatenation i=1 pi = 1n will be strictly greater than C(1n), given that each pi halting criterion will require i log mi bits more than C(1n).

Application to intertwined computer programs

A cellular automaton is a computer program that applies in parallel a global rule composed of local rules on a tape of cells with symbols (for example, binary). Thoroughly studied in ref. 1, elementary cellular automata (or ECA) are one-dimensional cellular automata that take into consideration in their local rules the cell next to the centre and the centre cell. For technical details see ‘Cellular automata’ section in the Supplementary Information.

Cellular automata offer an optimal testbed for our purposes because they are discrete dynamical systems able to illustrate their operation in a visual fashion. A cellular automaton can be interpreted as a one-dimensional object that produces a highly integrated two-dimensional image whose rows are causally connected and are thus ideal testing cases. Clearly, the deconvolution algorithms can be adapted to any object. In this case, instead of edge removal we apply row removal to the evolution of interacting cellular automata (see ‘Cellular automata’ section in the Supplementary Information for technical details). In what follows we perform experiments using interacting programs such as cellular automata as examples to illustrate the deconvolution algorithms. Interacting programs need to define how the interaction happens (see section ‘Interacting CA’ in Supplementary Information).

Behind our deconvolution methods is the idea that we can find a set of small computer rules or programs able to reconstruct a piece of data. Figure 1c,d illustrates this. In the deconvolution of a string generated by two different mechanisms (Fig. 1a,b) and thus in two different regimes (random versus non-random), computer programs such as those in Fig. 1c,d help to deconvolute the string. We then do the same in all other cases, but extending the number of degrees of freedom of a Turing machine tape. We note that our methods, as illustrated in Fig. 1a,b, are invariant to production or representation direction, given that the algorithmic probability and its reversal (and the entire set of computable transformations) are the same up to a small constant (the length of the computable transformation)42.

Decomposition of sequences and space–time diagrams

We used different programs to produce different parts of a string, that is, a program p to generate segment s1 and a program p′ to generate segment s2 put next to each other. Clearly, the string has been generated by two generating mechanisms (p and p′). Next we used the algorithm to deconvolve the string and find the number of generating mechanisms and most likely model mechanisms (the programs themselves), inducing a form of ‘algorithmic partition’ based on the likelihood of each segment being produced by different generating mechanisms.

Figure 1a–e illustrates how strings that have short generating mechanisms are significantly and consistently more sensitive to perturbations. The resulting string is 0101010101010101010101010101010101010101010101010101110100101010100000001001100111100110000011100110, with the bolding corresponding to the parts suggested by the different regimes, according to their algorithmic contribution and the segment’s resilience in the face of perturbations (by deletion and replacement) of the original string. Behind every real number providing an estimation of the algorithmic complexity of a string or object, our methods also provide a set of generating computer programs able to produce such object when using our algorithmic-probability-based measure BDM.

Not only did we find the correct number of mechanisms (Fig. 1a,b), but also we found the candidate programs (which for this trivial example are exactly the original) that generate each segment (Fig. 1c–e) by seeking the shortest computer programs in a bottom-up approach27,28. Finding the shortest programs is, however, secondary, because we only care about the explanatory powers that different programs have to explain the data in full or in part, pinpointing the different causal nature of the segments and helping in the deconvolution of the original observation.

Figure 1c–e depicts the computer program (a non-terminating Turing machine) that is found when calculating the BDM of the 01n string. The BDM approximation to the algorithmic complexity of any 01n string is thus the number of small computer programs that are found capable of generating the same string or, conversely (via the algorithmic coding theorem, see refs. 39,43), the length of the shortest program producing the string. For example, the string 01n was trivially found to be generated by a large number of small computer programs (Fig. 1c,d depicts a non-terminating Turing machine with E its output) using our algorithmic methods (as opposed to, for example, using lossless compression, which would only obfuscate the possible generating model) with only two rules out of 2 × 2 rules for the size of Turing machine with only two states and two symbols and no more, and thus of very low algorithmic complexity compared to, for example, generating a random-looking string that would require a more complex (longer) computer program. The computer program of a truly random string will grow in proportion to the length of the random string, but for a low-complexity string such as 01n, repeated any number of times n, the length of the computer program is of (almost) fixed size, growing only by log(n) if the computer program is required to stop after n iterations. In this case, 01n is a trivial example with a strong statistical regularity whose low complexity could be captured by applying Shannon entropy alone on blocks of size 2.

Cellular automaton two-dimensional deconvolution

Figure 1f–g illustrates how the algorithm can separate regions produced by generating mechanisms of different algorithmic information content by observing their space–time dynamics, thereby contributing to the deconvolution of regions that are produced by different generating mechanisms. In this example, both programs are sufficiently robust to not break down (see section ‘Robustness and limitations’ in Supplementary Information) when they interact with each other, with rule 110 prevailing over 255. Yet, in the general case, it is not always easy to tell these mechanisms apart. In more sophisticated examples, such as in Fig. 2d,e, we see how the algorithm can break down contiguous regions separating an object into two major components, corresponding to the different generating computer programs that are intertwined and actively interacting with each other. The experiment was repeated 20 times using programs with differing qualitative (for example, Wolfram class) behaviour.

Figure 2f demonstrates that perturbations of regions in red have a more random effect after their application and are thus by themselves less algorithmically random. When regions are of the same algorithmic complexity they are likely to be generated by similar algorithms, that is, from algorithms that are of similar minimal length. The removal of pixels in the blue regions moves the interacting system away from randomness and the pixels are themselves more algorithmically random. Blue structures on the left-hand side correspond to large triangles occurring in ECA rule 110 that are usually used to compute and transfer information in the form of particles.

However, triangular patterns transfer information in a limited way because their light cone of influence reduces at the greatest possible speed of the automaton, and they are assigned an absolute neutral information value. Absolute neutral values are those closest to 0. Once separated, the two regions have clearly different algorithmic characteristics given by their causal perturbation sensitivity, with the right-hand side being more sensitive to both random and non-random perturbations. Moreover, Fig. 2f shows results compatible with the theoretical expectation and findings in ref. 1, where a measure of reprogrammability associated with the number and magnitude of elements that can move a dynamical system towards or away from randomness was introduced and shown to be related to fundamental properties of the attractor space of the system.

Figure 2c–f shows that by iterating the deconvolution algorithm not only do the two main components of the image correspond to the two generating ECA rules, but a second application of the algorithm would produce a third or more components corresponding to further resilient features generated by the rules, which can be considered rules themselves within a smaller rule (state/symbol) space. However, in the deconvolved observations, the interacting rule determining how two or more rules may interact effectively constitutes a third global rule to which the algorithm has no direct access, or an apparent region in the observed window.

In the case of Fig. 2, the terminating criterion retrieves N = 3 components from the two interacting ECA (rule 60 and 110). This does not contradict the fact that we started from two generating mechanisms, because there are three clear regimes that are in fact likely to be reproducible by three different generating mechanisms, as suggested by the deconvolution algorithm itself, and as found in ref. 44, where it has been shown that rule 110 can be emulated by the composition of two simpler ECA rules (rules 51 and 118). As seen in Fig. 2, among the possible causal partitions, N = 2 successfully deconvolves ECA rule 60 from rule 110 on the first run, with a greater difference than the difference found between N = 3 components when breaking down rule 110 into its two different regimes.

Unsupervised network deconvolution

Clustering can usually be viewed as solving a problem which has an underlying tree structure according to some measure of interest. One way to think of optimal classification is to discover a tree structure at some level of depth, with tree leaves closer to each other when such objects have a common or similar causal mechanism and for which no feature of interest has been selected. Figure 3 illustrates how the algorithm may partition data, in this case starting from a trivial example that breaks down complete K-ary trees. Traditionally, both partitioning and clustering are induced by an arbitrary distance measure of interest that determines the connections in a tree, with elements closer to a cluster centre connected by edges. The algorithm breaks down the trees (see Fig. 3) into as many components as desired by iterating over remaining elements if required until the number of desired components is obtained or the terminating criterion is applied (see subsection ‘Algorithm unsupervised termination criterion’). Figure 3a,b provides examples illustrating how to maximize topological symmetry. The algorithm can be applied, without loss of generalization, to any non-trivial graph, as in Fig. 3c,d, or to any dataset for that matter.

Figure 4 illustrates the algorithm and terminating criterion starting from an artificial graph composed of several graphs (two simple and one that is Erdős-Rényi random: a small Erdős-Rényi-random graph connected to a star graph and to a complete graph). The graph can be successfully decomposed by algorithmic probability (see Fig. 4d) by identifying the likelihood of an edge being produced by the same mechanism by virtue of being close to each other in the information contribution (which theoretically should be removed by only log(2) if it follows the normal evolution of the same process), hence what we call causal separation/partition and clustering. Figure 4d, with the broken components that were found above log(2) + ε, also shows the distribution of edges coloured by graph membership, perfectly corresponding to the different subgraphs that were used to compose the original graph in Fig. 4a.

The same task using classical information theory (Shannon entropy) is shown not to be sensitive enough, and a popular lossless compression algorithm (compression based on LZW) provided a noisy approximation of the results obtained by using the block decomposition method, as defined in ref. 39 (see section ‘Other methods and measures’ in Supplementary Information for details).

Figure 3c–e illustrates how randomly connected graphs with different topologies can be broken down into their respective generative mechanisms. Figure 3c is a complete graph of size 20 randomly connected by three edges to a scale-free graph of size 100. The graphs are generated by different mechanisms. One is a small program that, given a number N of nodes, produces a graph with all nodes connected to all other N − 1 nodes and has a program of short length that grows only by log N11. The scale-free network is generated by the canonical preferential attachment algorithm with two edges per node and requires a slightly longer algorithm that grows by log N + c (ref. 11), where c is a small constant accounting for the pseudo-random choice of attachment nodes. The algorithm breaks the graphs into two components, each of which corresponds to the graphs with different degree distributions (depicted below each case) associated with its generating mechanism. This is because \(\left|P\left({G}_{1}\right)\right|\) + \(\left|P\left({G}_{2}\right)\right|\) + … + \(\left|P\left({G}_{n}\right)\right|\) + \(\left|P\left({e}_{{G}_{i}}\right)\right|\) > \(\left|P\left({G}_{1}{G}_{2}\ldots {G}_{n}\right)\right|\) for any Gi, where \({e}_{{G}_{i}}\) is the set of edges randomly connecting Gi to Gj for any i and j for all G of low algorithmic complexity.

Figure 3d illustrates a case similar to that in Fig. 3c, but instead of a complete graph, an Erdős-Rényi graph with edge density 0.5 is produced and connected by three random edges to a scale-free network produced in the same fashion as in Fig. 3c. Again, the algorithm was able to break it down into the two corresponding subgraphs.

Current approaches to machine and deep learning are45 ill-equipped to deal with inductive inference, explanation and causation. Our methods are different from those used in other approaches (even those based on lossless compression algorithms) to estimate algorithmic complexity, and in particular, those from classical information theory and other statistical traditions. The methods introduced here promote the use of techniques from causal and perturbation analysis14 complemented by universal principles drawn from the theory of computability and algorithmic complexity.

Comparisons to other methods indicate that our approach is accurate and sensitive even in simplified form based on single-pixel perturbation (as opposed to, for example, full subset perturbations). This means there is also a lot of room for improvement and further exploration of other applications and other areas based on the same principles.

Our approach contributes to the discussion on ways to teach machine learning cause and effect, as recently called for by Pearl45 so as to depart from traditional statistical approaches in machine learning. We strongly believe (just as did Minsky46) that moving in the algorithmic direction and introducing symbolic computation complemented by previous achievements in the area of causation such as counterfactuals and the perturbation analysis introduced by Pearl et al.14, combined with the combinatorial power of current approaches to deep learning, is the way forward in addressing the challenge of causation in machine learning. Approaches like ours open up the possibility of designing more elaborate machine learning techniques with complementary and better equipped abilities grounded on completely different first principles.

Code availability

A basic online implementation is available at Implementations in R and the Wolfram Language are available at and Some functions are also included in the Supplementary Information (Supplementary Methods).

The data that support the plots within this paper are available from the corresponding author upon request.

H.Z. was supported by Swedish Research Council (Vetenskapsrådet) grant number 2015-05299. J.T. was supported by the King Abdullah University of Science and Technology.