Machine learning is probably the most important fundamental trend in technology today. Since the foundation of machine learning is data - lots and lots of data - it’s quite common to hear the concern that companies that already have lots of data will get even stronger. There is some truth to this, but in fairly narrow ways, and meanwhile ML is also seeing much diffusion of capability - there may be as much decentralization as centralization.
First, what does it mean to say that machine learning is about data? Due to the academic culture that ML comes from, pretty much all of the primary science is published as soon as it’s created - almost everything new is a paper that you can read and build with. But what do you build? Well, in the past, if a software engineer wanted to create a system to recognise something, they would write logical steps (‘rules’). To recognise a cat in a picture, you would write rules to find edges, fur, legs, eyes, pointed ears and so on, and bolt them all together and hope it worked. The trouble was that though this works in theory, in practice it’s rather like trying to make a mechanical horse - it’s theoretically possible, but the degree of complexity required is impractical. We can’t actually describe all of the logical steps we use to walk, or to recognise a cat. With machine learning, instead of writing rules, you give examples (lots of examples) to a statistical engine, and that engine generates a model that can tell the difference. You give it 100,000 pictures labelled ‘cat’ and 100,000 labelled ‘no cat’ and the machine works out the difference. ML replaces hand-written logical steps with automatically determined patterns in data, and works much better for a very broad class of question - the easy demos are in computer vision, language and speech, but the use cases are much broader. Quite how much data you need is a moving target: there are research paths to allow ML to work with much smaller data sets, but for now, (much) more data is almost always better.
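The rules-versus-examples distinction can be sketched in a few lines of code. This is a deliberately toy illustration, not a real vision system: instead of hand-writing a decision rule, a trivial "statistical engine" recovers the decision boundary from labelled examples, which is the same shape of process as training on 100,000 labelled pictures.

```python
# Toy sketch of 'learning from examples' rather than writing rules.
# The single feature value stands in for a picture; the threshold the
# code recovers stands in for the model a real ML system would produce.

def learn_threshold(examples):
    """Learn a 1-D decision threshold from (value, label) pairs.

    Returns the midpoint between the highest 'no cat' example and the
    lowest 'cat' example - a minimal stand-in for what a statistical
    engine does with a large labelled data set.
    """
    positives = [v for v, label in examples if label == "cat"]
    negatives = [v for v, label in examples if label == "no cat"]
    return (max(negatives) + min(positives)) / 2

# Labelled training data stands in for the labelled pictures.
training = [(0.9, "cat"), (0.8, "cat"), (0.2, "no cat"), (0.1, "no cat")]
threshold = learn_threshold(training)

def predict(value):
    """Classify a new example using the learned threshold, not a hand-written rule."""
    return "cat" if value > threshold else "no cat"
```

The point of the sketch is that nobody wrote the rule `value > 0.5`; it fell out of the examples, and with different examples you would get a different model.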
Hence the question: if ML lets you do new and important things and ML is better the more data you have, then how far does that mean that companies that are already big and have lots of data get stronger? How far are there winner-takes-all effects? It is easy to imagine virtuous circles strengthening a winner: ‘more data = more accurate model = better product = more users = more data’. From here it’s an easy step to statements like ‘Google / Facebook / Amazon have all the data‘ or indeed ‘China has all the data’ - the fear that the strongest tech companies will get stronger, as will countries with large populations and ‘permissive’ attitudes to centralised use of data.
Well, sort of.
First, though you need a lot of data for machine learning, the data you use is very specific to the problem that you’re trying to solve. GE has lots of telemetry data from gas turbines, Google has lots of search data, and Amex has lots of credit card fraud data. You can’t use the turbine data as examples to spot fraudulent transactions, and you can’t use web searches to spot gas turbines that are about to fail. That is, ML is a generalizable technology - you can use it for fraud detection or face recognition - but applications that you build with it are not generalized. Each thing you build can only do one thing. This is much the same as all previous waves of automation: just as a washing machine can only wash clothes and not wash dishes or cook a meal, and a chess program cannot do your taxes, a machine learning translation system cannot recognise cats. Both the applications you build and the data sets you need are very specific to the task that you’re trying to solve (though again, this is a moving target and there is research to try to make learning more transferable across different data sets).
This means that the implementation of machine learning will be very widely distributed. Google will not ‘have all of the data’ - Google will have all of the Google data. Google will have more relevant search results, GE will have better engine telemetry and Vodafone will have better analysis of call patterns and network planning, and those are all different things built by different companies. Google gets better at being Google, but this does not mean it somehow gets good at anything else.
Next, one could argue that this just means the larger companies in each industry get stronger - Vodafone, GE and Amex each have ‘all the data’ for whatever it is that they do and so that forms a moat against their competition. But here again, it’s more complex: there are all sorts of interesting questions about who exactly owns the data, how unique it is and at what levels it’s unique, and where the right point of aggregation and analysis might be.
So: as an industrial company, do you keep your own data and build the ML systems to analyse it (or pay a contractor to do this for you)? Do you buy a finished product from a vendor that’s already trained on other people’s data? Do you co-mingle your data into that, or into the training derived from it? Does the vendor even need your data or do they already have enough? The answer will be different in different parts of your business, in different industries and for different use cases.
To come at this from the other end, if you’re creating a company to deploy ML to solve a real-world problem, there are two basic data questions: how do you get your first data to train your models to get your first customer, and how much data do you actually need? Of course, the second question breaks down into lots of questions: is the problem solved with a relatively small amount of data that you can get fairly easily (but many competitors can get), or do you need far more, hard-to-get data, and if so is there a network effect to benefit from, and so a winner takes all dynamic? Does the product get better with more data indefinitely, or is there an S curve?
Some data is unique to the business or product or gives a strong proprietary advantage. GE engine telemetry might not be much use for analysing Rolls Royce engines, but if it is, they won’t share it. This might be an opportunity for company creation, but it is also a place where lots of internal big-company IT and contractor projects happen.
Some data will apply to a use case that is found in many companies or even many industries. ‘There is something odd about this call’ might be a common analysis across all credit card companies - ‘the customer sounds angry’ might apply to anyone with a call centre. This is the ‘co-mingling’ question. Lots of companies are being created here to solve problems across many companies or indeed across different industries, and there are network effects in data here.
But there will also be cases where, after a certain point, the vendor doesn’t really even need each incremental customer’s data - the product is already working.
In practice, as machine learning diffuses into almost everything, one startup might see several of these. Our portfolio company Everlaw produces legal discovery software: if you sue someone and they send you a truck full of paper, this helps. Machine learning means they will be able to do sentiment analysis on a million emails (‘show me anxious emails’), without needing to train that model on the data from your case, because the examples of sentiment to train that model don’t need to come from this particular lawsuit (or any lawsuit). Conversely, they can also do cluster analysis (‘show me emails that are about the same thing as this’) on your data without that going anywhere else. Drishti, another portfolio company, uses computer vision to instrument and analyse production lines - some of those capabilities are trained on your data and some are not specific to your business at all and work across industries.
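The ‘cluster analysis on your data without that going anywhere else’ idea can be sketched as a simple nearest-neighbour search over bag-of-words vectors. This is an illustration of similarity analysis in general, not Everlaw’s actual implementation; the email texts and function names are invented for the example. The relevant property is that it runs entirely on your own documents - nothing here needs training data from anyone else’s case.

```python
# Hedged sketch of 'show me emails that are about the same thing as this':
# cosine similarity over bag-of-words term counts. Illustrative only - a
# real system would use learned embeddings, but the privacy property is
# the same: the computation touches only your own documents.
from collections import Counter
from math import sqrt

def vectorise(text):
    """Bag-of-words term counts - a minimal document representation."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def most_similar(query, emails):
    """Return the email in the collection most similar to the query."""
    q = vectorise(query)
    return max(emails, key=lambda e: cosine(q, vectorise(e)))

# Invented example documents standing in for a discovery corpus.
emails = [
    "quarterly revenue forecast attached",
    "the turbine inspection is overdue",
    "updated revenue numbers for the quarter",
]
```

By contrast, the sentiment model (‘show me anxious emails’) would be trained elsewhere, on generic labelled examples, and merely applied to this corpus - the two halves of the Everlaw example map to the two halves of the co-mingling question.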
At the extreme, I recently spoke to a manufacturer of very large vehicles that’s using machine learning to get a more accurate flat tyre detector. This is trained on data (lots and lots of examples of signal from flat and not-flat tyres), obviously, but it’s not hard to get that data. This is a feature, not a moat.
Hence, I said earlier that there are two questions for an ML startup: how do you get the data and how much do you need? But those are just the technical questions: you also have to ask how you go to market, what your addressable market is, how valuable the problem you’re solving is to your customers, and so on and so on. That is, pretty soon there won’t be any ‘AI’ startups - they will be industrial process analysis companies, or legal platform companies, or sales optimization companies. Indeed, the diffusion of machine learning means not so much that Google gets stronger, but that all sorts of startups can build things with this cutting-edge science much quicker than before.
This takes me to a metaphor I’ve used elsewhere - we should compare machine learning to SQL. It’s an important building block that allowed new and important things, and will be part of everything. If you don’t use it and your competitors do, you will fall behind. Some people will create entirely new companies with this - part of Wal-Mart’s success came from using databases to manage inventory and logistics more efficiently. But today, if you started a retailer and said “…and we’re going to use databases”, that would not make you different or interesting - SQL became part of everything and then disappeared. The same will happen with machine learning.