One of the great ironies of the “big data” revolution is the way in which so much of the insight we draw from these massive datasets actually comes from small samples not much larger than the datasets we have always worked with. A social media analysis might begin with a trillion tweets, use a keyword search to reduce that number to a hundred million tweets and then use a random sample of just 1,000 tweets to generate the final result presented to the user. As our datasets get ever larger, the algorithms and computing environments we use to analyze them have not grown accordingly, leaving our results less and less representative even as we have more and more data at our fingertips. What does this mean for the future of “big data”?
Stepping back from all of the hype and hyperbole, there is considerable truth to the statement that we live in an era in which data is valued sufficiently that we believe it worth the time and expense to collect, store and analyze it at scales that significantly exceed those of the past.
It is just as true that many of our beliefs regarding the size of the datasets we use are absolutely wrong. In particular, many of the vanguards of the big data era that we hold up as benchmarks of just what it means to work with “big data,” like Facebook and Twitter, are actually vastly smaller than we have been led to believe.
In many ways, much of the size of the “big data” revolution exists only in our imaginations, aided by the reality distortion field of the big web companies that tout their enormous scale without actually releasing the hard numbers that might lead us to call those claims into question.
Most troublesome of all, however, is the way in which we analyze the data we have.
Computing power continues to increase and in today’s world of GPUs, TPUs, FPGAs and all manner of other accelerators, we have no shortage of hardware on which to run our analyses. The problem is that despite nearly unfathomable amounts of hardware humming away in the data centers of the big cloud companies, we are still just as hardware constrained as we have always been.
We may have vast amounts of hardware, but the datasets we wish to analyze are even larger.
It is a truth of the modern computing era that it is far cheaper to collect and store data than it is to analyze it. In many ways this is an ironic reversal from just a decade ago, when many large scientific computing codes would recompute intermediate results from scratch rather than store them to disk, because it was faster to compute a result again than to load it from a large direct-attached HPC storage fabric.
Where CPU once outpaced disk, today it is the opposite. Storage is now so cheap that keeping multiple copies of a petabyte across multiple geographically distributed datacenters for maximum redundancy costs less than $10,000 a month in the cloud. Storing a single copy in an on-premises JBOD costs less than $25,000 worth of NAS-grade drives and prices continue to fall.
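As a back-of-envelope check on those figures (the per-GB cloud price and the drive capacity and price below are illustrative assumptions, not vendor quotes):

```python
# Rough storage-cost arithmetic under assumed prices (illustrative only).
PB_IN_GB = 1_000_000

# Assumed geo-redundant cloud storage tier at $0.01 per GB per month.
cloud_monthly = PB_IN_GB * 0.01          # $10,000 per month for a petabyte

# Assumed NAS-grade drives: 14TB capacity at $350 each.
drives_needed = -(-1_000 // 14)          # ceil(1,000 TB / 14 TB) = 72 drives
on_prem_capex = drives_needed * 350      # $25,200, a one-time cost
```

Under those assumptions the numbers land right around the figures above; the exact prices matter less than the order of magnitude.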
The ratio between the size of our data and the computing power we have to process that data, however, has not changed remarkably.
My IBM PS/2 Model 55SX desktop in 1990 had a 16MHz CPU, 2MB of RAM and a 30MB hard drive. That’s roughly one Hz of processor power for every 1.87 bytes of hard drive space.
Fast forward almost 30 years to 2019. A typical commercial cloud VM “core” is a hyperthread of a 2.5GHz to 3.5GHz (turbo) physical processor core with 2GB-14GB of RAM per core and effectively infinite storage via cloud storage, though local disk might be limited to 64TB or less. Using the same CPU-to-disk ratio as in 1990, that 3.5GHz turbo hyperthread would be paired with 6.6GB of disk (actually 3.3GB of disk, taking into account the fact that a hyperthread is really only half of a physical core).
Such a comparison isn’t quite fair given the capabilities of newer CPUs and the vastly improved speeds of modern disk systems, but it is worth pointing out just how far the ratio between CPU power and disk storage has slipped over the past three decades.
Even when it comes to data transfer rates, little has changed in those 30 years. The disk transfer rate on the 1989 IBM PS/2 Model 55SX was rated at 7.5Mbps (around 0.94MB/s). Dividing the CPU’s 16MHz clock speed by that 0.94MB/s works out to around 17 clock cycles per byte transferred. Fast forward to today and a typical cloud VM might max out at around 180MB/s maximum sustained read from standard persistent disk. A 3.5GHz turbo hyperthread divided by 180MB/s works out to around 19 clock cycles per byte, roughly the same ratio of compute to disk throughput almost 30 years later.
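The cycles-per-byte arithmetic can be checked directly:

```python
# 1989: IBM PS/2 Model 55SX -- 16MHz CPU, 7.5Mbps disk transfer rate.
rate_1989 = 7.5e6 / 8                     # ~0.94 MB/s, in bytes per second
cycles_per_byte_1989 = 16e6 / rate_1989   # ~17 clock cycles per byte read

# 2019: 3.5GHz turbo hyperthread, ~180MB/s sustained persistent-disk read.
rate_2019 = 180e6
cycles_per_byte_2019 = 3.5e9 / rate_2019  # ~19 clock cycles per byte read
```

Three decades of progress moved the needle from roughly 17 to roughly 19 cycles per byte.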
Of course, there are many other factors that influence the speed with which data from disk can actually be utilized by a CPU in practice, but no matter how you slice it, in terms of the ratio between the performance of our storage systems and the performance of our CPU systems, we haven’t really undergone a transformative shift in the decades since I sat at my first desktop. Moreover, the virtualization of the modern cloud exacts a steep cost that slows things even further.
Modern clouds address this differential through data parallelism, sharding data across many virtual cores. Yet to maintain the same ratio of CPU to disk, processing a petabyte with 3.5GHz virtual cores in the modern cloud would require a cluster of 152,381 VM cores, assuming perfect linear scalability. A 10PB dataset would require 1.5 million VM cores. In reality, communications overhead and the limits of hardware scaling mean the number of cores required would be considerably higher than these theoretical figures.
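The cluster sizing follows directly from the 1990 disk-to-CPU ratio; a quick sketch:

```python
# Disk bytes per Hz of CPU on the 1990 PS/2: 30MB drive / 16MHz CPU.
bytes_per_hz = 30e6 / 16e6                 # 1.875 bytes of disk per Hz

# A 3.5GHz core at the same ratio "covers" about 6.56GB of disk.
disk_per_core = 3.5e9 * bytes_per_hz

cores_per_pb = 1e15 / disk_per_core        # ~152,381 cores for 1PB
cores_per_10pb = 10e15 / disk_per_core     # ~1.52 million cores for 10PB
```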
Amazingly, the CPU cost for all those cores would only be a few hundred dollars to a few thousand dollars per hour of analysis and a few thousand dollars to store the petabyte for a week (not counting the time it takes to upload or ship).
Moreover, platforms like Google’s cloud make it trivial to scale across large datasets, with BigQuery able to table scan a full petabyte in under 3.7 minutes with one line of SQL, no programming or manual data sharding required.
We obviously aren’t lacking for the hardware to process petabytes or even tens or hundreds of petabytes.
In fact, BigQuery even offers fixed rate pricing starting at $40,000 per 2,000 slots that allows companies to perform unlimited queries over unlimited volumes of data for the same fixed monthly cost. A company’s data science division could query multi-petabyte datasets non-stop 24 hours a day for the same fixed cost, removing cost as a limiting factor in performing absolute population-scale analyses.
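To see why flat-rate pricing removes cost as a limiting factor, compare it to per-query pricing (the $5-per-TB-scanned on-demand rate below is an assumption based on BigQuery’s historical pricing and may not match current rates):

```python
# Assumed rates: $40,000/month for 2,000 slots (flat rate) and
# $5 per TB scanned (on demand). Illustrative, not current pricing.
flat_rate_monthly = 40_000
on_demand_per_tb = 5

# Scanning more than this per month makes the flat rate the cheaper option.
breakeven_tb_per_month = flat_rate_monthly / on_demand_per_tb  # 8,000 TB = 8PB
```

Past roughly 8PB scanned per month under these assumed rates, every additional query is effectively free.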
The problem is that when it comes to big data analyses there seems to be a tremendous gulf between the companies performing population-scale analyses using tools like BigQuery that analyze the totality of their datasets and return absolute results and the rest of the “big data” world in which estimations and random samples seem to dominate.
Estimations are particularly prevalent in spaces like social media and behavioral analysis.
In a world in which all it takes is a single line of SQL to analyze tens of petabytes with absolute accuracy, why is estimation so popular?
Partially the answer is our preoccupation with speed over accuracy. Why wait minutes for petascale analyses when you can wait seconds for a random sample that may or may not bear even the slightest resemblance to reality?
In turn, our ability to tolerate immense error in our “big data” results often comes from the fact that the consequences for bad results are minimal in many “big data” domains. An analysis of the most influential Twitter users by month for a given topic over the last few years doesn’t really have a “right” answer against which an estimation can be compared. Moreover, even if the results are entirely wrong, the consequences are minimal.
Partially the answer is that our more complex algorithms have not kept pace with the size of the datasets we work with today. Many of the analytic algorithms of greatest interest to data scientists were born in the era of small data and have yet to be modernized for the size of data we wish to apply them to today. Few mapping platforms can perform spatial clustering on billions of points, while graph layout algorithms struggle to scale beyond millions of edges.
In short, the volume of data available to us has outpaced the ability of our algorithms to make sense of it. In the past we had low volumes of very rich data. Today we have high volumes of very poor data, meaning we have to process much more data to achieve the same results.
When our algorithms aren’t scalable, we are left processing the same amount of data as before, but that sample is less and less representative of the whole.
A classical SNA graph analysis of the past might involve a network with a few thousand or tens of thousands of edges, which could be visualized in its entirety even by the software of the day. Today we can easily obtain graphs in the billions of edges, but our most common graph visualization tools struggle to produce usable results in reasonable time with anything more than a few hundred thousand edges.
That means that while a visualization of the past might have reflected the entire graph, today that visualization is more likely to represent just a minuscule fraction of the full dataset. A 10,000-edge graph visualized in Gephi shows the totality of its structure. A 10 billion-edge graph sampled down to 10,000 edges represents just 0.0001% of the total graph.
It isn’t just a question of speed. Many of our most heavily used algorithms were never designed to work with large datasets even when speed is not of concern. Few graph layout algorithms can do much to extract the structure of a dense trillion-edge graph even if given limitless computing power and as much time as they need to complete.
In short, we are confronted with the paradox that the more data we have, the less representative our findings are due to the need for sampling.
Platforms like BigQuery are slowly changing the algorithmic side of that equation. As they move beyond their reporting roots towards providing high level analyses like geospatial analytics, they are beginning to externalize the kind of massive algorithmic scalability that Google uses internally to bring more and more algorithms into the scalable cloud era. As these trends progress, algorithmic scalability will steadily become less of a limiting factor for common use cases.
That leaves the mindset factor.
We need to move beyond the idea that it is acceptable to perform the “sampling in secret” used by some social media analytics platforms in which certain displays quietly rely on sampling even while they tell their users they perform absolute counts.
Sampling can be useful in certain circumstances. However, we need to understand precisely when and where it is used, the size of the sample and how that sample was selected. Understanding that a visualization was based on a random sample of just 1,000 out of 100 million matching tweets might give us pause as to how representative its results truly are. Moreover, if that “random sample” is actually constructed by randomly selecting 100 dates and taking the first 10 tweets from each date in chronological order, it may yield very different findings than sampling uniformly at the tweet level. Having visibility into these methodological decisions is absolutely critical.
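The difference between those two sampling strategies is easy to see in a toy simulation (the corpus size, date range and per-day tweet counts below are made up purely for illustration):

```python
import random

random.seed(42)

# Hypothetical corpus: 365 days x 1,000 tweets per day. Each tweet is
# tagged with its position within the day (0 = first tweet of the day).
corpus = [(day, pos) for day in range(365) for pos in range(1000)]

# Strategy A: a uniform random sample of 1,000 tweets.
uniform = random.sample(corpus, 1000)

# Strategy B: randomly pick 100 dates, then take the first 10 tweets
# of each date in chronological order.
days = random.sample(range(365), 100)
date_based = [(day, pos) for day in days for pos in range(10)]

avg_pos_uniform = sum(pos for _, pos in uniform) / len(uniform)
avg_pos_date = sum(pos for _, pos in date_based) / len(date_based)

# The date-based sample only ever sees positions 0-9 within each day,
# so anything that varies over the course of a day is systematically
# excluded, while the uniform sample spans the whole day.
```

Both samples contain exactly 1,000 tweets, yet the date-based one can never observe 99% of each selected day.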
Perhaps the biggest issue is that we need to stop treating “big data” as a marketing gimmick.
As companies have begun to market themselves in terms of the size of the datasets they hold and analyze, we’ve let go of the idea of actually understanding anything about the data and algorithms we’re using.
Even the most rigorous data scientists freely accept the idea of reporting trends from Twitter without having any idea just what the full trillion-tweet archive from which they are working actually looks like. In fact, few data scientists that work with Twitter even know there have been just over one trillion tweets sent.
Understanding your denominator used to be something that was considered sacrosanct to data science. Somehow, we’ve reached a point where we just accept reporting findings from analyses where we no longer have a denominator.
Once we start treating “big data” analysis as a methodologically, algorithmically and statistically rigorous process based on well understood data, we recognize the need for population-scale analyses with guaranteed correctness. We turn to platforms like BigQuery and its ilk to run our analyses and focus on accuracy and completeness. We modernize the algorithms we need. We relentlessly test our results.
Most importantly, we restore the denominator to our workflows.
Putting this all together, it is ironic that in a world drowning in data, we have increasingly turned to sampling to render our “big data” small again. As our datasets have grown in size, we have sampled them ever more aggressively to keep the actual amount of data fed to our analyses and algorithms roughly the same. Whereas in years past our analyses might have been based on the totality of a dataset, today that same analysis might consider just one ten thousandth of one percent of that dataset. As we have ever more data, our results are becoming ever less representative.
In short, our vaunted “big data” revolution has actually resulted in less understanding of our world through less representative data than ever before.
In the end, we need to stop treating “big data” as a marketing slogan and bring ourselves back to the era where we actually cared about the quality of our data analyses before we undermine the public's trust in the power of data.