LinkedIn is a social network with petabytes of data.
To store that data, LinkedIn distributes and replicates it across a large cluster of machines running the Hadoop Distributed File System (HDFS). To run calculations across this large data set, LinkedIn splits the computation into MapReduce-style jobs.
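The MapReduce pattern mentioned above can be illustrated with a minimal sketch (not LinkedIn's actual jobs, and run locally rather than on a Hadoop cluster): a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The classic example is a word count:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group of values into a single result per key.
    return {key: sum(values) for key, values in groups.items()}

docs = ["data engineering at scale", "data pipelines at scale"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["data"])  # prints 2
```

In a real Hadoop deployment, the map and reduce phases run in parallel across many machines, with the framework handling the shuffle over the network; that parallelism is what lets the same pattern scale to petabytes.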
LinkedIn has been developing its data infrastructure since the early days of the Hadoop ecosystem. LinkedIn started using Hadoop in 2008, and in the last 11 years, the company has adopted streaming frameworks, distributed databases, and newer execution runtimes like Apache Spark.
With the popularization of machine learning, there are more applications for data engineering than ever before. But the current state of data engineering tooling still makes it hard for developers to find data sets, clean their data, and build reliable models.
Carl Steinbach is an engineer at LinkedIn working on tools for data engineering. In today’s episode, Carl discusses the data platform inside LinkedIn, and the strategies that the company has developed around storing and computing large amounts of data.
Full disclosure: LinkedIn is a sponsor of Software Engineering Daily.
Check out our active projects:
- We are hiring a head of growth. If you like Software Engineering Daily and consider yourself competent in sales, marketing, and strategy, send me an email: email@example.com
- FindCollabs is a place to build open source software.
- The SEDaily app for iOS and Android includes all 1000 of our old episodes, as well as related links, greatest hits, and topics. Subscribe for ad-free episodes.
Transcript provided by We Edit Podcasts. Software Engineering Daily listeners can go to weeditpodcasts.com/sed to get 20% off the first two months of audio editing and transcription services. Thanks to We Edit Podcasts for partnering with SE Daily. Please click here to view this show’s transcript.