Bill Inmon, who’s widely known as the father of the data warehouse, shared a warning for data-loving organizations at the recent Data + AI Summit: If you don’t adopt the precepts of the data lakehouse, you run the risk of your data lake turning into a data swamp.
“If you take your data lake and turn it into a lakehouse, you can actually now start to get your money’s worth out of it,” Inmon said in a conversation with Ali Ghodsi, the CEO of Databricks, which sponsored the event. “If you don’t turn it into a lakehouse, it turns into a swamp.”
Very few data lakes are successful, the “Building the Data Warehouse” author said.
“Most of them do [fail],” Inmon told Ghodsi. “Every now and then, you see one that doesn’t, but most of them do.”
The problem with data lakes is architectural, Inmon said. “From a technical standpoint, I think the data lake…is fine. There’s nothing wrong with it,” he said. “But architecturally, there are many things that are missing from the data lake. And because they were missing, it made the data lake not useless, but it made it very difficult to get information out of.”
Inmon is currently writing a book about data lakehouses. It should come as no surprise that the folks at Databricks, who originally coined the term, are helping Inmon with the book, which will be his 61st.
A lakehouse, as Databricks describes it, is a blend of a data lake and a data warehouse. On the one hand, it provides the flexibility to handle less structured data types, such as text and image files, that are commonly used in data science and machine learning projects. But it also borrows from the data warehouse discipline, particularly in terms of ensuring the quality of the data and making sure that its lineage is tracked and governed.
Perhaps it comes as no surprise that Inmon isn’t a big fan of the ELT (extract, load, and transform) data integration method that is gaining traction among data lake practitioners. Instead of first transforming the data before loading it into the data warehouse, which is the standard ETL method, ELT backers first load the data into the data lake, with the expectation that they will transform it (i.e. clean it up and prepare it for analytics or machine learning) later on.
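The difference between the two methods is purely one of ordering. A minimal sketch in Python, using plain lists as stand-ins for the source system and the warehouse (all names here are illustrative, not drawn from any particular tool):

```python
# Raw records as they might arrive from a source system.
raw_records = [
    {"name": " alice ", "signup": "2021-05-04"},
    {"name": "BOB", "signup": "2021-06-11"},
]

def transform(record):
    # The "T" step: normalize casing and strip stray whitespace.
    return {"name": record["name"].strip().title(),
            "signup": record["signup"]}

# ETL: transform first, then load -- only clean data ever
# reaches the warehouse.
warehouse = [transform(r) for r in raw_records]

# ELT: load the raw data as-is, and transform later.
# Inmon's warning is that "later" often never arrives.
data_lake = list(raw_records)
cleaned_later = [transform(r) for r in data_lake]
```

If the deferred transformation actually runs, both paths yield the same result; the risk Inmon describes is the raw, untransformed `data_lake` simply sitting there.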
“I’ve always been a fan of ETL because of the fact that ETL forces you to transform data before you put it into a form where you can work with it,” Inmon said. “But some organizations want to simply take the data, put it into a database, then do the transformation…I’ve seen too many cases where the organization says, oh we’ll just put the data in and transform it later. And guess what? Six months later, that data has never been touched.”
Some types of data, textual data in particular, are nearly impossible to load into a data warehouse (or a data lake or a data lakehouse, for that matter) using anything but ETL, Inmon said.
“Text is a different beast altogether,” he said. “I’m not a believer that you can do ELT with text. I tell you what: If you can do it, I don’t know how.”
The more structure there is to the data, the better your odds of success with ELT, because you can bring SQL to bear on it, Ghodsi said.
“With SQL, you can do a lot of the transformations actually,” Ghodsi said during his Data + AI Summit chat with Inmon. “But as you pointed out, for all these complex data types, the text and the audio and video and all these other data science workloads–it’s just very hard to express them with SQL.”
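Ghodsi's point is easy to see with tabular data. A hedged sketch using Python's built-in sqlite3 module (the table and data are made up for illustration): structured rows are loaded raw, ELT-style, and a plain SQL expression does the clean-up afterward.

```python
import sqlite3

# Load raw, messy rows first (the "L" of ELT).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_users (name TEXT, signup TEXT)")
conn.executemany("INSERT INTO raw_users VALUES (?, ?)",
                 [(" alice ", "2021-05-04"), ("BOB", "2021-06-11")])

# Transform later, entirely in SQL: trim whitespace and title-case
# the name. For tabular columns this is a one-line expression.
rows = conn.execute(
    "SELECT upper(substr(trim(name), 1, 1)) || "
    "lower(substr(trim(name), 2)) FROM raw_users"
).fetchall()
```

There is no comparably simple SQL expression for cleaning free text, audio, or video, which is exactly the gap Inmon and Ghodsi describe.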
Inmon ended his conversation with a warning.
“It’s not so much a problem of, are they going to build a data [lake] house. It’s going to be what happens if they don’t build a lakehouse,” he said. “Because if they don’t build the lakehouse, they’re going to have this mountain of data that sits there and nobody is going to be able to do anything with it…I believe that the lakehouse is going to unlock the data that is there and is going to present opportunities like we’ve never seen before. And that’s going to be the effect of creating the lakehouse.”