What are some use cases for which it would be beneficial to use Haskell, rather than R or Python, in data science? originally appeared on Quora: the place to gain and share knowledge, empowering people to learn from others and better understand the world.
Haskell is so much better as a language—more expressive, faster, safer—that we should be asking ourselves why we wouldn’t want to use Haskell.
The answer is simple: libraries.
Haskell the language is great; Haskell the ecosystem is lacking. Haskell has not traditionally been used for data science so the library selection is limited. Integration with traditional “big data” tools is limited. While core pieces—like Spark bindings—exist, they are rarely as well-supported as in more popular data science languages. The same is true of statistical libraries and data visualization. You can use Haskell for data science, but you’ll need to either write bindings to libraries in other languages (tedious at best) or implement the libraries yourself.
It’s a real shame because, as a language, Haskell is remarkably suited to data science. I’ve found Haskell encodes mathematical concepts and ideas far more directly than other languages; I’ve written a lot of “mathy” Haskell code and it often ends up looking surprisingly close to what I would have written on a whiteboard. I’ve also taught Haskell to a few of my colleagues with math/OR backgrounds, and they took to it immediately because it naturally reflected how they thought outside of programming.
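To make that "whiteboard" quality concrete, here's a toy illustration of my own (not code from any real project): the expected value of a discrete distribution, written as an almost direct transcription of the textbook definition E[X] = Σ p(x)·x.

```haskell
-- Expected value of a discrete distribution, represented as a list of
-- (probability, outcome) pairs. The code reads like the definition itself.
expectedValue :: [(Double, Double)] -> Double
expectedValue dist = sum [p * x | (p, x) <- dist]

-- A fair coin paying 0 or 10:
--   expectedValue [(0.5, 0), (0.5, 10)]  evaluates to 5.0
```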
More generally, Haskell excels at abstraction, and data science benefits from coherent abstractions as much as math or software engineering. I find that Haskell’s type system is the perfect tool for building and organizing a model of some business domain, all the way from exploratory code through production systems. The Haskell approach forces you to state your assumptions up-front and doesn’t let you ignore details by accident; however, once you’ve set things up, there’s remarkably little overhead for using your domain-specific abstractions. In a world of complicated legacy business rules and messy data, it’s invaluable for reasserting order.
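A minimal sketch of what that looks like in practice, with entirely hypothetical names (not any real domain model): an algebraic data type states every possible state of an order up front, and the compiler flags any case a function forgets to handle.

```haskell
-- Hypothetical order lifecycle; the names are illustrative only.
newtype Carrier = Carrier String deriving (Show)
newtype Reason  = Reason String deriving (Show)

data OrderStatus
  = Placed
  | Shipped Carrier
  | Delivered
  | Cancelled Reason
  deriving (Show)

-- Pattern matching must account for every constructor; with
-- -Wincomplete-patterns enabled, forgetting the Cancelled case is a
-- compile-time warning, not a runtime surprise.
isInFlight :: OrderStatus -> Bool
isInFlight status = case status of
  Placed      -> True
  Shipped _   -> True
  Delivered   -> False
  Cancelled _ -> False
```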
I know this from direct experience. I am—at least by job title—a data scientist, and I use Haskell. I work on a complicated system (Target’s supply chain) with its fair share of legacy warts and inconsistent data. I need all the help I can get, and Haskell is the language that gives me the best set of tools for managing this complexity. Haskell hasn’t been perfect (apart from libraries, we also ran into some hard-to-fix performance issues) but the overall experience has been positive. Even keeping these issues in mind, I believe Haskell has served our project better than any other language I know, regardless of libraries.
So would I recommend Haskell for data science?
Well, it depends. What problems are you solving now? What problems will you be solving a year from now? What are your bottlenecks going to be?
If you’re doing some fairly “standard” data science (is data science defined well enough to have “standard” work?), you’re best off with a “standard” data science language. The same goes if you’re primarily solving one-off problems and aren’t interested in investing in long-term infrastructure like domain-specific libraries. Sometimes that is the right tradeoff! My personal model for using Haskell versus a more standard tool is that Haskell requires a larger up-front investment but ultimately leads to less work. Whether that investment makes sense is a core business question you have to answer.
On the other hand, I have no qualms recommending Haskell in two cases:
- The libraries you need are limited in every language, so Haskell’s smaller ecosystem is no relative disadvantage.
- The bottleneck is modeling your domain and business logic rather than “pure” data science.
As it happens, the project I’m working on hits both cases. While libraries for supply chain optimization do exist, they are pretty limited. A lot of practical problem solving still happens by manually specifying inputs to a mixed-integer linear programming (MILP) solver—and hooking Haskell up to a solver is not that much work.
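As a rough sketch of how little ceremony that hookup can take (hypothetical types, not our actual code; a real project might instead use a solver binding such as glpk-hs): a small linear program can be rendered to the widely supported CPLEX LP text format, which solvers like CBC, GLPK, Gurobi, and CPLEX all accept as input.

```haskell
import Data.List (intercalate)

-- A deliberately minimal constraint: a named sum of (coefficient, variable)
-- terms, interpreted as "terms <= bound". Real formulations also carry
-- bounds, integrality markers, and so on.
data Constraint = Constraint
  { conName  :: String
  , conTerms :: [(Double, String)]
  , conBound :: Double
  }

renderTerms :: [(Double, String)] -> String
renderTerms ts = intercalate " + " [show c ++ " " ++ v | (c, v) <- ts]

-- Render an objective and constraints in (a fragment of) CPLEX LP format.
renderLP :: [(Double, String)] -> [Constraint] -> String
renderLP objective constraints = unlines $
  [ "Maximize"
  , " obj: " ++ renderTerms objective
  , "Subject To"
  ] ++
  [ " " ++ conName c ++ ": " ++ renderTerms (conTerms c)
        ++ " <= " ++ show (conBound c)
  | c <- constraints
  ] ++
  [ "End" ]
```

Writing the result to a file and shelling out to a solver binary is then a few more lines.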
Other problems are solved with custom optimization algorithms; while we started with a well-known approach, we had to make a lot of ad-hoc modifications to account for Target’s real-world processes. We wouldn’t have been able to use an off-the-shelf optimization algorithm (even if good implementations already existed) just because it would not have covered considerations important specifically to our supply chain.
On top of this, one of the biggest bottlenecks, and one of the biggest value propositions for our team, has been simply simulating our supply chain operations. Haskell is great for encoding the complex and sometimes arbitrary business rules our operations follow. With its expressiveness, its capacity for abstraction, and its type system, Haskell provides the powerful suite of tools we need to conquer this complexity. I’ve worked on complex projects in Python and Java and even Racket, and I have no doubt that the tangled mess of logic we have to model would be far less manageable in any other language. Haskell has an affinity for math but, in the end, its type system and “mathiness” help more with domain-specific business code than anything else.
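A toy sketch of that simulation style (the rules here are entirely made up, not Target’s): business rules become total functions over explicit types, and running a simulation is just folding a step function over a stream of events.

```haskell
-- Hypothetical inventory simulation: events arrive, a rule decides the
-- next state, and the whole simulation is a left fold.
data Event = Sale Int | Restock Int

data Inventory = Inventory
  { onHand    :: Int  -- units currently in stock
  , lostSales :: Int  -- demand we could not serve
  } deriving (Show, Eq)

-- One business rule per case; the compiler checks we handle every event.
step :: Inventory -> Event -> Inventory
step inv (Restock n) = inv { onHand = onHand inv + n }
step inv (Sale n)
  | onHand inv >= n = inv { onHand = onHand inv - n }
  | otherwise       = inv { onHand = 0
                          , lostSales = lostSales inv + (n - onHand inv) }

simulate :: Inventory -> [Event] -> Inventory
simulate = foldl step
```

Because every rule is a pure function, individual rules and whole scenarios are trivially unit-testable.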
Finally, to knock off a couple of common questions people have:
- Haskell does not make recruitment harder; it makes it easier. Target’s data science team is surprisingly strong but, let’s be honest, Target does not have the sort of reputation that attracts strong candidates—we’re no Google or Facebook. Despite that, our team has had a far larger stream of eminently capable candidates than we could deal with, and we’ve had a noticeably easier time hiring incredible people than other teams.
- Training people up on Haskell takes time—but it’s eminently manageable. I’m reminded of a great example I heard from an engineer at IMVU when they were starting to use Haskell side-by-side with PHP: it took just as long to train a new PHP developer on their in-house framework and codebase as it took to train that developer in Haskell. Is that a real cost? Yes. Is it significantly worse in Haskell than other languages? No.
All of this comes together for a language that can work well for a data science team—if the circumstances are right.
Of course, my entire answer operates on one core assumption: that it’s an either-or question. It isn’t, at least not with R! Haskell can interoperate directly with R using HaskellR. You can write your abstractions in Haskell and do your stats and visualization in R, and, because of how Haskell and R are linked, Haskell can operate directly on values in the R heap. You can even inline R code directly in your Haskell, referencing Haskell variables. This is much closer integration than a normal FFI provides.
And it all works with Jupyter too: the blog post introducing HaskellR includes a screenshot of R code called from IHaskell.
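As a hedged sketch of what that inline style looks like (based on the `inline-r` package’s quasiquotation interface; this assumes a working R installation, and the details may differ across versions): R code is embedded in a Haskell quasiquote, and Haskell values are spliced in with the `_hs` suffix.

```haskell
{-# LANGUAGE QuasiQuotes #-}
module Main where

import Language.R.Instance (withEmbeddedR, defaultConfig)
import Language.R (runRegion)
import Language.R.QQ (r)

main :: IO ()
main = withEmbeddedR defaultConfig $ runRegion $ do
  -- A Haskell list, spliced into R via the `_hs` antiquotation
  -- and summarized by R itself.
  let xs = [1, 2, 3, 4, 5] :: [Double]
  _ <- [r| print(summary(xs_hs)) |]
  return ()
```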