2018’s Top 7 Libraries and Packages for Data Science and AI: Python & R

By Favio Vázquez

If you follow me, you know that this year I started a series called Weekly Digest for Data Science and AI: Python & R, where I highlighted the best libraries, repos, packages, and tools that help us be better data scientists for all kinds of tasks.

The great folks at Heartbeat sponsored a lot of these digests, and they asked me to create a list of the best of the best—those libraries that really changed or improved the way we worked this year (and beyond).

If you want to read the past digests, take a look here:

Disclaimer: This list is based on the libraries and packages I reviewed in my personal newsletter. All of them were trending in one way or another among programmers, data scientists, and AI enthusiasts. Some of them were created before 2018, but if they were trending this year, they still made the cut.
https://github.com/tensorflow/adanet

AdaNet is a lightweight and scalable TensorFlow AutoML framework for training and deploying adaptive neural networks using the AdaNet algorithm [Cortes et al. ICML 2017]. AdaNet combines several learned subnetworks in order to mitigate the complexity inherent in designing effective neural networks.

This package will help you select optimal neural network architectures by implementing an adaptive algorithm that learns a neural architecture as an ensemble of subnetworks.

You’ll need to know TensorFlow to use the package because it implements a TensorFlow Estimator, but that also simplifies your machine learning programming by encapsulating training, evaluation, prediction, and export for serving.

You can build an ensemble of neural networks, and the library will help you optimize an objective that balances the trade-offs between the ensemble’s performance on the training set and its ability to generalize to unseen data.

adanet depends on bug fixes and enhancements not present in TensorFlow releases prior to 1.7. You must install or upgrade your TensorFlow package to at least 1.7:

$ pip install "tensorflow>=1.7.0"

To install from source, you’ll first need to install Bazel, following its installation instructions.

Next, clone adanet and cd into its root directory:

$ git clone https://github.com/tensorflow/adanet && cd adanet

From the adanet root directory run the tests:

$ cd adanet
$ bazel test -c opt //...

Once you have verified that everything works well, install adanet as a pip package.

You’re now ready to experiment with adanet.

import adanet
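To give you a rough idea of what experimenting looks like, here’s a minimal, hypothetical sketch (not taken from the adanet docs or the original post) that wraps two canned estimators as candidates for an adanet.AutoEnsembleEstimator. The head, feature columns, and input functions are assumptions you’d adapt to your own data, and exact signatures can vary across adanet/TensorFlow versions:

# Hedged sketch: assumes TensorFlow 1.x-style canned estimators.
import adanet
import tensorflow as tf

# Hypothetical feature columns for a flat numeric input named "x".
feature_columns = [tf.feature_column.numeric_column("x", shape=[784])]
head = tf.contrib.estimator.multi_class_head(n_classes=10)

estimator = adanet.AutoEnsembleEstimator(
    head=head,
    # Candidate subnetworks AdaNet will consider when growing the ensemble.
    candidate_pool=[
        tf.contrib.estimator.LinearEstimator(
            head=head, feature_columns=feature_columns),
        tf.contrib.estimator.DNNEstimator(
            head=head, feature_columns=feature_columns,
            hidden_units=[64, 32]),
    ],
    # Train steps to spend on each AdaNet iteration.
    max_iteration_steps=1000)

# train_input_fn / eval_input_fn stand in for your own input pipelines:
# estimator.train(input_fn=train_input_fn, max_steps=5000)
# estimator.evaluate(input_fn=eval_input_fn)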

Here you can find two examples of how to use the package:

You can read more about it in the original blog post:

https://github.com/EpistasisLab/tpot

Previously I talked about Auto-Keras, a great library for AutoML in the Pythonic world. Well, I have another very interesting tool for that.

The name is TPOT (Tree-based Pipeline Optimization Tool), and it’s an amazing library. It’s basically a Python automated machine learning tool that optimizes machine learning pipelines using genetic programming.

TPOT can automate a lot of stuff like feature selection, model selection, feature construction, and much more. Luckily, if you’re a Python machine learner, TPOT is built on top of Scikit-learn, so all of the code it generates should look familiar.

What it does is automate the most tedious parts of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data, and then it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.

This is how it works:

For more details, you can read these great articles by Matthew Mayo:

and Randy Olson:

You actually need to follow some instructions before installing TPOT. Here they are:

After that you can just run:

pip install tpot

First let’s start with the basic Iris dataset:
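The notebook from the original post isn’t embedded here, but the pipeline looks roughly like the Iris example in the TPOT documentation; treat the settings below (generations, population_size) as illustrative rather than the exact ones used in the post:

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data.astype(np.float64), iris.target.astype(np.float64),
    train_size=0.75, test_size=0.25, random_state=42)

# Let TPOT search for a good pipeline with genetic programming.
tpot = TPOTClassifier(generations=5, population_size=50,
                      verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the best pipeline TPOT found as a standalone Python script.
tpot.export('tpot_iris_pipeline.py')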

Here we built a very basic TPOT pipeline that searches for the best ML pipeline to predict iris.target, and then we export that pipeline. After that, all we have to do is load the .py file TPOT generated, and you’ll see:

import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(
    RBFSampler(gamma=0.8500000000000001),
    DecisionTreeClassifier(criterion="entropy", max_depth=3, min_samples_leaf=4, min_samples_split=9)
)
exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)

And that’s it. You built a classifier for the Iris dataset in a simple but powerful way.

Let’s move on to the MNIST dataset now:
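Again, the original notebook isn’t embedded here, but the TPOT docs have an MNIST-style example that uses scikit-learn’s bundled digits dataset and looks roughly like this (settings are illustrative):

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50,
                      verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the winning pipeline, just like before.
tpot.export('tpot_mnist_pipeline.py')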

As you can see, we did the same thing! Load the .py file TPOT generated and you’ll see:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR', dtype=np.float64)
features = np.delete(tpot_data.view(np.float64).reshape(tpot_data.size, -1), tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = KNeighborsClassifier(n_neighbors=4, p=2, weights="distance")
exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)

Super easy and fun. Check it out, try it, and please give them a star!

https://github.com/ironmussa/Optimus

Ok, so full disclosure, this library is like my baby. I’ve been working on it for a long time now, and I’m very happy to show you version 2.

Optimus V2 was created to make data cleaning a breeze. The API was designed to be super easy for newcomers and very familiar for people that come from working with pandas. Optimus expands the Spark DataFrame functionality, adding .rows and .cols attributes.

With Optimus you can clean your data, prepare it, analyze it, create profilers and plots, and perform machine learning and deep learning, all in a distributed fashion, because on the back-end we have Spark, TensorFlow, and Keras.

It’s super easy to use. It’s like the evolution of pandas, with a bit of dplyr, joined by Keras and Spark. The code you create with Optimus will work on your local machine, and with a simple change of masters, it can run on your local cluster or in the cloud.

You will see a lot of interesting functions created to help with every step of the data science cycle.

Optimus is perfect as a companion for an agile methodology for data science because it can help you in almost all the steps of the process, and it can easily connect to other libraries and tools.

If you want to read more about an Agile DS Methodology check this out:

pip install optimuspyspark

As one example, you can load data from a url, transform it, and apply some predefined cleaning functions:

from optimus import Optimus

op = Optimus()

# This is a custom function
def func(value, arg):
    return "this was a number"

df = op.load.url("https://raw.githubusercontent.com/ironmussa/Optimus/master/examples/foo.csv")

df\
    .rows.sort("product", "desc")\
    .cols.lower(["firstName", "lastName"])\
    .cols.date_transform("birth", "new_date", "yyyy/MM/dd", "dd-MM-YYYY")\
    .cols.years_between("birth", "years_between", "yyyy/MM/dd")\
    .cols.remove_accents("lastName")\
    .cols.remove_special_chars("lastName")\
    .cols.replace("product", "taaaccoo", "taco")\
    .cols.replace("product", ["piza", "pizzza"], "pizza")\
    .rows.drop(df["id"] < 7)\
    .cols.drop("dummyCol")\
    .cols.rename(str.lower)\
    .cols.apply_by_dtypes("product", func, "string", data_type="integer")\
    .cols.trim("*")\
    .show()

You can transform this:

into this:

Pretty cool, right?

You can do a thousand more things with the library, so please check it out:

https://spacy.io/

From the creators:

spaCy is designed to help you do real work — to build real products, or gather real insights. The library respects your time, and tries to avoid wasting it. It’s easy to install, and its API is simple and productive. We like to think of spaCy as the Ruby on Rails of Natural Language Processing.

spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, Scikit-learn, Gensim, and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.

pip3 install spacy
python3 -m spacy download en

Here, we’re also downloading the English language model. You can find models for German, Spanish, Italian, Portuguese, French, and more here:

Here’s an example from the main webpage:

# python -m spacy download en_core_web_sm
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_sm')

# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at "
        u"Google in 2007, few people outside of the company took him "
        u"seriously. “I can tell you very senior CEOs of major American "
        u"car companies would shake my hand and turn away because I wasn’t "
        u"worth talking to,” said Thrun, now the co-founder and CEO of "
        u"online higher education startup Udacity, in an interview with "
        u"Recode earlier this week.")
doc = nlp(text)

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

# Determine semantic similarities
doc1 = nlp(u"my fries were super gross")
doc2 = nlp(u"such disgusting fries")
similarity = doc1.similarity(doc2)
print(doc1.text, doc2.text, similarity)

In this example, we first load the English tokenizer, tagger, parser, NER, and word vectors (downloaded beforehand with the command in the first comment). Then we process some text, print the named entities it finds, and finally determine the semantic similarity of two phrases. If you run this code, you get this:

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATE
my fries were super gross such disgusting fries 0.7139701635071919

Very simple and super useful. There is also a spaCy Universe, where you can find great resources developed with or for spaCy. It includes standalone packages, plugins, extensions, educational materials, operational utilities, and bindings for other languages:

By the way, the usage page is great, with very good explanations and code:

Also take a look at the visualizers page. There are some awesome features here:
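As a quick, hedged taste of those visualizers (this snippet is mine, not lifted verbatim from the spaCy docs), displaCy can render the named entities of a processed document. Inside a notebook you pass jupyter=True, while in a plain script you’d use displacy.serve instead:

# Minimal sketch of spaCy's built-in displaCy visualizer.
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"When Sebastian Thrun started working on self-driving cars at "
          u"Google in 2007, few people outside of the company took him seriously.")

# Highlight named entities inline in a Jupyter notebook.
displacy.render(doc, style='ent', jupyter=True)

# Outside a notebook, serve the visualization in the browser instead:
# displacy.serve(doc, style='dep')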

For me, Jupytext is one of the packages of the year. It tackles such an important part of what we do as data scientists: almost all of us work in notebooks like Jupyter, but we also use IDEs like PyCharm for the more hardcore parts of our projects.

The good news is that plain scripts, which you can draft and test in your favorite IDE, open transparently as notebooks in Jupyter when using Jupytext. Run the notebook in Jupyter to generate the outputs, associate an .ipynb representation, and save and share your research as either a plain script or as a traditional Jupyter notebook with outputs.

You can see a workflow of what you can do with the package in the gif below:

Install Jupytext with:

pip install jupytext --upgrade

Then, configure Jupyter to use Jupytext:

  • generate a Jupyter config, if you don’t have one yet, with jupyter notebook --generate-config
  • edit .jupyter/jupyter_notebook_config.py and append the following:
c.NotebookApp.contents_manager_class = "jupytext.TextFileContentsManager"
  • and restart Jupyter, i.e. run:
jupyter notebook
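Once that’s configured, a plain .py script written in one of the formats Jupytext understands will open as a notebook in Jupyter. Here’s a small, hypothetical example in the “percent” cell format (the file name and contents are just for illustration):

# analysis.py: a plain script that Jupytext opens as a notebook.

# %% [markdown]
# # Quick look at some data
# This markdown cell is rendered as a text cell in Jupyter.

# %%
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]})
df.describe()

# %%
# Each "# %%" marker starts a new code cell.
df["y"].sum()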

You can give it a try here:

https://xkcd.com/1945/

Chartify, for me, is the winner of the year for Python. If you’re in the Python world, you most likely waste a lot of time trying to create a decent plot. Luckily, we have libraries like Seaborn that make our lives easier. But the issue is that their plots aren’t dynamic.

Then you have Bokeh—an amazing library—but creating interactive plots with it can be a pain in the a**. If you want to know more about Bokeh and interactive plots for Data Science, take a look at these great articles by William Koehrsen:

Chartify is built on top of Bokeh, but it’s also so much simpler.

From the authors:

  • Consistent input data format: Spend less time transforming data to get your charts to work. All plotting functions use a consistent tidy input data format.
  • Smart default styles: Create pretty charts with very little customization required.
  • Simple API: We’ve attempted to make the API as intuitive and easy to learn as possible.
  • Flexibility: Chartify is built on top of Bokeh, so if you do need more control you can always fall back on Bokeh’s API.

1. Chartify can be installed via pip:
pip3 install chartify

2. Install chromedriver requirement (Optional. Needed for PNG output):

  • Install Google Chrome.
  • Download the appropriate version of chromedriver for your OS here.
  • Copy the executable file to a directory within your PATH.
  • View directories in your PATH variable: echo $PATH
  • Copy chromedriver to the appropriate directory, e.g.: cp chromedriver /usr/local/bin

Let’s say we want to create this chart:

import pandas as pd
import chartify
# Generate example data
data = chartify.examples.example_data()

Now that we have some example data loaded, let’s do some transformations:

total_quantity_by_month_and_fruit = (data.groupby(
    [data['date'] + pd.offsets.MonthBegin(-1), 'fruit'])['quantity'].sum()
    .reset_index().rename(columns={'date': 'month'})
    .sort_values('month'))
print(total_quantity_by_month_and_fruit.head())

       month   fruit  quantity
0 2017-01-01   Apple         7
1 2017-01-01  Banana         6
2 2017-01-01   Grape         1
3 2017-01-01  Orange         2
4 2017-02-01   Apple         8

And now we can plot it:

# Plot the data
ch = chartify.Chart(blank_labels=True, x_axis_type='datetime')
ch.set_title("Stacked area")
ch.set_subtitle("Represent changes in distribution.")
ch.plot.area(
    data_frame=total_quantity_by_month_and_fruit,
    x_column='month',
    y_column='quantity',
    color_column='fruit',
    stacked=True)
ch.show('png')

It’s super easy to create a plot, and it’s interactive. If you want more examples to create stuff like this:

And more, check out the original repo:

https://github.com/tidymodels/infer

Inference, or statistical inference, is the process of using data analysis to deduce properties of an underlying probability distribution.

The objective of this package is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework.

To install the current stable version of infer from CRAN:

install.packages("infer")

Let’s try a simple example on the mtcars dataset to see what the library can do for us.

First, let’s overwrite mtcars so that the variables cyl, vs, am, gear, and carb are factors.

library(infer)
library(dplyr)

mtcars <- mtcars %>%
  mutate(cyl = factor(cyl),
         vs = factor(vs),
         am = factor(am),
         gear = factor(gear),
         carb = factor(carb))

# For reproducibility
set.seed(2018)

We’ll try hypothesis testing. Here, a hypothesis is proposed so that it’s testable on the basis of observing a process that’s modeled via a set of random variables. Normally, two statistical data sets are compared, or a data set obtained by sampling is compared against a synthetic data set from an idealized model.

mtcars %>%
  specify(response = mpg) %>% # formula alt: mpg ~ NULL
  hypothesize(null = "point", med = 26) %>%
  generate(reps = 100, type = "bootstrap") %>%
  calculate(stat = "median")

Here we first specify the response variable, then declare a point null hypothesis (that the median is 26). After that, we generate resamples using the bootstrap and finally calculate the median of each resample. The result of that code is:

## # A tibble: 100 x 2
##    replicate  stat
##        <int> <dbl>
##  1         1  26.6
##  2         2  25.1
##  3         3  25.2
##  4         4  24.7
##  5         5  24.6
##  6         6  25.8
##  7         7  24.7
##  8         8  25.6
##  9         9  25.0
## 10        10  25.1
## # ... with 90 more rows

One of the greatest parts of this library is the visualize function, which lets you visualize the distribution of the simulation-based inferential statistics or the theoretical distribution (or both). As an example, let’s use the flights dataset. First, let’s do some data preparation:

library(nycflights13)
library(dplyr)
library(ggplot2)
library(stringr)
library(infer)

set.seed(2017)

fli_small <- flights %>%
  na.omit() %>%
  sample_n(size = 500) %>%
  mutate(season = case_when(
    month %in% c(10:12, 1:3) ~ "winter",
    month %in% c(4:9) ~ "summer"
  )) %>%
  mutate(day_hour = case_when(
    between(hour, 1, 12) ~ "morning",
    between(hour, 13, 24) ~ "not morning"
  )) %>%
  select(arr_delay, dep_delay, season,
         day_hour, origin, carrier)

And now we can run a randomization approach to the χ² statistic:

chisq_null_distn <- fli_small %>%
  specify(origin ~ season) %>% # alt: response = origin, explanatory = season
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")

# obs_chisq is the observed chi-squared statistic for fli_small,
# computed beforehand (see the package vignette for the full example)
chisq_null_distn %>% visualize(obs_stat = obs_chisq, direction = "greater")

or see the theoretical distribution:

fli_small %>%
  specify(origin ~ season) %>%
  hypothesize(null = "independence") %>%
  # generate() ## Not used for theoretical
  calculate(stat = "Chisq") %>%
  visualize(method = "theoretical", obs_stat = obs_chisq, direction = "right")

For more on this package visit:

https://github.com/sfirke/janitor

Data cleansing is a topic very close to me. I’ve been working with my team at Iron-AI to create a tool for Python called Optimus. You can see more about it here:

But this tool I’m showing you is a very cool package with simple functions for data cleaning.

It has three main functions:

  • perfectly format data.frame column names;
  • create and format frequency tables of one, two, or three variables (think an improved table()); and
  • isolate partially-duplicate records.

Oh, and it’s a tidyverse-oriented package. Specifically, it works nicely with the %>% pipe and is optimized for cleaning data brought in with the readr and readxl packages.

install.packages("janitor")

I’m using the example from the repo, with the dirty_data.xlsx file it provides.

library(pacman) # for loading packages
p_load(readxl, janitor, dplyr, here)
roster_raw <- read_excel(here("dirty_data.xlsx")) # available at http://github.com/sfirke/janitor
glimpse(roster_raw)
#> Observations: 13
#> Variables: 11
#> $ `First Name` <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", N...
#> $ `Last Name` <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lam...
#> $ `Employee Status` <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Tea...
#> $ Subject <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English",...
#> $ `Hire Date` <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 347...
#> $ `% Allocated` <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80
#> $ `Full time?` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", ...
#> $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
#> $ Certification <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science ...
#> $ Certification__1 <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", N...
#> $ Certification__2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA

With this:

roster <- roster_raw %>%
  clean_names() %>%
  remove_empty(c("rows", "cols")) %>%
  mutate(hire_date = excel_numeric_to_date(hire_date),
         cert = coalesce(certification, certification_1)) %>% # from dplyr
  select(-certification, -certification_1) # drop unwanted columns

The clean_names() function standardizes the messy column names. Then we remove the empty rows and columns, and, using dplyr, we convert the Excel serial numbers in hire_date to proper dates, create a new cert column that combines certification and certification_1, and then drop the originals.

And with this piece of code…

roster %>% get_dupes(first_name, last_name)

we can find duplicated records that share the same first and last name.

The package also introduces the tabyl function, which tabulates data like table() but is pipeable, data.frame-based, and fully featured. For example:

roster %>%
  tabyl(subject)
#>     subject n    percent valid_percent
#>  Basketball 1 0.08333333           0.1
#>   Chemistry 1 0.08333333           0.1
#>        Dean 1 0.08333333           0.1
#>    Drafting 1 0.08333333           0.1
#>     English 2 0.16666667           0.2
#>       Music 1 0.08333333           0.1
#>          PE 1 0.08333333           0.1
#>     Physics 1 0.08333333           0.1
#>     Science 1 0.08333333           0.1
#>        <NA> 2 0.16666667            NA

You can do a lot more things with the package, so visit their site and give them some love :)

https://github.com/dreamRs/esquisse

This add-in allows you to interactively explore your data by visualizing it with the ggplot2 package. It allows you to draw bar graphs, curves, scatter plots, and histograms, and then export the graph or retrieve the code generating the graph.

Install from CRAN with:

# From CRAN
install.packages("esquisse")

Then launch the add-in via the RStudio menu. If you don’t have a data.frame in your environment, the datasets in ggplot2 are used.

ggplot2 builder addin

Launch the add-in via the RStudio menu or with:

esquisse::esquisser()

The first step is to choose a data.frame:

Or you can use a dataset directly with:

esquisse::esquisser(data = iris)

After that, you can drag and drop variables to create a plot:

You can find information about the package and sub-menus in the original repo:

https://github.com/rstudio/sparklyr

Sparklyr will allow you to:

  • Connect to Spark from R. The sparklyr package provides a 
    complete dplyr backend.
  • Filter and aggregate Spark datasets, and then bring them into R for 
    analysis and visualization.
  • Use Spark’s distributed machine learning library from R.
  • Create extensions that call the full Spark API and provide 
    interfaces to Spark packages.

You can install the sparklyr package from CRAN as follows:

install.packages("sparklyr")

You should also install a local version of Spark for development purposes:

library(sparklyr)
spark_install(version = "2.3.1")

The first part of using Spark is always creating a context and connecting to a local or remote cluster.

Here we’ll connect to a local instance of Spark via the spark_connect function:

library(sparklyr)
sc <- spark_connect(master = "local")

We’ll start by copying some datasets from R into the Spark cluster (note that you may need to install the nycflights13 and Lahman packages in order to execute this code):

install.packages(c("nycflights13", "Lahman"))
library(dplyr)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
src_tbls(sc)
## [1] "batting" "flights" "iris"

To start with, here’s a simple filtering example:

# filter by departure delay and print the first few records
flights_tbl %>% filter(dep_delay == 2)
## # Source:   lazy query [?? x 19]
## # Database: spark_connection
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      542            540         2      923
##  3  2013     1     1      702            700         2     1058
##  4  2013     1     1      715            713         2      911
##  5  2013     1     1      752            750         2     1025
##  6  2013     1     1      917            915         2     1206
##  7  2013     1     1      932            930         2     1219
##  8  2013     1     1     1028           1026         2     1350
##  9  2013     1     1     1042           1040         2     1325
## 10  2013     1     1     1231           1229         2     1523
## # ... with more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Let’s plot the data on flight delays:

delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect

# plot delays
library(ggplot2)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)
## `geom_smooth()` using method = 'gam'

You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within Sparklyr. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.

Here’s an example where we use ml_linear_regression to fit a linear regression model. We’ll use the built-in mtcars dataset to see if we can predict a car’s fuel consumption (mpg) based on its weight (wt), and the number of cylinders the engine contains (cyl). We’ll assume in each case that the relationship between mpg and each of our features is linear.

# copy mtcars into spark
mtcars_tbl <- copy_to(sc, mtcars)

# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

# fit a linear model to the training dataset
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
fit
## Call: ml_linear_regression.tbl_spark(., response = "mpg", features = c("wt", "cyl"))
##
## Formula: mpg ~ wt + cyl
##
## Coefficients:
## (Intercept)          wt         cyl
##   33.499452   -2.818463   -0.923187

For linear regression models produced by Spark, we can use summary() to learn a bit more about the quality of our fit and the statistical significance of each of our predictors.

summary(fit)
## Call: ml_linear_regression.tbl_spark(., response = "mpg", features = c("wt", "cyl"))
##
## Deviance Residuals:
##    Min     1Q Median     3Q    Max
## -1.752 -1.134 -0.499  1.296  2.282
##
## Coefficients:
## (Intercept)          wt         cyl
##   33.499452   -2.818463   -0.923187
##
## R-Squared: 0.8274
## Root Mean Squared Error: 1.422

Spark machine learning supports a wide array of algorithms and feature transformations, and as illustrated above, it’s easy to chain these functions together with dplyr pipelines.

Check out more about machine learning with sparklyr here:

And more information in general about the package and examples here:

Drake programming

Nope, just kidding. But the name of the package is drake!

https://github.com/ropensci/drake

This is such an amazing package. I’ll create a separate post with more details about it, so wait for that!

Drake is a package created as a general-purpose workflow manager for data-driven tasks. It rebuilds intermediate data objects when their dependencies change, and it skips work when the results are already up to date.

Also, not every run-through starts from scratch, and completed workflows have tangible evidence of reproducibility.

Reproducibility, good management, and tracking experiments are all necessary for easily testing others’ work and analysis. It’s a huge deal in Data Science, and you can read more about it here:

From Zach Scott:

And in an article by me :)

With drake, you can automatically:

  1. Launch the parts that changed since last time.
  2. Skip the rest.

# Install the latest stable release from CRAN.
install.packages("drake")
# Alternatively, install the development version from GitHub.
install.packages("devtools")
library(devtools)
install_github("ropensci/drake")

There are some known errors when installing from CRAN. For more on these errors, visit:

I ran into one of these errors myself, so for now I recommend installing the package from GitHub.

Ok, so let’s reproduce a simple example with a twist:

I added a simple plot to see the linear model within drake’s main example. With this code, you’re creating a plan for executing your whole project.

First, we read the data. Then we prepare it for analysis, create a simple histogram, calculate the correlation, fit the model, plot the linear model, and finally create an R Markdown report.

The code I used for the final report is here:

If we change some of our functions or analysis, when we execute the plan, drake will know what has changed and will only run those changes. It creates a graph so you can see what’s happening:

Graph for analysis

In RStudio, this graph is interactive, and you can save it as HTML for later analysis.

There are more awesome things that you can do with drake that I’ll show in a future post :)