Joining CSV and JSON data with an in-memory SQLite database


The new sqlite-utils memory command can import CSV and JSON data directly into an in-memory SQLite database, combine and query it using SQL and output the results as CSV, JSON or various other formats of plain text tables.

sqlite-utils memory

The new feature is part of sqlite-utils 3.10, released this morning.

I’ve recorded this video demonstrating the new feature—with full accompanying notes below.

sqlite-utils already offers a mechanism for importing CSV and JSON data into a SQLite database file, in the form of the sqlite-utils insert command. Processing data with this involves two steps: first import it into a temp.db file, then use sqlite-utils query to run queries and output the results.
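That two-step pattern looks something like this (the file and table names here are hypothetical):

$ sqlite-utils insert temp.db mydata data.csv --csv
$ sqlite-utils query temp.db 'select * from mydata limit 5'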

Using SQL to re-shape data is really useful—since sqlite-utils can output in multiple different formats, I frequently find myself loading in a CSV file and exporting it back out as JSON, or vice-versa.

This week I realized that I had most of the pieces in place to reduce this to a single step. The new sqlite-utils memory command (full documentation here) operates against a temporary, in-memory SQLite database. It can import data, execute SQL and output the result in a one-liner, without needing any temporary database files along the way.

Here’s an example. My Dogsheep GitHub organization has a number of repositories. GitHub make those available via an authentication-optional API endpoint at https://api.github.com/users/dogsheep/repos—which returns JSON that looks like this (simplified):

[ { "id": 197431109, "name": "dogsheep-beta", "full_name": "dogsheep/dogsheep-beta", "size": 61, "stargazers_count": 79, "watchers_count": 79, "forks": 0, "open_issues": 11 }, { "id": 256834907, "name": "dogsheep-photos", "full_name": "dogsheep/dogsheep-photos", "size": 64, "stargazers_count": 116, "watchers_count": 116, "forks": 5, "open_issues": 18 }
]

With sqlite-utils memory we can see the 3 most popular repos by number of stars like this:

$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - '
    select full_name, forks_count, stargazers_count as stars
    from stdin order by stars desc limit 3
  ' -t
full_name                       forks_count    stars
--------------------------  -------------  -------
dogsheep/twitter-to-sqlite             12      225
dogsheep/github-to-sqlite              14      139
dogsheep/dogsheep-photos                5      116

We’re using curl to fetch the JSON and pipe it into sqlite-utils memory—the - means “read from standard input”. Then we pass the following SQL query:

select full_name, forks_count, stargazers_count as stars
from stdin order by stars desc limit 3

stdin is the temporary table created for the data piped into the tool. The query selects three of the JSON properties, renames stargazers_count to stars, sorts by stars in descending order and returns the first three rows.
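If you're curious about the schema of that temporary table, you can ask SQLite itself by querying sqlite_master, which holds the CREATE TABLE statements (a quick sketch, reusing the same piped data):

$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - 'select sql from sqlite_master'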

The -t option here means “output as a formatted table”—without that option we get JSON:

$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - '
    select full_name, forks_count, stargazers_count as stars
    from stdin order by stars desc limit 3
  '
[{"full_name": "dogsheep/twitter-to-sqlite", "forks_count": 12, "stars": 225},
 {"full_name": "dogsheep/github-to-sqlite", "forks_count": 14, "stars": 139},
 {"full_name": "dogsheep/dogsheep-photos", "forks_count": 5, "stars": 116}]

Or we can use --csv to get back CSV:

$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - '
    select full_name, forks_count, stargazers_count as stars
    from stdin order by stars desc limit 3
  ' --csv
full_name,forks_count,stars
dogsheep/twitter-to-sqlite,12,225
dogsheep/github-to-sqlite,14,139
dogsheep/dogsheep-photos,5,116

The -t option supports a number of different output formats, specified using --fmt. If I wanted to generate a LaTeX table of the top repos by stars I could do this:

$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - '
    select full_name, forks_count, stargazers_count as stars
    from stdin order by stars desc limit 3
  ' -t --fmt=latex
\begin{tabular}{lrr}
\hline
 full\_name                  &   forks\_count &   stars \\
\hline
 dogsheep/twitter-to-sqlite &            12 &     225 \\
 dogsheep/github-to-sqlite  &            14 &     139 \\
 dogsheep/dogsheep-photos   &             5 &     116 \\
\hline
\end{tabular}

We can run aggregate queries too—let’s add up the total size and total number of stars across all of those repositories:

$ curl -s 'https://api.github.com/users/dogsheep/repos' \
  | sqlite-utils memory - '
    select sum(size), sum(stargazers_count) from stdin
  ' -t
  sum(size)    sum(stargazers_count)
-----------  ------------------------
        843                       934

(I believe size here is measured in kilobytes: the GitHub API documentation isn’t clear on this point.)

Joining across different files

All of these examples have worked with JSON data piped into the tool—but you can also pass one or more files, of different formats, in a way that lets you execute joins against them.

As an example, let's combine two sources of data.

The New York Times publish a us-states.csv file with Covid cases and deaths by state over time.

The CDC have an undocumented JSON endpoint (which I’ve been archiving here) tracking the progress of vaccination across different states.

We’re going to run a join from that CSV data to that JSON data, and output a table of results.

First, we need to download the files. The CDC JSON data isn’t quite in the right shape for our purposes:

{ "runid": 2023, "vaccination_data": [ { "Date": "2021-06-19", "Location": "US", "ShortName": "USA", ...

sqlite-utils expects a flat JSON array of objects—we can use jq to re-shape the data like so:

$ curl 'https://covid.cdc.gov/covid-data-tracker/COVIDData/getAjaxData?id=vaccination_data' \
  | jq .vaccination_data > vaccination_data.json

The New York Times data is good as is:

$ wget 'https://github.com/nytimes/covid-19-data/raw/master/us-states.csv'

Now that we have the data locally, we can run a join to combine it using the following command:

$ sqlite-utils memory us-states.csv vaccination_data.json "
  select
    max(t1.date),
    t1.state,
    t1.cases,
    t1.deaths,
    t2.Census2019,
    t2.Dist_Per_100K
  from t1
    join t2 on t1.state = replace(t2.LongName, 'New York State', 'New York')
  group by t1.state
  order by Dist_Per_100K desc
" -t
max(t1.date)    state                   cases    deaths    Census2019    Dist_Per_100K
--------------  --------------------  -------  --------  ------------  ---------------
2021-06-18      District of Columbia    49243      1141        705749           149248
2021-06-18      Vermont                 24360       256        623989           146257
2021-06-18      Rhode Island           152383      2724       1059361           141291
2021-06-18      Massachusetts          709263     17960       6892503           139692
2021-06-18      Maryland               461852      9703       6045680           138193
2021-06-18      Maine                   68753       854       1344212           136894
2021-06-18      Hawaii                  35903       507       1415872           136024
...

I’m using automatically created numeric aliases t1 and t2 for the files here, but I can also use their full table names "us-states" (quotes needed due to the hyphen) and vaccination_data instead.
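For instance, a simplified version of that join written against the full table names might look like this (a sketch that skips the New York replace() fix for brevity):

$ sqlite-utils memory us-states.csv vaccination_data.json '
  select "us-states".state, vaccination_data.Dist_Per_100K
  from "us-states"
    join vaccination_data on "us-states".state = vaccination_data.LongName
  limit 3
' -t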

The replace() operation there is needed because the vaccination_data.json file calls New York “New York State” while the us-states.csv file just calls it “New York”.

Combining max(t1.date) with group by t1.state takes advantage of a useful SQLite trick: if you perform a group by and then ask for the max() of a column, the other bare columns returned from that table will come from the row that contains that maximum value.
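Here's a tiny self-contained illustration of that behaviour, using made-up inline CSV (sqlite-utils memory should detect the piped format automatically):

$ echo 'state,date,cases
Vermont,2021-06-17,24350
Vermont,2021-06-18,24360
Maine,2021-06-18,68753' \
  | sqlite-utils memory - '
    select max(date), state, cases from stdin group by state
  ' -t

Each state's cases value should come from its most recent row, so Vermont reports 24360 rather than 24350.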

This demo is a bit of a stretch—once I reach this level of complexity I’m more likely to load the files into a SQLite database file on disk and open them up in Datasette—but it’s a fun example of a more complex join in action.

Also in sqlite-utils 3.10

The sqlite-utils memory command has another new trick up its sleeve: it automatically detects which columns in a CSV or TSV file contain integer or float values and creates the corresponding in-memory SQLite table with the correct types. This ensures max() and sum() and order by work in a predictable manner, without accidentally sorting 1 as higher than 11.
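A quick way to see that detection in action, with made-up CSV values that would sort incorrectly as text:

$ echo 'name,size
one,1
eleven,11
two,2' \
  | sqlite-utils memory - 'select name, size from stdin order by size desc' -t

Because size is detected as an integer column, 11 sorts above 2; treated as text, "2" would sort above "11".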

I didn’t want to break backwards compatibility for existing users of the sqlite-utils insert command so I’ve added type detection there as a new option, --detect-types or -d for short:

$ sqlite-utils insert my.db us_states us-states.csv --csv -d
  [####################################]  100%
$ sqlite-utils schema my.db
CREATE TABLE "us_states" ( [date] TEXT, [state] TEXT, [fips] INTEGER, [cases] INTEGER, [deaths] INTEGER
);

There’s more in the changelog.

Releases this week

TIL this week