More dangerous subtleties of JOINs in SQL


Suddenly it's not so clear if the $9.00 is correct or not.

Unlike join duplications, which overstate our results, join misses cause us to understate our results. 

The solution: be a data skeptic

Perhaps the most valuable skill in data analysis is skepticism: what are all the ways my analysis can be wrong? It is all too easy to fit a hypothesis post hoc for why your numbers make sense - far harder is it to sanity check along the way and see if each data transformation confirms your a priori hypothesis. Preventing join duplications and join misses requires a high degree of skepticism.
 

Join duplications

Join duplications have a simple solution: always count the number of rows before and after a join. If you start with 100 transactions and after joining a table you have 140, you need to investigate.

In SQL, this means running a COUNT(*) after each join. If you join several tables (eg. 5+) and don't take any COUNT(*)s along the way, you are taking a large risk. Better to join, check for join duplications, check for join misses, and then incrementally continue joining.

In Python pandas, this is easy to do with a decorator. I use a decorator around pandas.merge() called @guarantee which guarantees that the number of rows going into the join equals the number of rows coming out - else throw an error.

If this guarantee fails, you need some selection mechanism from the right table in order to return just one result (eg. records with the most recent date, lowest/highest ID number). Whatever you do, the values in the join key should be unique. If you don't perform this sampling for unique rows, you'll end up with extra rows and inflated results. 
 

Join misses

Join misses are a little trickier to debug because some NULL values rightly belong in your data, while others are simply introduced by join misses.

You need to investigate these manually. My strategy is to pick a column you expect to be highly populated in the right table and count the number of nulls in that column before and after the join. If they don't match, you have produced join misses and need to investigate if they are legitimate.

In pandas I typically use Series.value_counts(dropna=False) to check the distribution of data in a column before and after joins. If this looks like exploratory data analysis (EDA), that's because it is. You should always be doing EDA even after you initially vet your data set - this is part of the sanity checking process.

Conclusion

A prudent analyst must always be aware of the potential for join duplications and misses. You may have join misses (say in pandas) if you have two columns which are different data types - their values may be the same but the data types themselves are not comparable. You may have join duplicates (in SQL) if your data is case-sensitive because an "A" will be treated the same way as an "a", violating the uniqueness we require on the join key.

One thing I am always surprised to see omitted in introductory SQL courses is an admission of just how tricky joins are. Unless you can safely assume that your data model is perfect (ie. never), joins are extremely prone to introducing errors. If you join data but join it incorrectly (and produce inaccurate results), you are better off not having joined the data at all. Joins should not be used unless you know the risks that come with them. 

SQL joins are not easy, nor are they meant to be intuitive. They require critical thought and foresight. If there's one thing constantly I remind myself when analyzing data, it's to take my time and think about all the ways I could be wrong.