Analyzing Hacker News book suggestions in Python

By Alessandro Mozzato

A few days ago the traditional “what books did you read this year” thread popped up on Hacker News. The thread is full of very nice book suggestions. While putting together a reading list for next year, I thought it would be fun to get the data and analyze it. In this article I will show how I used the Hacker News API to scrape the posts' content, how I selected the most common titles and checked them against the Goodreads API, and finally how I came up with the definitive top 20 most recommended books. As always, dealing with text data is anything but straightforward. The final result, however, is quite satisfying!

The first step is getting the data. Luckily, Hacker News provides a very nice API to freely scrape all of its content. The API has endpoints for posts, users, top posts and a few others. For this article we will use the one for posts. It’s very simple to use. Here is the basic syntax: v0/item/{id}.json, where id is the id of the item we are interested in. In this case the thread’s id is 18661546, so here is an example of how to get the main page data:

import requests
main_page = requests.get('https://hackernews.firebaseio.com/v0/item/18661546.json').json()

The same API call is also used for the sub posts of a thread or a post, whose ids can be found in the kids key of the parent post. Looping over the kids we can get the text of every post in the thread.
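
The loop over the kids can be sketched as follows. This is a minimal sketch under stated assumptions, not the code from the original analysis: the names fetch_item and collect_texts are hypothetical, the fetcher uses only the standard library (urllib) so that it runs without extra packages, and it is passed in as a parameter so the traversal can be tried on fake data without network access. It assumes each item is a dict with optional text and kids keys.

```python
import json
from urllib.request import urlopen

HN_ITEM_URL = "https://hackernews.firebaseio.com/v0/item/{}.json"


def fetch_item(item_id):
    """Fetch a single Hacker News item (post or comment) as a dict."""
    with urlopen(HN_ITEM_URL.format(item_id), timeout=10) as response:
        return json.load(response)


def collect_texts(item_id, fetch=fetch_item):
    """Return the text of item_id and of all its descendants, depth first."""
    item = fetch(item_id)
    # Missing items come back as None; deleted ones may lack a 'text' key.
    if not item:
        return []
    texts = [item["text"]] if item.get("text") else []
    for kid_id in item.get("kids", []):
        texts.extend(collect_texts(kid_id, fetch))
    return texts
```

Calling collect_texts(18661546) would walk the whole thread, one request per post, so it can take a while on a large thread. Note that the result also includes the text of the root post, if it has any.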

Now that we have the text data we want to extract book titles from it. One possible approach would be to look for all Amazon or Goodreads links in the thread and just group by those. This is a clean approach because it doesn’t depend on any text processing. However, just from taking a quick look at the thread it is clear that the vast majority of suggestions do not have any link associated with them. So I decided to go for the more difficult route: grouping ngrams together and matching those ngrams with possible books.
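
For comparison, the link-based approach that I did not pursue could look roughly like the sketch below. The regular expression and the helper name count_book_links are assumptions made for illustration. Since the API returns post text as HTML, the sketch unescapes HTML entities before searching for links.

```python
import html
import re
from collections import Counter

# Amazon or Goodreads URLs, up to the first whitespace, quote or angle bracket.
LINK_RE = re.compile(
    r"https?://(?:www\.)?(?:amazon\.[a-z.]+|goodreads\.com)/[^\s\"'<>]*",
    re.IGNORECASE,
)


def count_book_links(posts):
    """Count in how many posts each Amazon/Goodreads link appears."""
    counts = Counter()
    for post in posts:
        # A set counts each link at most once per post.
        counts.update(set(LINK_RE.findall(html.unescape(post))))
    return counts
```

Grouping by raw URL is only approximate, since the same book can be linked through different URLs.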

So, after eliminating special characters from the text, I grouped together bigrams, trigrams, 4-grams and 5-grams and counted their occurrences. This is an example of how to count bigrams:

import re
from collections import Counter
import operator
# clean special characters
text_clean = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for t in text for k in t.split("\n")]
# count occurrences of bigrams in different posts
countsb = Counter()
words = re.compile(r'\w+')
for t in text_clean:
    w = words.findall(t.lower())
    countsb.update(zip(w, w[1:]))
# sort results
bigrams = sorted(
    countsb.items(),
    key=operator.itemgetter(1),
    reverse=True
)
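
The same idea extends to trigrams, 4-grams and 5-grams. The sketch below generalizes the bigram counter above to any n; the function name count_ngrams is hypothetical, and it assumes the input is a list of already cleaned lines such as text_clean.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"\w+")


def count_ngrams(lines, n):
    """Count n-grams of consecutive words, line by line."""
    counts = Counter()
    for line in lines:
        w = WORD_RE.findall(line.lower())
        # Zipping n shifted copies of the word list yields consecutive n-tuples.
        counts.update(zip(*(w[i:] for i in range(n))))
    return counts
```

With this helper, count_ngrams(text_clean, 2) should give the same counts as the bigram example, and looping n over range(2, 6) covers all four sizes.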

In text applications, one of the first things to do when processing the data is usually to eliminate stopwords, i.e. the most common words in a language, such as articles and prepositions. We have not eliminated stopwords from our text yet, so the most frequent ngrams are composed almost exclusively of stopwords. In fact, here is a sample output of the top 10 most common bigrams in our data:

[((u'of', u'the'), 147),
((u'in', u'the'), 76),
((u'it', u's'), 67),
((u'this', u'book'), 52),
((u'this', u'year'), 49),
((u'if', u'you'), 45),
((u'and', u'the'), 44),
((u'i', u've'), 44),
((u'to', u'the'), 40),
((u'i', u'read'), 37)]

Having stopwords in our data is fine: most book titles contain stopwords, so we want to keep them. However, to avoid looking up too many combinations, we eliminate the ngrams that are composed solely of stopwords and keep all the others.
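
The filter just described can be sketched as follows. The stopword list here is a tiny illustrative one and an assumption on my part; a real run would use a fuller list. The function names are hypothetical as well. The input is a list of (ngram, count) pairs like the sorted bigrams shown earlier.

```python
# Small illustrative stopword list (an assumption, far from complete).
STOPWORDS = {
    "a", "an", "and", "the", "of", "in", "to", "it", "s",
    "i", "ve", "if", "you", "this", "is",
}


def keep_ngram(ngram, stopwords=STOPWORDS):
    """True if at least one word in the n-gram is not a stopword."""
    return any(word not in stopwords for word in ngram)


def filter_ngrams(counted, stopwords=STOPWORDS):
    """Drop (ngram, count) pairs whose n-gram consists solely of stopwords."""
    return [(ng, c) for ng, c in counted if keep_ngram(ng, stopwords)]
```

An ngram such as ('of', 'the') is dropped, while ('this', 'book') survives because 'book' is not a stopword.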

Now that we have a list of possible ngrams we will use the Goodreads API to check if these ngrams correspond to book titles. In case multiple matches are available for a search I decided to take the most recent publication as the result of the search. This is assuming that the most recent book would be the most likely match for this context. This is of course an assumption that might lead to errors.

The Goodreads API is a bit less straightforward to use than the Hacker News one, as it returns results in XML, which is less convenient to work with than JSON. In this analysis I used the xmltodict Python package to convert the XML to JSON. The API method we need is search.books, which allows searching for books by title, author or ISBN. Here is a code sample that gets the book title and author for the most recently published search result:

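import json
import numpy as np
import requests
import xmltodict
# grkey holds your Goodreads API key
res = requests.get("https://www.goodreads.com/search/index.xml", params={"key": grkey, "q": 'some book title'})
xpars = xmltodict.parse(res.text)
# round-trip through JSON to turn the parsed XML into plain dicts and lists
json1 = json.dumps(xpars)
d = json.loads(json1)
lst = d['GoodreadsResponse']['search']['results']['work']
# publication year of every search result
ys = [int(w['original_publication_year']['#text']) for w in lst]
# pick the most recently published result
best = lst[int(np.argmax(ys))]
title = best['best_book']['title']
author = best['best_book']['author']['name']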

This method allows us to associate ngrams with possible books. We then check the list of books returned by the Goodreads API for all the ngrams against the full text data. Before performing the actual check we shorten the book names, eliminating punctuation (particularly semicolons) and subtitles. We only consider the main title, on the assumption that most of the time only this part of the title would be used (some of the full titles in the list are actually really long!). Ranking the results by number of occurrences in the thread, we get this list:
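
As a rough sketch of this title check (not the exact code I used): the helper names main_title and count_mentions are hypothetical, and cutting the title at the first colon or semicolon is an assumption about where subtitles start. Posts are cleaned in the same way as the thread text before counting.

```python
import re


def main_title(full_title):
    """Cut the subtitle and punctuation, returning a lowercase main title."""
    head = re.split(r"[:;]", full_title, maxsplit=1)[0]
    return re.sub(r"[^a-zA-Z0-9]+", " ", head).strip().lower()


def count_mentions(full_title, posts):
    """Count whole-word occurrences of the main title across all posts."""
    needle = main_title(full_title)
    if not needle:
        return 0
    pattern = re.compile(r"\b" + re.escape(needle) + r"\b")
    total = 0
    for post in posts:
        # Clean each post the same way as the title before searching.
        haystack = re.sub(r"[^a-zA-Z0-9]+", " ", post).lower()
        total += len(pattern.findall(haystack))
    return total
```

The word boundaries keep a title like "Bad Blood" from matching inside "bad bloody".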

Books with more than 3 counts in the thread

So Bad Blood looks to be the most recommended book in the thread. Checking the other results, most of them seem to make sense and match the thread, including the counts. The only big mistake I could spot in the list is at position number 2, where the book Magi was identified instead of The Magicians by Lev Grossman. The latter is indeed cited 7 times in the text. This error is caused by the assumption we made of taking the most recent book out of the results from the Goodreads API. As for books that were in the original data and do not appear in the list, I could not spot any obvious one besides The Three-Body Problem. This book and others from the same trilogy are cited several times in the text, but because they are referred to by different names or with different punctuation they were not picked up by this method. A way to solve this could be to use fuzzy matching in this step.
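
One possible way to do such fuzzy matching, using difflib from the standard library, is sketched below. This is an illustration of the idea and not something I ran on the thread: the helper names, the 0.85 similarity threshold and the choice of sliding word windows one word shorter or longer than the title are all assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and replace every run of non-alphanumerics with a space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def mentions_title(post, title, threshold=0.85):
    """True if some run of words in the post is close enough to the title."""
    target = normalize(title)
    words = normalize(post).split()
    n = len(target.split())
    # Also try windows one word shorter or longer, e.g. for a dropped article.
    for size in (n - 1, n, n + 1):
        if size < 1:
            continue
        for start in range(len(words) - size + 1):
            window = " ".join(words[start:start + size])
            if SequenceMatcher(None, window, target).ratio() >= threshold:
                return True
    return False


def fuzzy_count(posts, title, threshold=0.85):
    """Number of posts that mention the title approximately."""
    return sum(mentions_title(p, title, threshold) for p in posts)
```

Counting posts rather than individual windows avoids counting the same mention twice when neighbouring windows both pass the threshold. A lower threshold catches more variants but also more false positives.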

In conclusion, in this article I have shown how I extracted data from Hacker News, parsed it to extract book titles, checked them using the Goodreads API and matched the final list with the original text. The task proved to be quite complex, as it required several assumptions and dealing with two different APIs. Moreover, the final list still contained some incorrect entries.

Nonetheless I managed to get a good final result. This is the list of the top 20 books recommended by Hacker News:

  • Bad Blood: Secrets and Lies in a Silicon Valley Startup by John Carreyrou
  • Why We Sleep: Unlocking the Power of Sleep by Matthew Walker
  • The Magicians by Lev Grossman
  • Shoe Dog: A Memoir by the Creator of NIKE by Phil Knight
  • How to Change Your Mind by Michael Pollan
  • Factfulness: Ten Reasons We’re Wrong About the World by Hans Rosling
  • Man’s Search for Meaning by Viktor E. Frankl
  • Deep Work by Cal Newport
  • Homo Deus: A Brief History of Tomorrow by Yuval Noah Harari
  • The Phoenix Project by D.M. Cain
  • 21 Lessons for the 21st Century by Yuval Noah Harari
  • Thinking in Systems: A Primer by Tia T. Farmer
  • Leonardo da Vinci by Walter Isaacson
  • Never Split the Difference by Chris Voss
  • Extreme Ownership by Jocko Willink
  • Linear Algebra by Jim Hefferon
  • 12 Rules for Life: An Antidote to Chaos by Jordan B. Peterson
  • Prisoners of Geography by Tim Marshall
  • Skin in the Game by Nassim Nicholas Taleb
  • Atomic Habits by James Clear

The source can be viewed on GitHub. Comments or criticism of any type would be really appreciated.