Web scraping is the extraction of data from web pages. But most web pages aren’t designed to accommodate automated data extraction; instead, they’re designed to be easily read by humans, with colors and fonts and pictures and all sorts of junk. This makes web scraping tricky. There are two predominant techniques for web scraping: HTML parsing and browser automation.
Before going on, I must confess a shameful secret: I don’t understand HTML very well. It’s just too ugly to get me interested. Every so often I’ll try to sit down and read about HTML, and I usually get bored and quit right around the time they get to unordered lists (<ul>). Why couldn’t they just use S-expressions? Do the brackets and explicit close tags actually add anything? Whatever, it doesn’t matter. The bottom line is that I hate dealing with HTML and I’d prefer to avoid it if I possibly can.
So I’m left with browser automation if I want to scrape. But for simple scraping tasks, especially one-off tasks, most browser automation tools seem like overkill. You have to download the thing, then figure out how to use it, wade through documentation, learn the relevant APIs, blah blah blah. Maybe it’s another personal failing, but I hate doing all that stuff, and again I would prefer to avoid it if I possibly can.
So what am I to do when I want to scrape? As usual, the answer is easy: Emacs. And why not? In most cases the data I want to scrape is text, and Emacs is an all-purpose text-handling tool, so really, what else would I use?
As an example, consider Hurriyet Daily News [1], an English-language Turkish news site. Comparing it to American news outlets in terms of journalistic quality, I would say it’s like CNN: not a real journalism organization like the New York Times, but also not a propaganda dissemination machine like Fox News. If you want to keep up with Turkish news and you don’t speak Turkish, it’s not a bad option.
Here’s what hurriyetdailynews.com looks like:
The centerpiece of the landing page is a box containing a half dozen or so headlines with accompanying images. These headlines scroll through one at a time. Suppose, for whatever reason, that I’m interested in tracking the headlines that show up in that box. Here’s how I would do it in Emacs.
First, pop the site open in the Emacs browser eww [2]. It should look like this (and if it doesn’t, then the website has changed and this post is out of date):
Next, open an empty buffer. I’ll call mine asdf, but you can call yours something else if you want. Our ultimate goal is to copy each of those headlines into the empty buffer, at which point we can do whatever with them.
To do this, we’ll use a keyboard macro [3]. Steve Yegge once said “I believe I can state without the slightest hint of exaggeration that Emacs keyboard macros are the coolest thing in the entire universe”, and he’s not wrong. Keyboard macros make boring, repetitive tasks quick and even fun (I’ll sometimes spend more time trying to craft the perfect macro than it would have taken me to do it manually). The way keyboard macros work is you start recording, then hit some keys, then stop recording. When you play back the macro, the keys you recorded will be entered again. The meaning of the keys is not recorded, just the keys themselves, so be careful!
Now, pay attention here, because the details are important (except for the details of the headlines, which don’t matter at all). For reference, the first few lines of the headline section look something like this:
- Move point (cursor) to the beginning of the line in the eww buffer that says Home Page.
- Start recording a keyboard macro. The default binding for this is either C-x ( or F3.
- Hit TAB. This will jump down to the first topic keyword, which in this case is WORLD. (This is a link of some kind.)
- For whatever reason, the headlines can’t be reached by TAB-jumping, so move the cursor down three lines (C-n three times, with the default bindings).
- The cursor should now be at the beginning of the line that says “Saudi consul…”. If it isn’t, move it there with C-a. Now highlight the whole line. This can be done by setting the mark and moving to the end of the line (C-SPC C-e), but it can be done other ways too.
- Copy the highlighted text, or kill it or whatever the weird Emacs terminology is. I use C-k for this, but that isn’t the default binding, which I can never remember.
- Jump over to the empty buffer. The default binding for this is C-x b, which is an unbelievably shitty way to do something as common as changing buffers. Anyway, hit that and then enter the name of the empty buffer (asdf).
- The cursor should be at the beginning of the buffer, which should have nothing in it. Paste in (or yank or whatever) the copied text. I use C-v for this, which again is not the standard binding.
- Enter a newline (RET). The cursor should be at the beginning of an empty line at the end of the buffer.
- Jump back to the eww buffer. The cursor should be at the end of the “Saudi consul…” line.
- Stop recording the macro. The default binding for this is either C-x ) or F4.
At this point the previously empty buffer should have the first headline, with an inactive cursor at the beginning of an empty line below it, and the active cursor should be at the end of the first headline in the eww buffer. Good? Okay, now execute the macro with C-x e. If it worked, the situation should be the same, but with the second headline copied into the other buffer, and the cursor at the end of the second headline in the eww buffer. Neat, right? If it didn’t work, something got screwed up, and there’s no telling what happened. Undo whatever it did and try again.
There are a few more headlines, so execute the macro as many times as needed to get all of them. For convenience, after hitting C-x e the first time, the macro can be replayed again by just hitting e.
The copy buffer should look like this:
And the headlines are scraped! Obviously this was a somewhat labored explanation, but once you get the hang of keyboard macros, this kind of thing can be done very quickly.
Okay, but there are new headlines every day; what if we want to scrape them regularly? It would be annoying to have to fiddle with keyboard macros every time.
Fortunately, macros can be named and saved. Go to your favorite config file or whatever and execute the following [4]:
It should spit out something like this:
Now, if you wanted to leave it at that, you could, and you would, as far as anyone could tell, have a function that did exactly what the macro did. You could call it, bind it to a key, whatever. However, with a macro as complex as this one, it’s usually better just to write a real function. This can be done without too much trouble, as the bulk of the work is just figuring out what commands the key presses are bound to, and then putting those in the function. It doesn’t have to be fancy.
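To make that concrete, here’s a rough sketch of a named keyboard macro written out by hand rather than generated. Everything specific in it is an assumption: the name my/hurriyet-next-headline and the F5 binding are made up, the keys are the stock Emacs bindings (M-w to copy, C-y to paste) rather than my customized ones, and the buffers are assumed to be called asdf and *eww*.

```elisp
;; Sketch only. `my/hurriyet-next-headline' is a made-up name, and the keys
;; are the stock bindings (M-w copy, C-y yank), not my customized ones.
;; The buffer names asdf and *eww* are assumptions too.
(fset 'my/hurriyet-next-headline
      (kbd "TAB C-n C-n C-n C-a C-SPC C-e M-w C-x b asdf RET C-y RET C-x b *eww* RET"))

;; Optional: put it on a key (F5 is an arbitrary choice).
(global-set-key (kbd "<f5>") 'my/hurriyet-next-headline)
```

A symbol whose function definition is a key sequence counts as a command, so M-x my/hurriyet-next-headline replays the keys just like C-x e would.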
Here’s a function for scraping Hurriyet based on that macro. It grabs the headlines and then dumps them into a file called hurriyet-headlines along with a timestamp. Some example output:
And the code itself:
To be clear, this is NOT elegant Elisp, and it definitely does stuff that would be inappropriate in a distributed package. It’s also brittle, as scrapers tend to be – if the Hurriyet website changed its format, I would have to dump it in the trash and start over. Nonetheless, it works fine for personal use.
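For a rough idea of what “figure out what the keys are bound to and put those in a function” looks like, here’s a minimal sketch. It is not the function described above: the my/ names are hypothetical, the timestamp format is a guess, and the navigation (find the Home Page line, jump to the next link, go down three lines) is just the macro steps transcribed, unchecked against the live site.

```elisp
;; A sketch only: the my/ names are hypothetical, and the navigation
;; (a "Home Page" line, link-jumping, three lines down) is assumed from
;; the keyboard macro steps, not checked against the live site.
(require 'shr)  ; provides `shr-next-link', which TAB runs in eww buffers

(defun my/grab-headlines (count)
  "Return a list of COUNT headline lines from the current eww buffer."
  (save-excursion
    (goto-char (point-min))
    (search-forward "Home Page")
    (beginning-of-line)
    (let (headlines)
      (dotimes (_ count)
        (shr-next-link)            ; what TAB did in the macro
        (forward-line 3)           ; three lines down, landing at line start
        (push (buffer-substring-no-properties
               (line-beginning-position) (line-end-position))
              headlines)
        (end-of-line))             ; leave point where the macro left it
      (nreverse headlines))))

(defun my/save-headlines (headlines file)
  "Append a timestamp and HEADLINES, one per line, to FILE."
  (write-region
   (concat (format-time-string "%Y-%m-%d %H:%M") "\n"
           (mapconcat #'identity headlines "\n")
           "\n\n")
   nil file t))                    ; the final t means append
```

With the site open in eww, something like (my/save-headlines (my/grab-headlines 6) "~/hurriyet-headlines") would do one scrape; the count of 6 and the path are placeholders.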
[1] hürriyet is a Turkish word derived from the Arabic حرية meaning freedom.
[2] eww is short for Emacs Web Wowser. Really.
[3] Note that keyboard macros are completely unrelated to Lisp macros.
[4] kmacro-name-last-macro can be used in place of name-last-kbd-macro. Its output is a little different:
This one uses numerical key codes, which I find hard to decipher (you can see 97 115 100 102 for asdf, for instance).