A Practical Introduction to Web Scraping in Python

By Real Python

Although regular expressions are great for pattern matching in general, sometimes it’s easier to use an HTML parser that’s explicitly designed for parsing out HTML pages. There are many Python tools written for this purpose, but the Beautiful Soup library is a good one to start with.

To install Beautiful Soup, you can run the following in your terminal:

$ python3 -m pip install beautifulsoup4

Run pip show to see the details of the package you just installed:

$ python3 -m pip show beautifulsoup4
Name: beautifulsoup4
Version: 4.9.1
Summary: Screen-scraping library
Home-page: http://www.crummy.com/software/BeautifulSoup/bs4/
Author: Leonard Richardson
Author-email: leonardr@segfault.org
License: MIT
Location: c:\realpython\venv\lib\site-packages
Requires:
Required-by:

In particular, notice that the latest version at the time of writing was 4.9.1.

Type the following program into a new editor window:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

This program does three things:

  1. Opens the URL http://olympus.realpython.org/profiles/dionysus using urlopen() from the urllib.request module

  2. Reads the HTML from the page as a string and assigns it to the html variable

  3. Creates a BeautifulSoup object and assigns it to the soup variable

The BeautifulSoup object assigned to soup is created with two arguments. The first argument is the HTML to be parsed, and the second argument, the string "html.parser", tells the object which parser to use behind the scenes. "html.parser" represents Python’s built-in HTML parser.
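
Python’s built-in parser works fine for this tutorial, but Beautiful Soup also supports third-party parsers. For example, if you’ve installed the lxml package (python3 -m pip install lxml), then you can select it by name instead:

soup = BeautifulSoup(html, "lxml")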

Save and run the above program. When it’s finished running, you can use the soup variable in the interactive window to parse the content of html in various ways.

For example, BeautifulSoup objects have a .get_text() method that can be used to extract all the text from the document and automatically remove any HTML tags.

Type the following code into IDLE’s interactive window:

>>> print(soup.get_text())

Profile: Dionysus

Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard

Favorite Color: Wine

There are a lot of blank lines in this output. These are the result of newline characters in the HTML document’s text. You can remove them with the string .replace() method if you need to.
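
For example, here’s one quick way to thin them out, collapsing each pair of newline characters into one:

>>> text = soup.get_text()
>>> cleaned = text.replace("\n\n", "\n")  # each call collapses one level of blank lines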

Often, you need to get only specific text from an HTML document. Using Beautiful Soup first to extract the text and then using the .find() string method is sometimes easier than working with regular expressions.
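
For instance, here’s a small sketch that uses .find() on the extracted text to pull out the hometown line:

>>> text = soup.get_text()
>>> start = text.find("Hometown:")  # index where the label starts
>>> end = text.find("\n", start)    # index of the end of that line
>>> text[start:end]
'Hometown: Mount Olympus'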

However, sometimes the HTML tags themselves are the elements that point out the data you want to retrieve. For instance, perhaps you want to retrieve the URLs for all the images on the page. These URLs are contained in the src attribute of <img> HTML tags.

In this case, you can use find_all() to return a list of all instances of that particular tag:

>>> soup.find_all("img")
[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]

This returns a list of all <img> tags in the HTML document. The objects in the list look like they might be strings representing the tags, but they’re actually instances of the Tag object provided by Beautiful Soup. Tag objects provide a simple interface for working with the information they contain.

Let’s explore this a little by first unpacking the Tag objects from the list:

>>> image1, image2 = soup.find_all("img")

Each Tag object has a .name property that returns a string containing the HTML tag type:
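
>>> image1.name
'img'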

You can access the HTML attributes of the Tag object by putting their name between square brackets, just as if the attributes were keys in a dictionary.

For example, the <img src="/static/dionysus.jpg"/> tag has a single attribute, src, with the value "/static/dionysus.jpg". Likewise, an HTML tag such as the link <a href="https://realpython.com" target="_blank"> has two attributes, href and target.

To get the source of the images in the Dionysus profile page, you access the src attribute using the dictionary notation mentioned above:

>>> image1["src"]
'/static/dionysus.jpg'
>>> image2["src"]
'/static/grapes.png'

Certain tags in HTML documents can be accessed by properties of the Tag object. For example, to get the <title> tag in a document, you can use the .title property:

>>> soup.title
<title>Profile: Dionysus</title>

If you look at the source of the Dionysus profile by navigating to the profile page, right-clicking on the page, and selecting View page source, then you’ll notice that the <title> tag as written in the document looks like this:

<title >Profile: Dionysus</title/>

Beautiful Soup automatically cleans up the tags for you by removing the extra space in the opening tag and the extraneous forward slash (/) in the closing tag.

You can also retrieve just the string between the title tags with the .string property of the Tag object:

>>> soup.title.string
'Profile: Dionysus'

One of the more useful features of Beautiful Soup is the ability to search for specific kinds of tags whose attributes match certain values. For example, if you want to find all the <img> tags that have a src attribute equal to the value /static/dionysus.jpg, then you can provide the following additional argument to .find_all():

>>> soup.find_all("img", src="/static/dionysus.jpg")
[<img src="/static/dionysus.jpg"/>]

This example is somewhat arbitrary, and the usefulness of the technique may not be apparent yet. If you spend some time browsing various websites and viewing their page sources, then you’ll notice that many websites have extremely complicated HTML structures.

When scraping data from websites with Python, you’re often interested in particular parts of the page. By spending some time looking through the HTML document, you can identify tags with unique attributes that you can use to extract the data you need.

Then, instead of relying on complicated regular expressions or using .find() to search through the document, you can directly access the particular tag you’re interested in and extract the data you need.
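
For instance, if a page wrapped the data you wanted in a tag with a unique id attribute, then you could jump straight to it. The tag name and id in this sketch are invented for illustration:

tag = soup.find("div", id="user-profile")  # hypothetical tag name and id
if tag is not None:
    print(tag.get_text())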

In some cases, you may find that Beautiful Soup doesn’t offer the functionality you need. The lxml library is somewhat trickier to get started with but offers far more flexibility than Beautiful Soup for parsing HTML documents. You may want to check it out once you’re comfortable using Beautiful Soup.
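
As a taste of what that looks like, here’s a minimal sketch of the earlier image-URL extraction using lxml directly, assuming you’ve installed it with python3 -m pip install lxml:

from urllib.request import urlopen

from lxml import html

url = "http://olympus.realpython.org/profiles/dionysus"
page_source = urlopen(url).read().decode("utf-8")

tree = html.fromstring(page_source)    # parse the HTML into an element tree
image_urls = tree.xpath("//img/@src")  # XPath: the src attribute of every <img> tag
print(image_urls)                      # ['/static/dionysus.jpg', '/static/grapes.png']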

Note: HTML parsers like Beautiful Soup can save you a lot of time and effort when it comes to locating specific data in web pages. However, sometimes HTML is so poorly written and disorganized that even a sophisticated parser like Beautiful Soup can’t interpret the HTML tags properly.

In this case, you’re often left with using .find() and regular expression techniques to try to parse out the information you need.
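
As a rough sketch of that fallback, you could pull the image URLs straight out of the raw HTML with a regular expression. This approach is fragile and shown only for illustration; html here is the page source string downloaded earlier:

import re

# Match the value of every src="..." attribute in the raw HTML
image_urls = re.findall(r'src="([^"]+)"', html)
print(image_urls)  # ['/static/dionysus.jpg', '/static/grapes.png']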

Beautiful Soup is great for scraping data from a website’s HTML, but it doesn’t provide any way to work with HTML forms. For example, if you need to submit a search query on a website and then scrape the results, then Beautiful Soup alone won’t get you very far.

Check your understanding with the following exercise.

Write a program that grabs the full HTML from the page at the URL http://olympus.realpython.org/profiles.

Using Beautiful Soup, print out a list of all the links on the page by looking for HTML tags with the name a and retrieving the value taken on by the href attribute of each tag.

The final output should look like this:

http://olympus.realpython.org/profiles/aphrodite
http://olympus.realpython.org/profiles/poseidon
http://olympus.realpython.org/profiles/dionysus

When you’re done, you can compare your program with the solution below:

First, import the urlopen function from the urllib.request module and the BeautifulSoup class from the bs4 package:

from urllib.request import urlopen
from bs4 import BeautifulSoup

Each link URL on the /profiles page is a relative URL, so create a base_url variable with the base URL of the website:

base_url = "http://olympus.realpython.org"

You can build a full URL by concatenating base_url with a relative URL.
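
For example:

>>> base_url + "/profiles/aphrodite"
'http://olympus.realpython.org/profiles/aphrodite'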

Now open the /profiles page with urlopen() and use .read() to get the HTML source:

html_page = urlopen(base_url + "/profiles")
html_text = html_page.read().decode("utf-8")

With the HTML source downloaded and decoded, you can create a new BeautifulSoup object to parse the HTML:

soup = BeautifulSoup(html_text, "html.parser")

soup.find_all("a") returns a list of all links in the HTML source. You can loop over this list to print out all the links on the webpage:

for link in soup.find_all("a"):
    link_url = base_url + link["href"]
    print(link_url)

The relative URL for each link can be accessed through the "href" subscript. Concatenate this value with base_url to create the full link_url.

When you’re ready, you can move on to the next section.