Django + Elasticsearch. Searching for awesome TED Talks


In this article we’re going to cover the basics of Elasticsearch: what it is, how to run it, and how to communicate with it.

Then, after we feel comfortable with Elasticsearch, we will start developing a simple web application with Django that will use the power of Elasticsearch.

Running Elasticsearch

Before diving into the concepts of Elasticsearch we should run it first.

When you start reading about Elasticsearch on the internet, you will definitely stumble upon the term Elastic Stack. Elastic Stack is an ecosystem of different tools: Kibana, Logstash, Beats and Elasticsearch itself. In this article we’re going to use only Elasticsearch.

The simplest way to run Elasticsearch is to use Docker. All you need is to pull an image from DockerHub, run the container and it just works:

$ docker pull elasticsearch:6.6.1
$ docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" elasticsearch:6.6.1

If you did everything correctly, you should be able to send an HTTP request to Elasticsearch:

$ http GET localhost:9200
HTTP/1.1 200 OK
content-encoding: gzip
content-length: 298
content-type: application/json; charset=UTF-8

{
    "cluster_name": "docker-cluster",
    "cluster_uuid": "9aZz9ThdRUSvhgWJn5GaOQ",
    "name": "DsBD6kQ",
    "tagline": "You Know, for Search",
    "version": {
        "build_date": "2019-02-13T17:10:04.160291Z",
        "build_flavor": "default",
        "build_hash": "1fd8f69",
        "build_snapshot": false,
        "build_type": "tar",
        "lucene_version": "7.6.0",
        "minimum_index_compatibility_version": "5.0.0",
        "minimum_wire_compatibility_version": "5.6.0",
        "number": "6.6.1"
    }
}

Here and later in the article I use httpie to send HTTP requests. You can use httpie, curl or any other client; it’s up to you.
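If you would rather stay in Python, the same check can be sketched with the standard library. This is a minimal sketch and not part of the original setup: the function names parse_cluster_info and fetch_cluster_info are made up for illustration, and the default URL assumes Elasticsearch is listening on localhost:9200 as in the Docker command above.

```python
import json
from urllib.request import urlopen


def parse_cluster_info(raw_body):
    """Pull the cluster name and version number out of the root response."""
    info = json.loads(raw_body)
    return info["cluster_name"], info["version"]["number"]


def fetch_cluster_info(base_url="http://localhost:9200", timeout=5):
    """Send GET / to Elasticsearch and summarize the JSON it returns."""
    with urlopen(base_url, timeout=timeout) as response:
        return parse_cluster_info(response.read())
```

With the container from the previous step running, calling fetch_cluster_info() should return the cluster name and version number seen in the response above.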

Elasticsearch Essentials

Elasticsearch is a search engine. Here are some reasons to consider using it for searching data instead of a regular SQL database like PostgreSQL or MySQL:

  1. it works really fast;
  2. it scales;
  3. it has a lot of cool companion tools, like Kibana for visualizing data and Logstash for collecting and transforming data.

Communication with Elasticsearch happens through a RESTful API, so all operations are available over HTTP. If you’ve used RESTful APIs before, the communication should feel familiar.

Elasticsearch is a NoSQL database. It stores data as JSON documents and doesn’t impose a strict structure on your data, which means that you can put anything you want in your JSON document.

These JSON documents are organized within types and indexes. Each document belongs to a type. The type itself lives in the index. Each index can contain only a single type.

We can draw a parallel between how Elasticsearch stores data and how any relational database stores data:

Relational DB → Databases → Tables → Rows → Columns

Elasticsearch → Indexes → Types → Documents → Fields

The only difference is that in a relational database each database can have many tables, but in Elasticsearch each index can have only one type.

In previous Elasticsearch versions an index could have more than one type, but this is now deprecated. If you want to learn why that is, you can read about it here.

Elasticsearch Requests

Time to get our hands dirty and make our first request to Elasticsearch:

PUT /tweets/doc/1
{
  "author": "Elon Musk",
  "text": "This might be my finest work",
  "likes": 43000
}

{
  "_index" : "tweets",
  "_type" : "doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

In this one request we actually did a lot of things:

  1. Created an index tweets
  2. Created a type doc
  3. Indexed a new document which contains information about a tweet
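For readers who prefer to script these calls, here is a rough Python equivalent of the request above using only the standard library. The helper name index_request and the base URL are assumptions made for illustration. The function only builds the request, so nothing is sent until you pass it to urlopen.

```python
import json
from urllib.request import Request


def index_request(index, doc_type, doc_id, document,
                  base_url="http://localhost:9200"):
    """Build a PUT /<index>/<type>/<id> request carrying a JSON document."""
    url = "{}/{}/{}/{}".format(base_url, index, doc_type, doc_id)
    body = json.dumps(document).encode("utf-8")
    return Request(
        url,
        data=body,
        method="PUT",
        # Elasticsearch 6 rejects request bodies without a JSON content type.
        headers={"Content-Type": "application/json"},
    )


tweet = {
    "author": "Elon Musk",
    "text": "This might be my finest work",
    "likes": 43000,
}
request = index_request("tweets", "doc", 1, tweet)
```

To actually index the document, pass the request to urllib.request.urlopen while Elasticsearch is running.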

Now let’s index a couple of other documents:

PUT /tweets/doc/2
{
  "author": "Elon Musk",
  "text": "Thank you!",
  "likes": 42000
}

PUT /tweets/doc/3
{
  "author": "Neil deGrasse Tyson",
  "text": "The rise of flat-Earthers in the United States is evidence of two things: 1) the protection of speech, and 2) the failure of our educational system.",
  "likes": 110000
}

PUT /tweets/doc/4
{
  "author": "Elon Musk",
  "text": "@apirobotme your blog is the best blog about web development I have ever read. Thank you!",
  "likes": 1000000
}

To retrieve an individual document, we can make a GET request to its address, which consists of its index, type and id:

GET /tweets/doc/2

{
  "_index" : "tweets",
  "_type" : "doc",
  "_id" : "2",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "author" : "Elon Musk",
    "text" : "Thank you!",
    "likes" : 42000
  }
}

A simple search for all documents looks like this:

GET /tweets/doc/_search

{
  "took" : 19,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "tweets",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "author" : "Elon Musk",
          "text" : "Thank you!",
          "likes" : 42000
        }
      },
      {
        "_index" : "tweets",
        "_type" : "doc",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "author" : "Elon Musk",
          "text" : "@apirobotme your blog is the best blog about web development I have ever read. Thank you!",
          "likes" : 1000000
        }
      },
      {
        "_index" : "tweets",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "author" : "Elon Musk",
          "text" : "This might be my finest work",
          "likes" : 43000
        }
      },
      {
        "_index" : "tweets",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "author" : "Neil deGrasse Tyson",
          "text" : "The rise of flat-Earthers in the United States is evidence of two things: 1) the protection of speech, and 2) the failure of our educational system.",
          "likes" : 110000
        }
      }
    ]
  }
}

But the “search” we just did is not very useful because it simply returned a list of all documents. Let’s make a real search and try to find documents with a specific author:

GET /tweets/doc/_search
{
  "query": {
    "match": {
      "author": "elon"
    }
  }
}

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "tweets",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "author" : "Elon Musk",
          "text" : "This might be my finest work",
          "likes" : 43000
        }
      },
      {
        "_index" : "tweets",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 0.18232156,
        "_source" : {
          "author" : "Elon Musk",
          "text" : "Thank you!",
          "likes" : 42000
        }
      },
      {
        "_index" : "tweets",
        "_type" : "doc",
        "_id" : "4",
        "_score" : 0.18232156,
        "_source" : {
          "author" : "Elon Musk",
          "text" : "@apirobotme your blog is the best blog about web development I have ever read. Thank you!",
          "likes" : 1000000
        }
      }
    ]
  }
}
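Every search response has the same shape: the matching documents live under hits.hits, each with its _id, _score and _source. As a small sketch (the function name summarize_hits is made up for illustration), this is how you might flatten such a response in Python once it has been parsed from JSON:

```python
def summarize_hits(response):
    """Return (total, [(id, score, author), ...]) from a search response."""
    hits = response["hits"]
    rows = [
        (hit["_id"], hit["_score"], hit["_source"]["author"])
        for hit in hits["hits"]
    ]
    # In the 6.x responses shown here, hits.total is a plain number.
    return hits["total"], rows


# A trimmed copy of the response to the author search above.
sample = {
    "hits": {
        "total": 3,
        "max_score": 0.2876821,
        "hits": [
            {"_id": "1", "_score": 0.2876821, "_source": {"author": "Elon Musk"}},
            {"_id": "2", "_score": 0.18232156, "_source": {"author": "Elon Musk"}},
            {"_id": "4", "_score": 0.18232156, "_source": {"author": "Elon Musk"}},
        ],
    }
}

total, rows = summarize_hits(sample)
```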

We can go further and make it more complicated. For example, let’s do a full-text search on the text field and also filter tweets based on the number of likes they have:

GET /tweets/doc/_search
{
  "query" : {
    "bool": {
      "must": {
        "match": {
          "text" : "thank"
        }
      },
      "filter": {
        "range" : {
          "likes" : {
            "gt" : 50000
          }
        }
      }
    }
  }
}

{
  "took" : 90,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.1383129,
    "hits" : [
      {
        "_index" : "tweets",
        "_type" : "doc",
        "_id" : "4",
        "_score" : 0.1383129,
        "_source" : {
          "author" : "Elon Musk",
          "text" : "@apirobotme your blog is the best blog about web development I have ever read. Thank you!",
          "likes" : 1000000
        }
      }
    ]
  }
}
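When queries start to nest like this, it can help to build the request body in code rather than by hand. Here is a minimal sketch in Python that produces the same body as the request above. The function name text_search_with_min_likes is made up for illustration.

```python
import json


def text_search_with_min_likes(text, min_likes):
    """Build a bool query: full-text match on `text`, filtered by `likes`."""
    return {
        "query": {
            "bool": {
                # `must` clauses contribute to the relevance score.
                "must": {"match": {"text": text}},
                # `filter` clauses only include or exclude documents.
                "filter": {"range": {"likes": {"gt": min_likes}}},
            }
        }
    }


body = text_search_with_min_likes("thank", 50000)
print(json.dumps(body, indent=2))
```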

Or, instead of filtering tweets by likes, we can boost each tweet’s relevance score by its number of likes so that the most liked tweets come first:

GET /tweets/doc/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "text" : "thank"
        }
      },
      "field_value_factor": {
        "field": "likes"
      }
    }
  }
}

{
  "took" : 36,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 138312.9,
    "hits" : [
      {
        "_index" : "tweets",
        "_type" : "doc",
        "_id" : "4",
        "_score" : 138312.9,
        "_source" : {
          "author" : "Elon Musk",
          "text" : "@apirobotme your blog is the best blog about web development I have ever read. Thank you!",
          "likes" : 1000000
        }
      },
      {
        "_index" : "tweets",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 11231.009,
        "_source" : {
          "author" : "Elon Musk",
          "text" : "Thank you!",
          "likes" : 42000
        }
      }
    ]
  }
}
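To see where those large scores come from: with its default settings, function_score multiplies the text relevance score by the value of the field. The sketch below reproduces that arithmetic in plain Python. It only illustrates the formula (the function name is made up) and says nothing about how Elasticsearch implements it.

```python
def boosted_score(text_score, likes, factor=1.0):
    """Mimic field_value_factor with the default 'multiply' boost mode."""
    return text_score * (factor * likes)


# Tweet 4 scored 0.1383129 on text alone in the previous query
# and has 1,000,000 likes.
score = boosted_score(0.1383129, 1000000)
print(round(score, 1))  # 138312.9
```

Because the raw like count dominates the product, a tweet with many likes can outrank one that matches the text better, which is exactly what happens with tweets 4 and 2 above.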

Django and Elasticsearch in Action

Now that you have an idea of what Elasticsearch is and how to work with it, let’s create a simple web application with Django which will use the power of Elasticsearch.

We will try to create a web application that will help us to search for awesome TED talks. The end result looks like this:

First, we will download raw data on TED talks: their names, descriptions, transcripts and so on. Then we will insert this data into a relational database and index some parts of it into Elasticsearch. After that, we will create a page which will interact with Elasticsearch and show the most relevant and popular talks based on the search query that the user types on this page.

You can find the source code of this application here.

Setting up the project

Our project will use PostgreSQL as a relational database, Elasticsearch and Django. The simplest way to set up everything is to use Docker. We have already used Docker when we ran Elasticsearch in a container. But now we will have 3 containers: one for PostgreSQL, one for Elasticsearch and one for the Django web application.

Let’s start by creating an empty project:

django-admin startproject ted

Add a requirements.txt file:

# requirements.txt
Django==2.2
psycopg2==2.8.1
django-elasticsearch-dsl==0.5.1
djangorestframework==3.9.2
pyexcel==0.5.13
pyexcel-io==0.5.17

And update the project’s settings file:

# ted/settings.py

...

INSTALLED_APPS = [
    ...
    'django_elasticsearch_dsl',
    'rest_framework',
]

...

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': os.environ.get('POSTGRES_DB'),
        'USER': os.environ.get('POSTGRES_USER'),
        'PASSWORD': os.environ.get('POSTGRES_PASSWORD'),
        'HOST': os.environ.get('POSTGRES_HOST'),
        'PORT': os.environ.get('POSTGRES_PORT'),
    }
}

...

# Elasticsearch
ELASTICSEARCH_DSL = {
    'default': {
        'hosts': 'elasticsearch:9200'
    },
}

# Django REST Framework
REST_FRAMEWORK = {
    'DEFAULT_PAGINATION_CLASS': 'rest_framework.pagination.PageNumberPagination',
    'PAGE_SIZE': 10,
}
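One thing to keep in mind is that os.environ.get silently returns None when a variable is missing, so a typo in the environment only shows up later as a confusing connection error. If you prefer to fail early, a small helper along these lines could be used in the settings file instead. Note that require_env is a hypothetical name, not something Django provides.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable or raise a clear error."""
    value = environ.get(name)
    if not value:
        raise RuntimeError("Missing required environment variable: " + name)
    return value


# Example with an explicit mapping instead of the real environment.
fake_env = {"POSTGRES_DB": "ted", "POSTGRES_HOST": "postgres"}
database_name = require_env("POSTGRES_DB", fake_env)
```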

Now we’re ready to configure Docker.

Let’s create a Dockerfile for our web application:

# docker/web/Dockerfile
FROM python:3.7-alpine

ENV PYTHONUNBUFFERED 1

RUN apk update \
  # psycopg2 dependencies
  && apk add --virtual build-deps gcc python3-dev musl-dev \
  && apk add postgresql-dev \
  && apk add postgresql-client

COPY ./requirements.txt /requirements.txt
RUN pip install -r /requirements.txt

COPY ./docker/web/start /start
RUN sed -i 's/\r//' /start
RUN chmod +x /start

WORKDIR /app

And add a start shell script, which will be executed every time a container with the web application starts:

# docker/web/start
#!/bin/sh

set -o errexit
set -o pipefail
set -o nounset

python /app/manage.py migrate
python /app/manage.py runserver 0.0.0.0:8000

As I said earlier, we will have 3 containers, and these containers should be able to communicate. For example, our web application should be able to send queries to the PostgreSQL and Elasticsearch containers. A good tool for running multi-container Docker applications is Docker Compose. With Docker Compose we use a YAML file to configure our application’s services. It looks like this:

# docker-compose.yml
version: '3'

volumes:
  pgdata: {}
  esdata: {}

services:
  web:
    build:
      context: .
      dockerfile: ./docker/web/Dockerfile
    image: ted_web
    volumes:
      - .:/app
    depends_on:
      - postgres
      - elasticsearch
    env_file: .env
    ports:
      - "8000:8000"
    command: /start

  postgres:
    image: postgres:10.5
    env_file: .env
    volumes:
      - pgdata:/var/lib/postgresql/data

  elasticsearch:
    image: elasticsearch:6.6.1
    volumes:
      - esdata:/usr/share/elasticsearch/data
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"

Before running Docker Compose we should create a .env file. In this file we will store some important environment variables:

# .env
POSTGRES_HOST=postgres
POSTGRES_PORT=5432
POSTGRES_DB=ted
POSTGRES_USER=pgadmin
POSTGRES_PASSWORD=maestro

Now we’re good to go:

$ docker-compose up

When the containers are ready, you should be able to open the main page:

The last step in setting up the project is to create a Django app:

$ django-admin startapp talks

And add it to the list of installed apps:

# ted/settings.py

...

INSTALLED_APPS = [
    ...
    'talks',
]

Inserting data into relational database

Let’s create a model:

# talks/models.py
from django.db import models


class Talk(models.Model):
    name = models.CharField(max_length=200)
    description = models.TextField()
    speaker = models.CharField(max_length=200)
    url = models.URLField()
    number_of_views = models.PositiveIntegerField()
    transcript = models.TextField()

    def __str__(self):
        return self.name

And run migrations:

$ docker-compose run --rm web python manage.py makemigrations
$ docker-compose run --rm web python manage.py migrate

Now that we have a table in the database and a model which operates on this table, we should fill the database with information about TED talks.

I will download CSV files with the data from Kaggle and process them using the pyexcel library. If you’re following along, make sure that you have copied the ted_main.csv and transcripts.csv files into your project’s root directory.

The script which processes files:

# talks/models.py
from pyexcel import get_sheet
from django.db import models


class Talk(models.Model):
    ...

    @classmethod
    def populate(cls):
        cls.objects.all().delete()

        url_to_talk_data = {}

        ted_sheet = get_sheet(
            file_name='ted_main.csv',
            name_columns_by_row=0
        )
        for name, description, speaker, url, number_of_views in zip(
            ted_sheet.column['name'],
            ted_sheet.column['description'],
            ted_sheet.column['main_speaker'],
            ted_sheet.column['url'],
            ted_sheet.column['views'],
        ):
            url = url.strip()
            url_to_talk_data[url] = {
                'name': name,
                'description': description,
                'speaker': speaker,
                'url': url,
                'number_of_views': number_of_views,
                'transcript': '',
            }

        transcripts_sheet = get_sheet(
            file_name='transcripts.csv',
            name_columns_by_row=0
        )
        for transcript, url in zip(
            transcripts_sheet.column['transcript'],
            transcripts_sheet.column['url'],
        ):
            url = url.strip()
            if url in url_to_talk_data:
                url_to_talk_data[url].update({
                    'transcript': transcript,
                })

        cls.objects.bulk_create([
            cls(
                name=talk_data['name'],
                description=talk_data['description'],
                speaker=talk_data['speaker'],
                url=talk_data['url'],
                number_of_views=talk_data['number_of_views'],
                transcript=talk_data['transcript'],
            )
            for talk_data in url_to_talk_data.values()
        ])
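The core of populate is a join of two CSV files on the url column. Stripped of Django and pyexcel, the same idea can be sketched with the standard csv module. The sample rows below are made up for illustration, and only a few of the columns are shown.

```python
import csv
import io


def merge_talks(main_csv, transcripts_csv):
    """Join talk metadata with transcripts, keyed by the stripped URL."""
    talks = {}
    for row in csv.DictReader(main_csv):
        url = row["url"].strip()
        talks[url] = {
            "name": row["name"],
            "url": url,
            "number_of_views": int(row["views"]),
            "transcript": "",  # stays empty if no transcript is found
        }
    for row in csv.DictReader(transcripts_csv):
        url = row["url"].strip()
        if url in talks:
            talks[url]["transcript"] = row["transcript"]
    return list(talks.values())


# Made-up sample data standing in for ted_main.csv and transcripts.csv.
main_csv = io.StringIO(
    "name,url,views\n"
    "Talk A,https://example.com/a ,100\n"
    "Talk B,https://example.com/b,200\n"
)
transcripts_csv = io.StringIO(
    "transcript,url\n"
    "Hello from talk A,https://example.com/a\n"
)
talks = merge_talks(main_csv, transcripts_csv)
```

Stripping the URL before using it as a key matters: the first sample row has a trailing space, and without strip() its transcript would not be found.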

Open a shell and run the populate method that we’ve just written:

$ docker-compose run --rm web python manage.py shell
>>> from talks.models import Talk
>>> Talk.populate()

If the script worked without any errors, your database should be filled with data.

Indexing data into Elasticsearch

Now let’s define an Elasticsearch index and TalkDocument class:

# talks/documents.py
from django_elasticsearch_dsl import DocType, Index

from .models import Talk

talks = Index('talks')
talks.settings(number_of_shards=1, number_of_replicas=0)


@talks.doc_type
class TalkDocument(DocType):
    class Meta:
        # The model associated with Elasticsearch document
        model = Talk

        # The fields of the model you want to be indexed
        # in Elasticsearch
        fields = (
            'name',
            'description',
            'speaker',
            'number_of_views',
            'transcript',
        )

This class basically connects our relational database with Elasticsearch. In order to create an index talks and index data from the relational database into Elasticsearch, we should run the management command that comes from the library django_elasticsearch_dsl:

$ docker-compose run --rm web python manage.py search_index --rebuild

If you make a request to Elasticsearch now, you will see the same data we have in our relational database, except for the url field because we haven’t indexed it:

GET /talks/doc/_search

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2550,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "talks",
        "_type": "doc",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "name": "Ken Robinson: Do schools kill creativity?",
          "description": "Sir Ken Robinson makes an entertaining and profoundly moving case for creating an education system that nurtures (rather than undermines) creativity.",
          "speaker": "Ken Robinson",
          "number_of_views": 47227110,
          "transcript": "Good morning. How are you?(Laughter)It's been great, hasn't it? I've been blowreakespeare having a father, do you? Do you? Because you don't think of Shakespeare beirow up, we start to ..."
        }
      },
      ...
      {
        "_index": "talks",
        "_type": "doc",
        "_id": "10",
        "_score": 1.0,
        "_source": {
          "name": "Rick Warren: A life of purpose",
          "description": "Pastor Rick Warren, author of \"The Purpose-Driven Life,\" reflects on his own crisis of purpose in the wake of his book's wild success. He explains his belief that God's intention is for each of us to use our talents and influence to do good.",
          "speaker": "Rick Warren",
          "number_of_views": 3095993,
          "transcript": "I'm often asked, \"What surprised you about the book?\" And I say, \"That I got to ..."
        }
      }
    ]
  }
}

Searching for talks

Let’s create a function that searches for relevant talks. It will analyze name, description, speaker and transcript fields and also boost talks that have the highest number of views:

# talks/search.py
from elasticsearch_dsl.query import Q, MultiMatch, SF

from .documents import TalkDocument


def get_search_query(phrase):
    query = Q(
        'function_score',
        query=MultiMatch(
            fields=['name', 'description', 'speaker', 'transcript'],
            query=phrase
        ),
        functions=[
            SF('field_value_factor', field='number_of_views')
        ]
    )
    return TalkDocument.search().query(query)


def search(phrase):
    return get_search_query(phrase).to_queryset()
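Under the hood, this query object is serialized to JSON much like the function_score request we sent by hand earlier. As a rough sketch of what ends up on the wire, here it is written as a plain dictionary with a made-up helper name. The exact output of elasticsearch_dsl may differ in details.

```python
def build_search_body(phrase):
    """Approximate the JSON body behind get_search_query(phrase)."""
    return {
        "query": {
            "function_score": {
                # Full-text search across several fields at once.
                "query": {
                    "multi_match": {
                        "fields": ["name", "description", "speaker", "transcript"],
                        "query": phrase,
                    }
                },
                # Multiply the text score by the number of views.
                "functions": [
                    {"field_value_factor": {"field": "number_of_views"}}
                ],
            }
        }
    }


body = build_search_body("elon musk")
```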

You can play around with this function from the shell:

$ docker-compose run --rm web python manage.py shell
>>> from talks.search import search
>>> talks = search('elon musk')
>>> talks
<QuerySet [<Talk: Elon Musk: The future we're building -- and boring>, <Talk: Elon Musk: The mind behind Tesla, SpaceX, SolarCity ...>, <Talk: Adam Grant: The surprising habits of original thinkers>, <Talk: Stephen Petranek: Your kids might live on Mars. Here's how they'll survive>, <Talk: Larry Page: Where’s Google going next?>, <Talk: Grady Booch: Don't fear superintelligent AI>, <Talk: Jim Simons: The mathematician who cracked Wall Street>, <Talk: Naomi Oreskes: Why we should trust scientists>, <Talk: Stuart Russell: 3 principles for creating safer AI>, <Talk: Burt Rutan: The real future of space exploration>]>

Creating REST API endpoint

Now let’s create a simple API using Django REST Framework. We will use it later when we create the web page.

First define a serializer:

# talks/api/serializers.py
from rest_framework import serializers

from ..models import Talk


class TalkSerializer(serializers.ModelSerializer):
    class Meta:
        model = Talk
        fields = (
            'name',
            'description',
            'speaker',
            'url',
            'number_of_views',
            'transcript',
        )

Create a view:

# talks/api/views.py
from rest_framework import generics

from ..models import Talk
from ..search import search
from .serializers import TalkSerializer


class TalkList(generics.ListAPIView):
    queryset = Talk.objects.all()
    serializer_class = TalkSerializer

    def get_queryset(self):
        q = self.request.query_params.get('q')
        if q is not None:
            return search(q)
        return super().get_queryset()

Also don’t forget to add a new url to the list of urlpatterns:

# ted/urls.py
...
from talks.api.views import TalkList

urlpatterns = [
    ...
    path('api/v1/talks/', TalkList.as_view()),
]
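The endpoint takes the search phrase in the q parameter and, because PageNumberPagination is enabled in the settings, a page parameter for paging. Here is a small sketch of building such URLs from Python. The helper name talks_url is made up for illustration.

```python
from urllib.parse import urlencode


def talks_url(query=None, page=None, path="/api/v1/talks/"):
    """Build the API URL, leaving out parameters that were not given."""
    params = {}
    if query is not None:
        params["q"] = query
    if page is not None:
        params["page"] = page
    if not params:
        return path
    return path + "?" + urlencode(params)


print(talks_url("elon musk", page=2))  # /api/v1/talks/?q=elon+musk&page=2
```

Without q the view falls back to the plain queryset, so talks_url() simply lists all talks page by page.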

Creating a page

The last step is to create a page. I will use Vue.js to dynamically update the list of talks without reloading the page, but you can use whatever you want, or even write everything in plain JavaScript. I picked Vue.js because it’s simple and very easy to use.

Let’s add a view:

# talks/views.py
from django.shortcuts import render


def talk_list(request):
    return render(request, 'talks/talk_list.html')

Create a template:

<!-- talks/templates/talks/talk_list.html -->
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Talk List</title>
  <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/semantic-ui@2.4.2/dist/semantic.min.css">
</head>
<body>
  <div id="app">
    <div class="ui placeholder segment">
      <div class="ui input focus">
        <input
          v-model="query"
          type="text"
          placeholder="Search for talks..."
        />
      </div>
    </div>
    <div class="ui three column stackable grid container">
      <div v-for="talk in talks" class="column">
        <a class="ui card" :href="talk.url">
          <div class="content">
            <div class="header">[[ talk.name ]]</div>
            <div class="meta">[[ talk.speaker ]]</div>
            <div class="description">[[ talk.description ]]</div>
          </div>
          <div class="extra content">
            <i class="check icon"></i>
            [[ talk.number_of_views ]] Views
          </div>
        </a>
      </div>
    </div>
  </div>

  <script src="https://unpkg.com/vue"></script>
  <script src="https://unpkg.com/lodash"></script>
  <script src="https://unpkg.com/axios/dist/axios.min.js"></script>
  <script src="https://cdn.jsdelivr.net/npm/semantic-ui@2.4.2/dist/semantic.min.js"></script>
  <script>
    new Vue({
      el: '#app',
      delimiters: ['[[', ']]'],
      data: {
        query: '',
        talks: []
      },
      // This hook will be executed when the instance of
      // Vue is created
      async created () {
        this.talks = await this.getTalks()
      },
      methods: {
        // Sends a request to our API in order to get
        // a list of talks
        async getTalks () {
          const response = await axios.get('/api/v1/talks/', {
            params: {
              q: this.query
            }
          })
          return response.data.results
        }
      },
      watch: {
        // This function will be executed every time
        // the user changes `query`.
        // Using debounce from lodash library here allows us to
        // delay sending a request to an API until
        // the user has stopped changing `query`.
        // Stopped typing, basically.
        query: _.debounce(async function () {
          this.talks = await this.getTalks()
        }, 500)
      }
    })
  </script>
</body>
</html>
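The debounce idea used in the watcher is not specific to JavaScript. For readers more at home in Python, here is a rough sketch of the same behavior with threading.Timer: every call cancels the pending one and restarts the countdown, so the wrapped function runs only after the calls stop. The class name Debouncer is made up for illustration, and this is a simplified imitation rather than a port of the lodash implementation.

```python
import threading
import time


class Debouncer:
    """Delay calls to `func` until `delay` seconds pass without a new call."""

    def __init__(self, delay, func):
        self.delay = delay
        self.func = func
        self._timer = None
        self._lock = threading.Lock()

    def __call__(self, *args):
        with self._lock:
            # A new call cancels the one that is still waiting.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.delay, self.func, args)
            self._timer.start()


queries = []
on_query_change = Debouncer(0.05, queries.append)

# Simulate fast typing: only the last value should get through.
for text in ["e", "el", "elon"]:
    on_query_change(text)

time.sleep(0.3)  # give the timer time to fire
print(queries)   # ['elon']
```

In the page above the delay is 500 milliseconds, which means one API request per pause in typing instead of one per keystroke.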

And update urlpatterns:

# ted/urls.py
...
from talks.views import talk_list

urlpatterns = [
    ...
    path('talks/', talk_list),
]

Conclusion

In this article we’ve covered Elasticsearch at a high level, but it has a lot of features that are beyond the scope of a single article.

For further learning, you can read the documentation or, even better, try to build something yourself. Have fun!