Analyzing Messi and Ronaldo's Games using Python and Streamlit // Adil Moujahid // Data Analytics and more

In 2003, Michael Lewis published "Moneyball", a book about Billy Beane, the Oakland Athletics general manager who applied statistical analysis to baseball to identify and recruit undervalued players. Using data, Beane achieved as many wins as teams with more than double the payroll, and reached the play-offs in 4 successive years from 2000 to 2003.

In 2011, Moneyball was adapted into a movie, with Brad Pitt playing Billy Beane. Both the book and the movie were successful and popularized the idea of using data to improve sports teams' performance. The use of data in sport is often referred to as sports analytics.

In baseball, the nature of the sport makes it easy to collect many data points about in-game action. A database covering in-game data points and other statistics about players and teams going back to 1871 can be downloaded from this link. If you're interested in analyzing baseball data, you can find here a blog post on the topic that I wrote a few years back.

In the case of football (soccer), data collection is more complex. Football is a dynamic sport with 22 players on the pitch and unlimited possibilities of ball movement and player positioning. Fortunately, in the last few years, advances in sensors and video analysis have made it possible to obtain high-quality football data that can be used to analyze games, teams and players. In this blog post, we will use an open collection of football logs to create a web app that analyzes Messi and Ronaldo's game during the 2017-18 LaLiga season [1]. We will use Python and Streamlit to create an interactive web app that compares both players' stats and shows their positions on the pitch.

I would like to thank Luca Pappalardo and his colleagues for making this great dataset available to the public.

Below is an animated GIF of the application that we will build. You can find the source code in this GitHub repository.

1. Getting, Reading and Structuring the Data

Messi and Ronaldo dominated world football during the last decade with a combined 11 FIFA Ballon d'Or awards (six for Messi and five for Ronaldo). Both players are considered to be amongst the greatest players of all time and they're frequently compared to each other.

In this blog post, we will analyze the games of both players during LaLiga (Spanish League) season 2017-18. This was Ronaldo's last season in Spain before moving to Juventus.

To start with, we need to download the datasets introduced in the paper "A public data set of spatio-temporal match events in soccer competitions" from this link. We need the following:

  • matches/matches_Spain.json: Information about LaLiga (Spanish football league) season 2017-18 matches.
  • events/events_Spain.json: All the events that occur during each match of LaLiga season 2017-18.
  • players.json: All players of the teams playing in seven national and international soccer competitions (Italian, Spanish, French, German, English first divisions, World Cup 2018, European Cup 2016).
  • teams.json: All teams in seven prominent soccer competitions (Italian, Spanish, German, French and English first divisions, World Cup 2018, European Cup 2016).
  • tags2name.csv: Mapping of tag identifiers to tag names.

We start by importing the different Python libraries that we need.

In [1]:

import json
import unicodedata
import numpy as np
import pandas as pd

From players.json, we can find the player id "wyId" of both players:

  • 3359 for Messi
  • 3322 for Ronaldo

From teams.json, we can also find the team id "wyId" of both teams:

  • 676 for FC Barcelona
  • 675 for Real Madrid
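
The lookup described above can be sketched as follows. This is a minimal sketch, assuming each record in players.json carries "shortName" and "wyId" fields; the helper name find_ids and the sample records are illustrative only and stand in for the data you would load with json.load.

```python
def find_ids(records, name_field, fragment):
    """Return (name, wyId) pairs whose name contains `fragment`, ignoring case."""
    fragment = fragment.lower()
    return [
        (record[name_field], record["wyId"])
        for record in records
        if fragment in record[name_field].lower()
    ]


# Illustrative records only; the real list would come from json.load(players.json).
sample_players = [
    {"shortName": "L. Messi", "wyId": 3359},
    {"shortName": "Cristiano Ronaldo", "wyId": 3322},
    {"shortName": "Placeholder Player", "wyId": 1},  # made-up entry
]

print(find_ids(sample_players, "shortName", "messi"))
print(find_ids(sample_players, "shortName", "ronaldo"))
```

The same helper could be pointed at teams.json by passing the name field that file uses for team names.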

Next, we read the Spanish matches and events datasets:

In [2]:

with open('../data/matches/matches_Spain.json') as json_file:
    matches_spain_data = json.load(json_file)

with open('../data/events/events_Spain.json') as json_file:
    events_spain_data = json.load(json_file)

1.3. Structuring the Data

Structuring Messi and Ronaldo's events data

Next, we will structure all Real Madrid and FC Barcelona match information into 2 Pandas DataFrames.

In [3]:

barca_matches = [match for match in matches_spain_data if '676' in match['teamsData'].keys()]
real_matches = [match for match in matches_spain_data if '675' in match['teamsData'].keys()]

In [4]:

barca_matches_df = pd.DataFrame(barca_matches)
real_matches_df = pd.DataFrame(real_matches)

Next, we will structure all Messi and Ronaldo's events data into 2 Pandas DataFrames.

In [6]:

messi_events_data = []
for event in events_spain_data:
    if event['playerId'] == 3359:
        messi_events_data.append(event)

messi_events_data_df = pd.DataFrame(messi_events_data)

In [7]:

ronaldo_events_data = []
for event in events_spain_data:
    if event['playerId'] == 3322:
        ronaldo_events_data.append(event)

ronaldo_events_data_df = pd.DataFrame(ronaldo_events_data)
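
The two loops above can also be written as a boolean-mask filter on a single DataFrame. The snippet below is a minimal sketch of that alternative on a made-up sample; sample_events stands in for events_spain_data, and only the playerId and eventName fields are assumed.

```python
import pandas as pd

# Made-up sample standing in for events_spain_data.
sample_events = [
    {"playerId": 3359, "eventName": "Pass"},
    {"playerId": 3322, "eventName": "Shot"},
    {"playerId": 3359, "eventName": "Shot"},
]

events_df = pd.DataFrame(sample_events)

# Keep only the rows of one player and renumber the index from 0.
messi_df = events_df[events_df["playerId"] == 3359].reset_index(drop=True)
ronaldo_df = events_df[events_df["playerId"] == 3322].reset_index(drop=True)

print(len(messi_df), len(ronaldo_df))
```

The trade-off is memory: this version loads every event of the season into a DataFrame before filtering, whereas the loops only keep the events of one player.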

From tags2name.csv, we select the event tags that are interesting for our analysis.

  • 101: Goal
  • 301: Assist
  • 302: Key Pass
  • 401: Left Foot
  • 402: Right Foot

We add these tags as new columns in the events DataFrames.

In [8]:

def add_tag(tags, tag_id):
    return tag_id in [tag['id'] for tag in tags]
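
To see what add_tag does, it helps to run it on one value of the tags column. The sketch below repeats the function so that it runs on its own; the sample list is made up and only assumes that each tag is a dictionary with an "id" key, as the function itself does.

```python
def add_tag(tags, tag_id):
    # True when the event carries the given tag identifier.
    return tag_id in [tag['id'] for tag in tags]


# Made-up tags for a single event: goal (101), left foot (401) and one other tag.
sample_tags = [{'id': 101}, {'id': 401}, {'id': 1801}]

is_goal = add_tag(sample_tags, 101)
is_right_foot = add_tag(sample_tags, 402)
print(is_goal, is_right_foot)
```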

In [9]:



Index(['eventId', 'subEventName', 'tags', 'playerId', 'positions', 'matchId', 'eventName', 'teamId', 'matchPeriod', 'eventSec', 'subEventId', 'id'], dtype='object')

In [10]:

messi_events_data_df['goal'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 101))
messi_events_data_df['assist'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 301))
messi_events_data_df['key_pass'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 302))
messi_events_data_df['left_foot'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 401))
messi_events_data_df['right_foot'] = messi_events_data_df['tags'].apply(lambda x: add_tag(x, 402))

ronaldo_events_data_df['goal'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 101))
ronaldo_events_data_df['assist'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 301))
ronaldo_events_data_df['key_pass'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 302))
ronaldo_events_data_df['left_foot'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 401))
ronaldo_events_data_df['right_foot'] = ronaldo_events_data_df['tags'].apply(lambda x: add_tag(x, 402))

In [12]:

messi_events_data_df = pd.merge(messi_events_data_df, barca_matches_df, left_on='matchId', right_on='wyId', copy=False, how="left")

In [13]:

ronaldo_events_data_df = pd.merge(ronaldo_events_data_df, real_matches_df, left_on='matchId', right_on='wyId', copy=False, how="left")
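
The two merges above attach the match information to every event row by matching the event's matchId with the match's wyId. Here is a small sketch of the same left join on made-up frames; the labels and ids are invented for illustration.

```python
import pandas as pd

# Made-up events and matches; only the join keys mirror the real data.
events = pd.DataFrame({
    "matchId": [1, 1, 2],
    "eventName": ["Pass", "Shot", "Pass"],
})
matches = pd.DataFrame({
    "wyId": [1, 2],
    "label": ["Team A - Team B, 1 - 0", "Team C - Team A, 2 - 2"],
})

# how="left" keeps every event, even if its match were missing from `matches`.
merged = pd.merge(events, matches, left_on="matchId", right_on="wyId", how="left")
print(merged[["eventName", "label"]])
```

With how="left" the number of event rows is preserved as long as wyId is unique in the matches frame, which is why each event ends up with exactly one match label.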

Next, we will create 2 DataFrames with all Real Madrid and FC Barcelona LaLiga matches during season 2017-18 and the corresponding dates.

In [16]:

barca_matches_dates_df = barca_matches_df[['label', 'date']].copy()
real_matches_dates_df = real_matches_df[['label', 'date']].copy()

In [17]:

barca_matches_dates_df['date'] = pd.to_datetime(barca_matches_df['date'], utc=True)
real_matches_dates_df['date'] = pd.to_datetime(real_matches_df['date'], utc=True)

In [18]:

# Change date to string
barca_matches_dates_df['date'] = barca_matches_dates_df['date'].apply(lambda x: x.strftime('%Y-%m-%d'))
real_matches_dates_df['date'] = real_matches_dates_df['date'].apply(lambda x: x.strftime('%Y-%m-%d'))

In [19]:

barca_matches_dates_df = barca_matches_dates_df.rename(columns={"label": "match"})
real_matches_dates_df = real_matches_dates_df.rename(columns={"label": "match"})
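
The four steps above (select, parse, format, rename) can also be chained into one expression. The following is a sketch on made-up values, assuming the dates parse with pd.to_datetime; the .dt.strftime accessor replaces the apply with a lambda.

```python
import pandas as pd

# Made-up matches; the real frame would be barca_matches_df or real_matches_df.
matches = pd.DataFrame({
    "label": ["Team A - Team B, 1 - 0", "Team C - Team A, 2 - 2"],
    "date": ["2018-05-20 18:45:00", "2018-05-13 16:30:00"],
})

dates_df = (
    matches[["label", "date"]]
    .assign(date=lambda df: pd.to_datetime(df["date"], utc=True).dt.strftime("%Y-%m-%d"))
    .rename(columns={"label": "match"})
)
print(dates_df)
```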

In this section, we will analyze Messi and Ronaldo's events DataFrames. We will compute a few statistics and we will see how we can plot the events on a football pitch.

Total number of events broken down by player and event type

In [22]:

goals = [messi_events_data_df['goal'].sum(), ronaldo_events_data_df['goal'].sum()]
assists = [messi_events_data_df['assist'].sum(), ronaldo_events_data_df['assist'].sum()]
shots = [messi_events_data_df[messi_events_data_df['eventName'] == 'Shot'].count()['eventName'], ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Shot'].count()['eventName']]
free_kicks = [messi_events_data_df[messi_events_data_df['subEventName'] == 'Free kick shot'].count()['subEventName'], ronaldo_events_data_df[ronaldo_events_data_df['subEventName'] == 'Free kick shot'].count()['subEventName']]
passes = [messi_events_data_df[messi_events_data_df['eventName'] == 'Pass'].count()['eventName'], ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Pass'].count()['eventName']]

stats_df = pd.DataFrame([goals, assists, shots, free_kicks, passes], columns=['Messi', 'Ronaldo'], index=['Goals', 'Assists', 'Shots', 'Free Kicks', 'Passes'])
print(stats_df)
            Messi  Ronaldo
Goals          34       26
Assists        13        5
Shots         142      151
Free Kicks     47       15
Passes       1787      727
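
The counting expressions above repeat the same pattern for every event type. A small helper can shorten them; the sketch below is one possible version, with a hypothetical function name count_events and a made-up sample frame.

```python
import pandas as pd


def count_events(events_df, column, value):
    """Number of rows where `column` equals `value`."""
    return int((events_df[column] == value).sum())


# Made-up events; only the two column names mirror the real DataFrames.
sample = pd.DataFrame({
    "eventName": ["Pass", "Shot", "Pass", "Free Kick"],
    "subEventName": ["Simple pass", "Shot", "Cross", "Free kick shot"],
})

n_passes = count_events(sample, "eventName", "Pass")
n_free_kicks = count_events(sample, "subEventName", "Free kick shot")
print(n_passes, n_free_kicks)
```

Summing a boolean mask counts the matching rows directly, without building a filtered copy of the DataFrame first.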

In [23]:

messi_lf_goals = messi_events_data_df[messi_events_data_df['left_foot'] == True]['goal'].sum()
messi_rf_goals = messi_events_data_df[messi_events_data_df['right_foot'] == True]['goal'].sum()

print("Messi's goals with left foot: ", messi_lf_goals)
print("Messi's goals with right foot: ", messi_rf_goals)
Messi's goals with left foot: 32
Messi's goals with right foot: 2

In [24]:

ronaldo_lf_goals = ronaldo_events_data_df[ronaldo_events_data_df['left_foot'] == True]['goal'].sum()
ronaldo_rf_goals = ronaldo_events_data_df[ronaldo_events_data_df['right_foot'] == True]['goal'].sum()

In [25]:

print("Ronaldo's goals with left foot: ", ronaldo_lf_goals)
print("Ronaldo's goals with right foot: ", ronaldo_rf_goals)
Ronaldo's goals with left foot: 7
Ronaldo's goals with right foot: 14

For each event in messi_events_data_df and ronaldo_events_data_df, we have the origin and destination positions associated with the event. Each position is a pair of coordinates (x, y). The x and y coordinates are always in the range [0, 100] and indicate the percentage of the field from the perspective of the attacking team. [2] We will use these positions to plot the events on a football pitch.

In [26]:



0 [{'y': 50, 'x': 50}, {'y': 50, 'x': 40}]
1 [{'y': 64, 'x': 71}, {'y': 67, 'x': 54}]
2 [{'y': 62, 'x': 62}, {'y': 64, 'x': 69}]
3 [{'y': 64, 'x': 69}, {'y': 74, 'x': 83}]
4 [{'y': 74, 'x': 83}, {'y': 61, 'x': 77}]
Name: positions, dtype: object
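
To work with these positions outside of a plot, the percentages can be converted to distances. The sketch below assumes a 105 m x 68 m pitch, which is a common size but is not stated in the dataset; the function names to_metres and origin_and_destination are illustrative.

```python
def to_metres(position, pitch_length=105.0, pitch_width=68.0):
    """Convert a percentage position {'x': .., 'y': ..} to metres (assumed pitch size)."""
    return (position["x"] * pitch_length / 100, position["y"] * pitch_width / 100)


def origin_and_destination(positions):
    """Split one value of the positions column into origin and destination points."""
    origin = to_metres(positions[0])
    # Some events may carry a single position; in that case there is no destination.
    destination = to_metres(positions[1]) if len(positions) > 1 else None
    return origin, destination


# First row of the output shown above.
origin, destination = origin_and_destination([{'y': 50, 'x': 50}, {'y': 50, 'x': 40}])
print(origin, destination)
```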

In [27]:

from plots import *
from bokeh.io import output_notebook
from bokeh.plotting import figure, show

We will use Bokeh to draw the football pitch and plot the events. I prepared 2 Python functions to simplify both tasks:

  • draw_pitch(): Function to draw an empty pitch
  • plot_events(player_events, event_name, plot_color): Function that takes as input the events DataFrame, an event name and a color, and plots the events on a football pitch

You can find the source code of both functions here.

In [28]:

messi_goals = messi_events_data_df[messi_events_data_df['goal'] == True]['positions']

In [29]:

p_messi = plot_events(messi_goals, 'Goals', 'red')

Now that we understand how to read, structure and plot the data, we can start building the web app. The goal of the app is to compare the games of Messi and Ronaldo by focusing on Goals, Assists, Shots, Free Kicks and Passes.

The app will have one tab for each event type. In each tab, we will show the statistics and positions of the events, and the breakdown of the event count by game. The app will also have a filter that we can use to select the events by left or right foot.

We will use an open-source app framework called Streamlit. Streamlit is a Python library that can be installed with pip (pip install streamlit). It is easy to use and lets us create web applications in Python only, without writing HTML, JavaScript or CSS.

You can download the source from this GitHub repo. To start the application, run the streamlit run command on the app script from your terminal, then open http://localhost:8501 in your browser.

The first function, get_data(foot), reads the pickle files and returns the messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df and real_matches_dates_df DataFrames. It also filters the events by left or right foot if we pass Left or Right as the parameter.

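The decorator @st.cache(allow_output_mutation=True) caches the result of get_data(foot), so the data is not reloaded on every rerun of the script; the function body runs again only when it is called with a new value of foot. The allow_output_mutation=True option tells Streamlit not to check whether the returned objects were modified after being cached. In recent Streamlit releases, st.cache has been replaced by st.cache_data and st.cache_resource.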

def get_data(foot):
    . . .
    return messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df
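
The body of get_data is elided above. As an illustration of the foot filter it describes, here is a minimal sketch, assuming the boolean left_foot and right_foot columns created earlier; the helper name filter_by_foot is hypothetical and not taken from the app's source.

```python
import pandas as pd


def filter_by_foot(events_df, foot):
    """Keep left-foot or right-foot events; any other value returns all events."""
    if foot == "Left":
        return events_df[events_df["left_foot"]]
    if foot == "Right":
        return events_df[events_df["right_foot"]]
    return events_df


# Made-up events with the two boolean columns.
sample = pd.DataFrame({
    "eventName": ["Shot", "Shot", "Pass"],
    "left_foot": [True, False, False],
    "right_foot": [False, True, False],
})

left_only = filter_by_foot(sample, "Left")
everything = filter_by_foot(sample, "Either Left or Right")
print(len(left_only), len(everything))
```

Note that events tagged with neither foot, such as the third row here, disappear as soon as Left or Right is selected.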

Creating tabs for each event type

Streamlit is a powerful library for building web apps and user interfaces; however, at the time of writing, the library did not support the creation of tabs natively (newer releases provide st.tabs). In order to add tabs to the application, we will use Bokeh and the method described here.

For each event type, we have a function that takes as input the 4 DataFrames and, for each player, draws the event positions on a football pitch and a table with the breakdown of events by game. The function combines the 2 plots and the 2 tables in a Bokeh grid and returns the grid as a Bokeh Panel.

def plot_goals(messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df):
    # Getting events data positions
    messi_goals = messi_events_data_df[messi_events_data_df['goal'] == True]['positions']
    ronaldo_goals = ronaldo_events_data_df[ronaldo_events_data_df['goal'] == True]['positions']

    # Pitch with events
    p_messi = plot_events(messi_goals, 'Goals', 'red')
    p_ronaldo = plot_events(ronaldo_goals, 'Goals', 'blue')

    ....

    grid = bokeh.layouts.grid(
        children=[
            [p_messi, p_ronaldo],
            [print_table(messi_stats_df), print_table(ronaldo_stats_df)],
        ],
        sizing_mode="stretch_width",
    )
    return bokeh.models.Panel(child=grid, title="Goals")
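
The part of plot_goals that builds the per-game tables is elided above. One way such a breakdown could be computed is sketched below; it assumes the events DataFrame carries the match label column added by the earlier merge, and the helper name events_by_match is hypothetical.

```python
import pandas as pd


def events_by_match(events_df, flag_column):
    """Count the events flagged True in `flag_column` for each match label."""
    selected = events_df[events_df[flag_column]]
    counts = selected.groupby("label").size().rename("count")
    return counts.reset_index().rename(columns={"label": "match"})


# Made-up events; labels and flags are invented for illustration.
sample = pd.DataFrame({
    "label": ["Team A - Team B", "Team A - Team B", "Team C - Team A", "Team C - Team A"],
    "goal": [True, True, False, True],
})

goals_by_match = events_by_match(sample, "goal")
print(goals_by_match)
```

Matches without any flagged event do not appear in the result; merging it with the matches and dates DataFrame would be one way to bring them back with a count of zero.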

In the main function at the end of the app script, you can see how the 5 functions are used to create one tab for each event type.

tabs = bokeh.models.Tabs(
    tabs=[
        plot_goals(messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df),
        plot_assists(messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df),
        plot_shots(messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df),
        plot_free_kicks(messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df),
        plot_passes(messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df),
    ]
)

In the main function, we define a Streamlit radio widget that we can use to filter the data by foot (st.radio here; st.sidebar.radio would place it in the sidebar).

foot = st.radio("Foot", ('Either Left or Right', 'Left', 'Right'))
messi_events_data_df, ronaldo_events_data_df, barca_matches_dates_df, real_matches_dates_df = get_data(foot)

Stats of both players as Table

In the main function, we calculate the stats of both Messi and Ronaldo and display them as a DataFrame using streamlit.dataframe.

goals = [messi_events_data_df['goal'].sum(), ronaldo_events_data_df['goal'].sum()]
assists = [messi_events_data_df['assist'].sum(), ronaldo_events_data_df['assist'].sum()]
shots = [messi_events_data_df[messi_events_data_df['eventName'] == 'Shot'].count()['eventName'], ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Shot'].count()['eventName']]
free_kicks = [messi_events_data_df[messi_events_data_df['subEventName'] == 'Free kick shot'].count()['subEventName'], ronaldo_events_data_df[ronaldo_events_data_df['subEventName'] == 'Free kick shot'].count()['subEventName']]
passes = [messi_events_data_df[messi_events_data_df['eventName'] == 'Pass'].count()['eventName'], ronaldo_events_data_df[ronaldo_events_data_df['eventName'] == 'Pass'].count()['eventName']]

stats_df = pd.DataFrame([goals, assists, shots, free_kicks, passes], columns=['Messi', 'Ronaldo'], index=['Goals', 'Assists', 'Shots', 'Free Kicks', 'Passes'])

st.sidebar.markdown("""
### Stats
""")

In this blog post, we saw how to build a web app that analyzes Messi and Ronaldo's game during the 2017-18 LaLiga season using Python and Streamlit. The dataset and the source code from this post can be adapted to other use cases, for example comparisons between other players, teams and even championships.