StatsBomb

mplsoccer contains functions to return StatsBomb data in a flat, tidy dataframe.

Please be responsible with StatsBomb data. Register your details and read the user agreement carefully (on the same page).

It can be used with the StatsBomb open-data or, if you have access, with the StatsBomb API:

# this only works if you have access to the StatsBomb API
import requests
from mplsoccer.statsbomb import EVENT_SLUG, read_event
username = 'CHANGEME'
password = 'CHANGEME'
auth = requests.auth.HTTPBasicAuth(username, password)
URL = 'CHANGEME'
response = requests.get(URL, auth=auth)
df_dict = read_event(response)
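
Hard-coding the username and password in a script makes them easy to leak. As a minimal sketch, assuming the credentials are stored in environment variables called SB_USERNAME and SB_PASSWORD (hypothetical names chosen for this example, not something mplsoccer or StatsBomb define), they could be read at run time instead:

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password) read from a mapping of environment variables.

    SB_USERNAME and SB_PASSWORD are hypothetical names used only in this sketch.
    """
    try:
        return env['SB_USERNAME'], env['SB_PASSWORD']
    except KeyError as err:
        raise RuntimeError(f'missing environment variable: {err}') from err


# demonstration with a stand-in mapping rather than the real environment
username, password = load_credentials({'SB_USERNAME': 'user',
                                       'SB_PASSWORD': 'secret'})
```

The returned values can then be passed to requests.auth.HTTPBasicAuth in place of the 'CHANGEME' placeholders.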

Here are some alternatives to mplsoccer’s statsbomb module:

import glob
import os

import numpy as np
import pandas as pd

import mplsoccer.statsbomb as sbapi

Competition data

Get the competition data as a dataframe.

df_competition = sbapi.read_competition(sbapi.COMPETITION_URL, warn=False)
df_competition.info()

Out:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   competition_id             40 non-null     int64
 1   season_id                  40 non-null     int64
 2   country_name               40 non-null     object
 3   competition_name           40 non-null     object
 4   competition_gender         40 non-null     object
 5   competition_youth          40 non-null     bool
 6   competition_international  40 non-null     bool
 7   season_name                40 non-null     object
 8   match_updated              40 non-null     datetime64[ns]
 9   match_updated_360          40 non-null     object
 10  match_available_360        2 non-null      object
 11  match_available            40 non-null     datetime64[ns]
dtypes: bool(2), datetime64[ns](2), int64(2), object(6)
memory usage: 3.3+ KB
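
The competition_id and season_id columns identify the match files in the open-data repository, which follow the pattern matches/{competition_id}/{season_id}.json. The sketch below builds those URLs by hand for two (competition_id, season_id) pairs. The helper match_url and the constant OPEN_DATA_ROOT are names made up for this illustration; whether mplsoccer builds its links in exactly this way is an assumption.

```python
# root of the raw files in the StatsBomb open-data repository
OPEN_DATA_ROOT = 'https://raw.githubusercontent.com/statsbomb/open-data/master/data'


def match_url(competition_id, season_id):
    """Build the URL of the match file for one competition/season pair."""
    return f'{OPEN_DATA_ROOT}/matches/{competition_id}/{season_id}.json'


# example (competition_id, season_id) pairs
pairs = [(16, 42), (16, 76)]
urls = [match_url(competition_id, season_id)
        for competition_id, season_id in pairs]
for url in urls:
    print(url)
```

With the real data, the pairs could be taken from the competition_id and season_id columns of df_competition.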

Match data

Get the match data as a dataframe. Note that there is a mismatch between the length of this file and the number of event files because some event files don’t have match data.

match_links = sbapi.get_match_links()
match_dfs = [sbapi.read_match(file, warn=False) for file in match_links]
df_match = pd.concat(match_dfs)
df_match.info()

Out:

Skipping https://raw.githubusercontent.com/statsbomb/open-data/master/data/matches/16/42.json: empty json
Skipping https://raw.githubusercontent.com/statsbomb/open-data/master/data/matches/16/76.json: empty json
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1096 entries, 0 to 51
Data columns (total 50 columns):
 #   Column                           Non-Null Count  Dtype
---  ------                           --------------  -----
 0   match_id                         1096 non-null   int64
 1   match_date                       1096 non-null   datetime64[ns]
 2   kick_off                         1094 non-null   datetime64[ns]
 3   home_score                       1096 non-null   int64
 4   away_score                       1096 non-null   int64
 5   match_status_360                 1096 non-null   object
 6   last_updated                     1096 non-null   datetime64[ns]
 7   last_updated_360                 1096 non-null   object
 8   match_week                       1096 non-null   int64
 9   competition_id                   1096 non-null   int64
 10  competition_country_name         1096 non-null   object
 11  competition_name                 1096 non-null   object
 12  season_id                        1096 non-null   int64
 13  season_name                      1096 non-null   object
 14  home_team_id                     1096 non-null   int64
 15  home_team_name                   1096 non-null   object
 16  competition_gender               1096 non-null   object
 17  home_team_group                  124 non-null    object
 18  home_team_country_id             1096 non-null   int64
 19  home_team_country_name           1096 non-null   object
 20  away_team_id                     1096 non-null   int64
 21  away_team_name                   1096 non-null   object
 22  away_team_group                  124 non-null    object
 23  away_team_country_id             1096 non-null   int64
 24  away_team_country_name           1096 non-null   object
 25  metadata_data_version            1096 non-null   object
 26  metadata_shot_fidelity_version   900 non-null    object
 27  metadata_xy_fidelity_version     819 non-null    object
 28  competition_stage_id             1096 non-null   int64
 29  competition_stage_name           1096 non-null   object
 30  stadium_id                       1096 non-null   int64
 31  stadium_name                     1096 non-null   object
 32  stadium_country_id               1096 non-null   int64
 33  stadium_country_name             1096 non-null   object
 34  referee_id                       929 non-null    float64
 35  referee_name                     929 non-null    object
 36  referee_country_id               929 non-null    float64
 37  referee_country_name             929 non-null    object
 38  home_team_managers_id            1044 non-null   float64
 39  home_team_managers_name          1044 non-null   object
 40  home_team_managers_nickname      474 non-null    object
 41  home_team_managers_dob           1039 non-null   datetime64[ns]
 42  home_team_managers_country_id    1044 non-null   float64
 43  home_team_managers_country_name  1044 non-null   object
 44  away_team_managers_id            1043 non-null   float64
 45  away_team_managers_name          1043 non-null   object
 46  away_team_managers_nickname      484 non-null    object
 47  away_team_managers_dob           1037 non-null   datetime64[ns]
 48  away_team_managers_country_id    1043 non-null   float64
 49  away_team_managers_country_name  1043 non-null   object
dtypes: datetime64[ns](5), float64(6), int64(13), object(26)
memory usage: 436.7+ KB
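
The index above runs from 0 to 51 over 1096 rows, which suggests that pd.concat has kept each file's own index, so index labels repeat across files. A minimal sketch with two toy dataframes (stand-ins for the per-file match dataframes, not real StatsBomb data) shows the difference that ignore_index=True makes:

```python
import pandas as pd

# two toy dataframes standing in for the match dataframes of two files
df_a = pd.DataFrame({'match_id': [1, 2, 3]})
df_b = pd.DataFrame({'match_id': [4, 5]})

# default behaviour: each dataframe keeps its own index labels
df_kept = pd.concat([df_a, df_b])

# ignore_index=True builds a fresh 0..n-1 index for the combined dataframe
df_reset = pd.concat([df_a, df_b], ignore_index=True)

print(df_kept.index.is_unique)   # False
print(df_reset.index.is_unique)  # True
```

A unique index is not required for the rest of this example, but it avoids surprises when selecting rows by label with .loc.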

Lineup data

df_lineup = sbapi.read_lineup(f'{sbapi.LINEUP_SLUG}/7478.json', warn=False)
df_lineup.info()

Out:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   team_id               26 non-null     int64
 1   team_name             26 non-null     object
 2   match_id              26 non-null     int64
 3   player_id             26 non-null     int64
 4   player_name           26 non-null     object
 5   player_nickname       1 non-null      object
 6   player_jersey_number  26 non-null     int64
 7   player_cards          26 non-null     object
 8   player_positions      26 non-null     object
 9   player_country_id     26 non-null     int64
 10  player_country_name   26 non-null     object
dtypes: int64(5), object(6)
memory usage: 2.4+ KB

Event data

dict_event = sbapi.read_event(f'{sbapi.EVENT_SLUG}/7478.json', warn=False)
df_event = dict_event['event']
df_related_event = dict_event['related_event']
df_shot_freeze = dict_event['shot_freeze_frame']
df_tactics_lineup = dict_event['tactics_lineup']

# exploring the data
df_event.info()
df_related_event.info()
df_shot_freeze.info()
df_tactics_lineup.info()

Out:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3381 entries, 0 to 3380
Data columns (total 73 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   match_id                        3381 non-null   int64
 1   id                              3381 non-null   object
 2   index                           3381 non-null   int64
 3   period                          3381 non-null   int64
 4   timestamp_minute                3381 non-null   int64
 5   timestamp_second                3381 non-null   int64
 6   timestamp_millisecond           3381 non-null   int64
 7   minute                          3381 non-null   int64
 8   second                          3381 non-null   int64
 9   type_id                         3381 non-null   int64
 10  type_name                       3381 non-null   object
 11  sub_type_id                     421 non-null    float64
 12  sub_type_name                   421 non-null    object
 13  outcome_id                      612 non-null    float64
 14  outcome_name                    612 non-null    object
 15  play_pattern_id                 3381 non-null   int64
 16  play_pattern_name               3381 non-null   object
 17  possession_team_id              3381 non-null   int64
 18  possession                      3381 non-null   int64
 19  possession_team_name            3381 non-null   object
 20  team_id                         3381 non-null   int64
 21  team_name                       3381 non-null   object
 22  player_id                       3342 non-null   float64
 23  player_name                     3342 non-null   object
 24  position_id                     3342 non-null   float64
 25  position_name                   3342 non-null   object
 26  duration                        2583 non-null   float64
 27  x                               3335 non-null   float64
 28  y                               3335 non-null   float64
 29  z                               0 non-null      float64
 30  end_x                           1844 non-null   float64
 31  end_y                           1844 non-null   float64
 32  end_z                           19 non-null     float64
 33  body_part_id                    963 non-null    float64
 34  body_part_name                  963 non-null    object
 35  technique_id                    36 non-null     float64
 36  technique_name                  36 non-null     object
 37  under_pressure                  607 non-null    float64
 38  counterpress                    97 non-null     float64
 39  pass_length                     1018 non-null   float64
 40  pass_angle                      1018 non-null   float64
 41  pass_recipient_id               775 non-null    float64
 42  pass_recipient_name             775 non-null    object
 43  pass_height_id                  1018 non-null   float64
 44  pass_height_name                1018 non-null   object
 45  pass_switch                     29 non-null     object
 46  pass_assisted_shot_id           17 non-null     object
 47  pass_goal_assist                3 non-null      object
 48  pass_cross                      27 non-null     object
 49  pass_backheel                   5 non-null      object
 50  pass_shot_assist                14 non-null     object
 51  bad_behaviour_card_id           1 non-null      float64
 52  bad_behaviour_card_name         1 non-null      object
 53  ball_recovery_recovery_failure  10 non-null     object
 54  ball_recovery_offensive         1 non-null      object
 55  block_offensive                 2 non-null      object
 56  dribble_overrun                 2 non-null      object
 57  dribble_nutmeg                  1 non-null      object
 58  foul_committed_penalty          1 non-null      object
 59  foul_committed_offensive        2 non-null      object
 60  foul_committed_card_id          1 non-null      float64
 61  foul_committed_card_name        1 non-null      object
 62  foul_won_defensive              7 non-null      object
 63  foul_won_penalty                1 non-null      object
 64  goalkeeper_position_id          22 non-null     float64
 65  goalkeeper_position_name        22 non-null     object
 66  shot_statsbomb_xg               25 non-null     float64
 67  shot_key_pass_id                17 non-null     object
 68  shot_one_on_one                 4 non-null      object
 69  substitution_replacement_id     4 non-null      float64
 70  substitution_replacement_name   4 non-null      object
 71  tactics_formation               3 non-null      float64
 72  aerial_won                      27 non-null     object
dtypes: float64(25), int64(13), object(35)
memory usage: 1.9+ MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6426 entries, 0 to 6425
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   id                 6426 non-null   object
 1   id_related         6426 non-null   object
 2   type_name          6426 non-null   object
 3   index              6426 non-null   int64
 4   type_name_related  6426 non-null   object
 5   index_related      6426 non-null   int64
 6   match_id           6426 non-null   int64
dtypes: int64(3), object(4)
memory usage: 401.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279 entries, 0 to 278
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   id                    279 non-null    object
 1   event_freeze_id       279 non-null    object
 2   player_teammate       279 non-null    bool
 3   player_id             279 non-null    int64
 4   player_name           279 non-null    object
 5   player_position_id    279 non-null    int64
 6   player_position_name  279 non-null    object
 7   x                     279 non-null    float64
 8   y                     279 non-null    float64
 9   match_id              279 non-null    int64
dtypes: bool(1), float64(2), int64(3), object(4)
memory usage: 20.0+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   id                    33 non-null     object
 1   event_tactics_id      33 non-null     object
 2   player_jersey_number  33 non-null     int64
 3   player_id             33 non-null     int64
 4   player_name           33 non-null     object
 5   player_position_id    33 non-null     int64
 6   player_position_name  33 non-null     object
 7   match_id              33 non-null     int64
dtypes: int64(4), object(4)
memory usage: 2.2+ KB
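
The four dataframes share identifier columns, so they can be joined back together. The sketch below uses small toy dataframes (not real StatsBomb data) and assumes that the id column of the shot freeze frame holds the id of the shot event it belongs to. Under that assumption, each freeze-frame row can be given the xG of its shot:

```python
import pandas as pd

# toy stand-in for dict_event['event']
df_event = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'type_name': ['Pass', 'Shot', 'Shot'],
    'shot_statsbomb_xg': [None, 0.1, 0.4],
})

# toy stand-in for dict_event['shot_freeze_frame']
# assumption: 'id' refers to the id of the shot event
df_shot_freeze = pd.DataFrame({
    'id': ['b', 'b', 'c'],
    'player_name': ['P1', 'P2', 'P3'],
    'player_teammate': [True, False, False],
})

# keep only the shots and the columns needed for the join
df_shots = df_event.loc[df_event['type_name'] == 'Shot',
                        ['id', 'shot_statsbomb_xg']]

# many freeze-frame rows map to one shot; validate checks that assumption
df_freeze_shots = df_shot_freeze.merge(df_shots, on='id', how='left',
                                       validate='m:1')
print(df_freeze_shots)
```

The validate='m:1' argument makes pandas raise an error if an id appears more than once in df_shots, which guards against accidentally duplicating freeze-frame rows.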

