From 43f7197c377a8e22740992628e3848bc8f5160f8 Mon Sep 17 00:00:00 2001 From: Matthias Kempe <60000189+MatKempeGroningen@users.noreply.github.com> Date: Mon, 19 Jan 2026 15:09:33 +0100 Subject: [PATCH 1/2] Revise summary and statement of need for clarity Updated the summary and statement of need sections to enhance clarity and correct typographical errors. Improved descriptions of soccer analytics and data parsing functionalities. --- joss-paper/paper.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/joss-paper/paper.md b/joss-paper/paper.md index 31d7f89..c114842 100644 --- a/joss-paper/paper.md +++ b/joss-paper/paper.md @@ -35,16 +35,16 @@ bibliography: paper.bib # Summary -Over the last decade, there has been a growing interest in soccer analysis from different backgrounds. First, practical decisions and benchmarks are influenced by aggregated metrics such as pass success percentage and expected goals (xG) [@Goes2020a]. Second, internal and external load metrics are used for periodization and are related to injury predictions [@Hader2019]. Third, soccer is very dynamic, but still constrained enough to use it to study individual, subgroup, and group behaviour [@Goes2020b]. The interest in soccer analysis has increased because data has become more openly available [Bassek2025]. However, a key challenge is that every data provider uses their own data format, which makes it hard to compare and switch between different providers. Currently, open-source packages like [Kloppy](https://kloppy.pysport.org) try to overcome this challenge by providing a uniform data format. Similarly, the scientific side proposes a common data format for soccer game data [@Anzer2025]. While Kloppy focuses primarily on parsing soccer data, Floodlight [@Raabe2022] delivers a framework for physical analysis of team sports, and [mplsoccer](https://github.com/andrewRowlinson/mplsoccer) is widely utilized for visualising soccer data. 
+Over the last decade, there has been a growing interest in soccer analytics from different backgrounds and for differnt use cases. Examplary use cases are : first, practical decision making and benchmarking of players based on aggregated metrics such as pass success percentage and expected goals (xG) [@Goes2020a]. Second, using internal and external load metrics for training periodization and injury predictions [@Hader2019]. Third, basic behavioural science soccer with a focus on group and subgroup behaviour[@Goes2020b]. The interest in soccer analysis as data has become more openly available [Bassek2025]. However, a key challenge is that every data provider uses their own data format, which makes it hard to compare and switch between different providers and create large datasets that encompass differnt leagues and competitions. Currently, open-source packages like [Kloppy](https://kloppy.pysport.org) try to overcome this challenge by providing a uniform data format. Similarly, the scientific side proposes a common data format for soccer game data [@Anzer2025]. While Kloppy focuses primarily on parsing soccer data, Floodlight [@Raabe2022] delivers a framework for physical analysis of team sports, and [mplsoccer](https://github.com/andrewRowlinson/mplsoccer) is widely utilized for visualising soccer data. -Lately, there has been a growing interest in combining event and tracking data for contextualised tactical analysis of soccer games. This provides the possibility to not only know that a pass happened at a specific moment in the match (event data) but also what the defensive structure was during this pass [@Forcher2022; Herold2022], and what other passing options were available at this moment (tracking data) [@Spearman2017]. Contextual analysis goes beyond aggregated metrics and provides the ability to do quantitative analysis of single moments or specific phases in the game [@Oonk2025a; Jerome2024]. 
A key challenge here is in merging the event data and tracking data together. [`DataBallPy`](https://databallpy.readthedocs.io/en/latest/) is an open source python package for contextual analysis of soccer games because (1) it uses a standardazed data format for both event and tracking data, (2) it provides a framework where all data of a game is bundled, instead of considered as seperate data objects, (3) it includes a high quality synchronsiation algorithm that works on any combination of tracking and event data providers, and (4) it has integrated multiple practical and scientific features within the package that allow for efficient computation with minimal user input. +Lately, there has been a growing interest in combining event and tracking data for contextualised tactical analysis of soccer games. This provides the possibility to not only know that a pass happened at a specific moment in the match (event data) but also what the defensive structure was during this pass [@Forcher2022; Herold2022], and what other passing options were available at this moment (tracking data) [@Spearman2017]. Contextual analysis goes beyond aggregated metrics and provides the ability to do quantitative analysis of single moments or specific phases in the game [@Oonk2025a; Jerome2024]. A key challenge for this is in merging the event data and tracking data together. [`DataBallPy`](https://databallpy.readthedocs.io/en/latest/) is an open source python package for contextual analysis of soccer games because (1) it uses a standardazed data format for both event and tracking data, (2) it provides a framework where all data of a game is bundled, instead of considered as seperate data objects, (3) it includes a high quality synchronsiation algorithm that works on any combination of tracking and event data providers, and (4) it has integrated multiple practical and scientific features within the package that allow for efficient computation with minimal user input. 
# Statement of need Modern soccer analytics increasingly rely on both event data and tracking data for a comprehensive analysis. Event data captures specific information about events (e.g., passes and shots) like their location, success, start location, and the athlete involved in the action. This information on itself is primarily aggregated for tactical game and player analysis [@Goes2020a] but is also widely used in scouting because of the low cost and widespread availability of the data [@vanArem2025]. Tracking data, on the other hand, captures spatiotemporal information of all athletes and the ball at frequencies ranging between 10 and 25 Hz [@Linke2020]. This data is primarily used to quantify physical performance, but also for the detection of dynamic formation [@sotudeh2025], detection of events[@Vidal-Codina2022], detection of game phases [@Bauer2023], space occupation [@Spearman2017; Rein2017], and quantification of dangerousity [@Link2016]. -The current package allows for parsing [Kloppy](https://kloppy.pysport.org) and analysis [@Raabe2022] of either data stream independently. However, there has been a growing interest in combining event and tracking data to enrich event information with spatiotemporal context. This added context provides insights and nuances, primarily on a tactical level, that neither event nor tracking data can provide independently. For example, shot events are enriched with information about defensive and keeper positioning to create better expected goals models [@Anzer2021], passes are evaluated by making risk reward assessments of all possible passing options [@Goes2021], determinants of successful 1v1 actions are modelled from spatiotemporal features [@Oonk2025a], and the spatiotemporal context of events is used to predict dangerousity of a game state [@Fernandez2021]. A contextual analysis requires a proper synchronisation of event and tracking data, and a convenient data structure for further analysis. 
Current packages either have a separation between event and tracking data with limited options to combine them [@Raabe2022], or focus only on the synchronistation approach, limiting the convenient data structure to start your analysis after merging the data streams [@VanRoy2024; @Kim2025] +The currently avaiable packages allow for parsing [Kloppy](https://kloppy.pysport.org) and analysis [@Raabe2022] of either data stream independently. However, there has been a growing interest in combining event and tracking data to enrich event information with spatiotemporal context. This added context provides insights and nuances, primarily on a tactical level, that neither event nor tracking data can provide independently. For example, shot events are enriched with information about defensive and keeper positioning to create better expected goals models [@Anzer2021], passes are evaluated by making risk reward assessments of all possible passing options [@Goes2021], determinants of successful 1v1 actions are modelled from spatiotemporal features [@Oonk2025a], and the spatiotemporal context of events is used to predict dangerousity of a game state [@Fernandez2021]. A contextual analysis requires a proper synchronisation of event and tracking data, and a convenient data structure for further analysis. Current packages either have a separation between event and tracking data with limited options to combine them [@Raabe2022], or focus only on the synchronistation approach, limiting the convenient data structure to start your analysis after merging the data streams [@VanRoy2024; @Kim2025] `DataBallPy` addresses this gap by combining all game-related data in a standardized `Game` object. The `Game` object includes event, tracking, and metadata. The primary feature of `DataBallPy` is the robust and efficient synchronistation between event and tracking data. 
Although event and tracking data often both provide timestamps, their alignment has shown to be extremely poor with reported errors of 1.82 (+-4.06) seconds [@Anzer2021]. Especially, the random error is concerning since it does not allow for easy correction, and within 4 seconds, the game might have evolved to an entirely different situation. Although specific approaches have been introduced to solve this problem, they can take between 3 and 10 minutes per game of runtime, may skip certain events, and potentially shuffle the order of events [@VanRoy2024; @Kim2025]. `DataBallPy` allows for a state of the art synchronisation algorithm that ensures the synchronisation of all events in the right order within a few seconds [@Oonk2025b] in just one line of code. They showed that the expected goals model decreased in Brier loss from 0.096 to 0.082 (lower is better) when using the synchronisation in `DataBallPy` compared to a naive timestamp synchronisation. Similarly, the feature importance of features that relied on combined tracking and event data information was close to 0 in the timestamp synchronisation model, which was not the case for the `DataBallPy` synchronisation model [@Oonk2025b]. @@ -57,11 +57,11 @@ The features and functionalities in `DataBallPy` can be categorised into five ca ## Parsing Data -The core goal of parsing data in `DataBallPy` is obtaining a `Game` object. `DataBallPy` allows for parsing data from different commercial data providers such as Tracab, Metrica, Inmotio, Opta, Instat, SciSports, Sportec, and Statsbomb internally using the `get_game` function. The `Game` object contains the event and tracking data internally as Pandas dataframes, making them intuitive to work with [@reback2020pandas]. Alternatively, one can use [Kloppy](https://kloppy.pysport.org/) to parse data from more providers and use the `get_game_from_kloppy` function to transform the Kloppy event and tracking datasets into a `Game` object. 
Last, `DataBallPy` has included a function to load openly available data directly in a `Game` object using `get_open_game`, which allows users who do not have access to data to still work with soccer data in `DataBallPy` [@Bassek2025]. Since parsing and the analytical pipeline of soccer data takes time and resources, `DataBallPy` can also save your processed `Game` object. Normally, raw tracking and event data together can take up to 400 MB per game. 'DataBallPy' downscales this to less than 20 MB per game and can be reloaded by using the `get_saved_game` function. +The core goal of parsing data in `DataBallPy` is obtaining a `Game` object. `DataBallPy` allows for parsing data from different commercial data providers such as Tracab, Metrica, Inmotio, Opta, Instat, SciSports, Sportec, and Statsbomb internally using the `get_game` function. The `Game` object contains the event and tracking data internally as Pandas dataframes, making them intuitive to work with [@reback2020pandas]. Alternatively, one can use [Kloppy](https://kloppy.pysport.org/) to parse data from differnt providers and use the `get_game_from_kloppy` function to transform the Kloppy event and tracking datasets into a `Game` object. Last, `DataBallPy` has included a function to load openly available data directly in a `Game` object using `get_open_game`, which allows users who do not have access to data to still work with soccer data in `DataBallPy` [@Bassek2025]. Since parsing and the analytical pipeline of soccer data takes time and resources, `DataBallPy` can also save your processed `Game` object. Normally, raw tracking and event data together can take up to 400 MB per game. 'DataBallPy' downscales this to less than 20 MB per game. The games saved in this format can be reloaded by using the `get_saved_game` function. ## Preprocessing -Tracking data is often captured via video footage. 
Depending on the quality and number of cameras, some noise is present in both the athlete and ball positions [@Linke2020]. `DataBallPy` allows for filtering of the tracking data, differentiation of positions to compute velocity and acceleration. Furthermore, the tracking data allows for computation of individual athlete possession [@Vidal-Codina2022], and together with the event data, team-level possession can be estimated. Since the combination of parsing and preprocessing a single game of data can take anywhere between 30 seconds and a few minutes on a standard device (which is similar to other packages), `DataBallPy` also allows you to efficiently save the preprocessed `Game` object as parquet and JSON files. This has two main benefits. First, using the `get_saved_game` function, you can now obtain a preprocessed game object in milliseconds instead of minutes, and second, raw tracking data files can be up to 400 MB per game, while the saved `DataBallPy` Game objects that include both event and tracking data are generally between 20 and 100 MB of memory. +Tracking data is often captured via video footage using computer vision. Depending on the quality and number of cameras, some noise is present in both the athlete and ball positions [@Linke2020]. `DataBallPy` allows for filtering of the tracking data, differentiation of positions to compute velocity and acceleration. Furthermore, the tracking data allows for computation of individual athlete possession [@Vidal-Codina2022], and together with the event data, team-level possession can be estimated. Since the combination of parsing and preprocessing a single game of data can take anywhere between 30 seconds and a few minutes on a standard device (which is similar to other packages), `DataBallPy` also allows one to efficiently save the preprocessed `Game` object as parquet and JSON files. This has two main benefits. 
First, using the `get_saved_game` function, you can now obtain a preprocessed game object in milliseconds instead of minutes, and second, raw tracking data files can be up to 400 MB per game, while the saved `DataBallPy` Game objects that include both event and tracking data are generally between 20 and 100 MB of memory. ## Synchronisation @@ -121,4 +121,4 @@ plt.show() -# References \ No newline at end of file +# References From 4f0a31a1467a8488a8b593171a5ceabd7547307d Mon Sep 17 00:00:00 2001 From: Alek050 Date: Wed, 21 Jan 2026 12:39:36 +0100 Subject: [PATCH 2/2] Resolved typos and reformulated some sentences --- joss-paper/paper.bib | 42 ++++++++++++++++-------------------------- joss-paper/paper.md | 10 +++++----- 2 files changed, 21 insertions(+), 31 deletions(-) diff --git a/joss-paper/paper.bib b/joss-paper/paper.bib index fbf7c85..2064192 100644 --- a/joss-paper/paper.bib +++ b/joss-paper/paper.bib @@ -48,7 +48,7 @@ @article{vanArem2025 number={16}, journal={Applied Sciences}, publisher={MDPI AG}, - author={van Arem, Koen and Goes-Smit, Floris and Söhl, Jakob}, + author={Koen van Arem and Floris Goes-Smit and Jakob Söhl}, year={2025}, month=aug, pages={8916} } @@ -72,10 +72,10 @@ @article{Bassek2025 @article{Bauer2023, abstract = {Choosing the right formation is one of the coach's most important decisions in football. Teams change formation dynamically throughout matches to achieve their immediate objective: to retain possession, progress the ball up-field and create (or prevent) goal-scoring opportunities. In this work we identify the unique formations used by teams in distinct phases of play in a large sample of tracking data. This we achieve in two steps: first, we train a convolutional neural network to decompose each game into non-overlapping segments and classify these segments into phases with an average F 1-score of 0.76. We then measure and contextualize unique formations used in each distinct phase of play. 
While conventional discussion tends to reduce team formations over an entire match to a single three-digit code (e.g. 4-4-2; 4 defender, 4 midfielder, 2 striker), we provide an objective representation of team formations per phase of play. Using the most frequently occurring phases of play, mid-block, we identify and contextualize six unique formations. A long-term analysis in the German Bundesliga allows us to quantify the efficiency of each formation, and to present a helpful scouting tool to identify how well a coach's preferred playing style is suited to a potential club.}, - author = {P Bauer and G Anzer and L Shaw - Journal of sports analytics and undefined 2023}, + author = {Pascal Bauer and Gabriel Anzer and Luke Shaw}, doi = {10.3233/JSA-220620}, issue = {1}, - journal = {journals.sagepub.comP Bauer, G Anzer, L ShawJournal of sports analytics, 2023•journals.sagepub.com}, + journal = {Journal of Sports Analytics}, keywords = {Association football,human-in-the-loop machine learning,soccer,sports analytics}, month = {3}, pages = {39-59}, @@ -99,7 +99,7 @@ @article{Bischofberger2025 @inproceedings{Fernandez2018, - author = {Javier Fernández and F C Barcelona and Javier Fernandez and Luke Bornn}, + author = {Javier Fernandez and Luke Bornn}, booktitle = {Sloan Sports Analytics Conference}, title = {Wide Open Spaces: A statistical technique for measuring space creation in professional soccer}, url = {https://www.researchgate.net/publication/324942294_Wide_Open_Spaces_A_statistical_technique_for_measuring_space_creation_in_professional_soccer}, @@ -124,9 +124,9 @@ @article{Fernandez2021 @article{Forcher2022, abstract = {Recently, the availability of big amounts of data enables analysts to dive deeper into the constraints of performance in various team sports. While offensive analyses in football have been extensively conducted, the evaluation of defensive performance is underrepresented in this sport.
Hence, the aim of this study was to analyze successful defensive playing phases by investigating the space and time characteristics of defensive pressure. Therefore, tracking and event data of 153 games of the German Bundesliga (second half of 2020/21 season) were assessed. Defensive pressure was measured in the last 10 seconds of a defensive playing sequence (time characteristic) and it was distinguished between pressure on the ball-carrier, pressure on the group (5 attackers closest to the ball), and pressure on the whole team (space characteristic). A linear mixed model was applied to evaluate the effect of success of a defensive play (ball gain), space characteristic, and time characteristic on defensive pressure. Defensive pressure is higher in successful defensive plays (14.47 ± 16.82[%]) compared to unsuccessful defensive plays (12.87 ± 15.31[%]). The characteristics show that defensive pressure is higher in areas closer to the ball (space characteristic) and the closer the measurement is to the end of a defensive play (time characteristic), which is especially true for successful defensive plays. Defensive pressure is a valuable key performance indicator for defensive play. 
Further, this study shows that there is an association between the pressing of the ball-carrier and areas close to the ball with the success of defensive play.}, - author = {L Forcher and L Forcher and S Altmann and D Jekauc - Science and Medicine … and undefined 2022}, + author = {Leander Forcher and Leon Forcher and Stefan Altmann and Darko Jekauc and Matthias Kempe}, doi = {10.1080/24733938.2022.2158213}, - journal = {Taylor \& FrancisL Forcher, L Forcher, S Altmann, D Jekauc, M KempeScience and Medicine in Football, 2022•Taylor \& Francis}, + journal = {Science and Medicine in Football}, keywords = {defensive behavior,machine learning,match analysis,performance analysis,team sports}, publisher = {Taylor and Francis Ltd.}, title = {The keys of pressing to gain the ball–Characteristics of defensive pressure in elite soccer using tracking data}, @@ -135,10 +135,10 @@ @article{Forcher2022 } @article{Goes2020a, abstract = {In professional soccer, increasing amounts of data are collected that harness great potential when it comes to analysing tactical behaviour. Unlocking this potential is difficult as big data challenges the data management and analytics methods commonly employed in sports. By joining forces with computer science, solutions to these challenges could be achieved, helping sports science to find new insights, as is happening in other scientific domains. We aim to bring multiple domains together in the context of analysing tactical behaviour in soccer using position tracking data. A systematic literature search for studies employing position tracking data to study tactical behaviour in soccer was conducted in seven electronic databases, resulting in 2338 identified studies and finally the inclusion of 73 papers. Each domain clearly contributes to the analysis of tactical behaviour, albeit in-sometimes radically-different ways.
Accordingly, we present a multidisciplinary framework where each domain's contributions to feature construction, modelling and interpretation can be situated. We discuss a set of key challenges concerning the data analytics process, specifically feature construction, spatial and temporal aggregation. Moreover, we discuss how these challenges could be resolved through multidisciplinary collaboration, which is pivotal in unlocking the potential of position tracking data in sports analytics.}, - author = {F R Goes and L A Meerhoff and M J O Bueno and D M Rodrigues and F A Moura and M S Brink and M T Elferink-Gemser and A J Knobbe and S A Cunha and R S Torres and K A P M Lemmink}, + author = {Floris Goes and L A Meerhoff and M J O Bueno and D M Rodrigues and F A Moura and M S Brink and M T Elferink-Gemser and A J Knobbe and S A Cunha and R S Torres and K A P M Lemmink}, doi = {10.1080/17461391.2020.1747552}, issue = {4}, - journal = {Taylor \& FrancisFR Goes, LA Meerhoff, MJO Bueno, DM Rodrigues, FA Moura, MS BrinkEuropean Journal of Sport Science, 2021•Taylor \& Francis}, + journal = {European Journal of Sport Science}, keywords = {Football,big data,performance analysis,tactical analysis,team sport}, pages = {481-496}, publisher = {Taylor and Francis Ltd.}, @@ -149,11 +149,11 @@ @article{Goes2020a } @article{Goes2020b, abstract = {Association football teams can be considered complex dynamical systems of individuals grouped in subgroups (defenders, midfielders and attackers), coordinating their behaviour to achieve a shared g...}, - author = {Floris R Goes and Michel S Brink and Marije T Elferink-Gemser and Matthias Kempe and Koen A P M Lemmink}, + author = {Floris Goes and Michel Brink and Marije Elferink-Gemser and Matthias Kempe and Koen A P M Lemmink}, doi = {10.1080/02640414.2020.1834689}, issn = {1466447X}, issue = {5}, - journal = {https://doi.org/10.1080/02640414.2020.1834689}, + journal = {Journal of Sports Sciences}, keywords = {Soccer,Spatiotemporal,machine 
learning,subgroups,tactics}, pages = {523-532}, pmid = {33106106}, @@ -168,7 +168,7 @@ @article{Goes2021 author = {Floris Goes and Edgar Schwarz and Marije Elferink-Gemser and Koen Lemmink and Michel Brink}, doi = {10.1080/24733938.2021.1944660}, issue = {3}, - journal = {Taylor \& Francis}, + journal = {Science and Medicine in Football}, keywords = {football,risk-taking behaviour,spatiotemporal behaviour,tactical behaviour,time-motion analysis}, pages = {372-380}, publisher = {Taylor and Francis Ltd.}, @@ -199,7 +199,7 @@ @article{Hader2019 @article{Herold2022, abstract = {This study describes an approach to evaluate the off-ball behaviour of attacking players in association football. The aim was to implement a defensive pressure model to examine an offensive player’s ability to create separation from a defender using 1411 high-intensity off-ball actions including 988 Deep Runs (DRs) DRs and 423 Change of Directions (CODs). Twenty-two official matches (14 competitive matches and 8 friendlies) of the German National Team were included in the research. To validate the effectiveness of the pressure model, each pass (n = 25,418) was evaluated for defensive pressure on the receiver at the moment of the pass and for the pass completion rate (R = −.34, p < .001). Next, after assessing the inter-rater reliability (Fleiss Kappa of 80 for DRs and 78 for CODs), three expert raters annotated all DRs and CODs that met the pre-set criteria. A time-series analysis of each DR and COD was calculated to the nearest 0.1 second, finding a slight increase in pressure from the start to the end of the off-ball actions as defenders re-established proximity to the attacker after separation was created. 
A linear mixed model using run type (DR or COD) as a fixed effect with the local maximum as a fixed effect on a continuous scale resulted in p < 0.001, d = 4.81, CI = 0.63 to 0.67 for the greatest decrease in pressure, p < 0.001, d = 0.143, CI = 9.18 to 10.61 for length of the longest decrease in pressure, and p < 0.001, d = 1.13, CI = 0.90 to 1.11 for the fastest rate of decrease in pressure. As these values pertain to the local maximum, situations with greater starting pressure on the attacker often led to greater subsequent decreases. Furthermore, there was a significant (p < .0001) difference between offensive and defensive positions and the number of off-ball actions. Results suggest the model can be applied to quantify and visualise the pressure exerted on non-ball-possessing players. This approach can be combined with other methods of match analysis, providing practitioners with new opportunities to measure tactical performance in football.}, - author = {Mat Herold and A Hecksteden and D Radke and F Goes and S Nopp and T Meyer and M Kempe}, + author = {Mat Herold and Anne Hecksteden and D Radke and F Goes and S Nopp and T Meyer and M Kempe}, doi = {10.1080/02640414.2022.2081405}, issn = {1466447X}, issue = {12}, @@ -280,7 +280,7 @@ @article{Link2016 @article{Oonk2025a, abstract = {The field of football (soccer) has seen a recent increase in the utilisation of data, mainly for the analysis of physical and tactical performance. Analysis of tactical performance can be conducted...}, - author = {G. A. Oonk and T. J.W. Buurke and K. A.P.M. Lemmink and M. Kempe}, + author = {Gerard Alexander Oonk and Tom J.W. Buurke and Koen A.P.M. Lemmink and Matthias Kempe}, doi = {10.1080/02640414.2025.2555117}, issn = {1466447X}, journal = {Journal of Sports Sciences}, @@ -293,18 +293,8 @@ @article{Oonk2025a } @inproceedings{Oonk2025b, - abstract = {Valuable new insights can be obtained by combining tracking and event data in soccer anal- -ysis. 
However, how to synchronize the two data streams, is rarely discussed. Non systematic -errors in the timestamps, and synchronizing with cost functions result in suboptimal synchro- -nization, which hinders further analysis. Within this proceedings we will introduce a com- -putationally optimized implementation of the Needleman-Wunch algorithm, by using domain -knowledge about the game. The optimized version is over 70 times more efficient in terms of -time constraints and memory usage. On top of that, we show that the properly synchronized -approach translates back to practice with better performing xG models. Taken together, this im- -plementation is a training-free, high-quality synchronization algorithm, with low computational -cost that solves existing issues. On top of that, all data and code used for this proceedings is -fully open-sourced and available in the DataBallPy package.}, - author = {G.A. Oonk and D. Grob and M. Kempe}, + abstract = {Valuable new insights can be obtained by combining tracking and event data in soccer analysis. However, how to synchronize the two data streams, is rarely discussed. Non systematic errors in the timestamps, and synchronizing with cost functions result in suboptimal synchronization, which hinders further analysis. Within this proceedings we will introduce a computationally optimized implementation of the Needleman-Wunsch algorithm, by using domain knowledge about the game. The optimized version is over 70 times more efficient in terms of time constraints and memory usage. On top of that, we show that the properly synchronized approach translates back to practice with better performing xG models. Taken together, this implementation is a training-free, high-quality synchronization algorithm, with low computational cost that solves existing issues.
On top of that, all data and code used for this proceedings is fully open-sourced and available in the DataBallPy package.}, + author = {Gerard Alexander Oonk and Daan Grob and Matthias Kempe}, city = {Luxembourg}, editor = {D. Goossens}, booktitle = {MathSports Conference}, @@ -322,7 +312,7 @@ @article{Raabe2022 volume = {7}, number = {76}, pages = {4588}, -author = {Raabe, Dominik and Biermann, Henrik and Bassek, Manuel and Wohlan, Martin and Komitova, Rumena and Rein, Robert and Groot, Tobias Kuppens and Memmert, Daniel}, +author = {Dominik Raabe and Henrik Biermann and Manuel Bassek and Martin Wohlan and Rumena Komitova and Robert Rein and Tobias Kuppens Groot and Daniel Memmert}, title = {floodlight - A high-level, data-driven sports analytics framework}, journal = {Journal of Open Source Software} } diff --git a/joss-paper/paper.md b/joss-paper/paper.md index c114842..d367066 100644 --- a/joss-paper/paper.md +++ b/joss-paper/paper.md @@ -35,16 +35,16 @@ bibliography: paper.bib # Summary -Over the last decade, there has been a growing interest in soccer analytics from different backgrounds and for differnt use cases. Examplary use cases are : first, practical decision making and benchmarking of players based on aggregated metrics such as pass success percentage and expected goals (xG) [@Goes2020a]. Second, using internal and external load metrics for training periodization and injury predictions [@Hader2019]. Third, basic behavioural science soccer with a focus on group and subgroup behaviour[@Goes2020b]. The interest in soccer analysis as data has become more openly available [Bassek2025]. However, a key challenge is that every data provider uses their own data format, which makes it hard to compare and switch between different providers and create large datasets that encompass differnt leagues and competitions. Currently, open-source packages like [Kloppy](https://kloppy.pysport.org) try to overcome this challenge by providing a uniform data format. 
Similarly, the scientific side proposes a common data format for soccer game data [@Anzer2025]. While Kloppy focuses primarily on parsing soccer data, Floodlight [@Raabe2022] delivers a framework for physical analysis of team sports, and [mplsoccer](https://github.com/andrewRowlinson/mplsoccer) is widely utilized for visualising soccer data. +Over the last decade, there has been a growing interest in soccer analytics from different backgrounds and for different use cases. Three exemplary use cases stand out. First, practical decision making and benchmarking of players based on aggregated metrics such as pass success percentage and expected goals (xG) [@Goes2020a]. Second, the use of internal and external load metrics for training periodization and injury prediction [@Hader2019]. Third, basic behavioural science in soccer, with a focus on group and subgroup behaviour [@Goes2020b]. The interest in soccer analysis has also increased since data has become more openly available [@Bassek2025]. However, a key challenge is that every data provider uses its own data format, which makes it hard to compare and switch between providers and to create large datasets that encompass different leagues and competitions. Currently, open-source packages like [Kloppy](https://kloppy.pysport.org) try to overcome this challenge by providing a uniform data format. Similarly, a common data format for soccer game data has been proposed in the scientific literature [@Anzer2025]. While Kloppy focuses primarily on parsing soccer data, Floodlight [@Raabe2022] delivers a framework for physical analysis of team sports, and [mplsoccer](https://github.com/andrewRowlinson/mplsoccer) is widely utilized for visualising soccer data. -Lately, there has been a growing interest in combining event and tracking data for contextualised tactical analysis of soccer games.
This provides the possibility to not only know that a pass happened at a specific moment in the match (event data) but also what the defensive structure was during this pass [@Forcher2022; Herold2022], and what other passing options were available at this moment (tracking data) [@Spearman2017]. Contextual analysis goes beyond aggregated metrics and provides the ability to do quantitative analysis of single moments or specific phases in the game [@Oonk2025a; Jerome2024]. A key challenge for this is in merging the event data and tracking data together. [`DataBallPy`](https://databallpy.readthedocs.io/en/latest/) is an open source python package for contextual analysis of soccer games because (1) it uses a standardazed data format for both event and tracking data, (2) it provides a framework where all data of a game is bundled, instead of considered as seperate data objects, (3) it includes a high quality synchronsiation algorithm that works on any combination of tracking and event data providers, and (4) it has integrated multiple practical and scientific features within the package that allow for efficient computation with minimal user input. +Lately, there has been a growing interest in combining event and tracking data for contextualised tactical analysis of soccer games. This provides the possibility to not only know that a pass happened at a specific moment in the match (event data) but also what the defensive structure was during this pass [@Forcher2022; @Herold2022], and what other passing options were available at this moment (tracking data) [@Spearman2017]. Contextual analysis goes beyond aggregated metrics and provides the ability to do quantitative analysis of single moments or specific phases in the game [@Oonk2025a; @Jerome2024]. Merging tracking and event data is a key challenge for contextualised analysis of soccer games. 
[`DataBallPy`](https://databallpy.readthedocs.io/en/latest/) is an open-source Python package for contextual analysis of soccer games because (1) it uses a standardized data format for both event and tracking data, (2) it provides a framework where all data of a game is bundled, instead of considered as separate data objects, (3) it includes a high-quality, learning-free synchronisation algorithm that works on any combination of tracking and event data providers, and (4) it has integrated multiple practical and scientific features within the package that allow for efficient computation with minimal user input. # Statement of need Modern soccer analytics increasingly rely on both event data and tracking data for a comprehensive analysis. Event data captures specific information about events (e.g., passes and shots) like their location, success, start location, and the athlete involved in the action. This information on itself is primarily aggregated for tactical game and player analysis [@Goes2020a] but is also widely used in scouting because of the low cost and widespread availability of the data [@vanArem2025]. Tracking data, on the other hand, captures spatiotemporal information of all athletes and the ball at frequencies ranging between 10 and 25 Hz [@Linke2020]. This data is primarily used to quantify physical performance, but also for the detection of dynamic formation [@sotudeh2025], detection of events[@Vidal-Codina2022], detection of game phases [@Bauer2023], space occupation [@Spearman2017; Rein2017], and quantification of dangerousity [@Link2016]. -The currently avaiable packages allow for parsing [Kloppy](https://kloppy.pysport.org) and analysis [@Raabe2022] of either data stream independently. However, there has been a growing interest in combining event and tracking data to enrich event information with spatiotemporal context. 
This added context provides insights and nuances, primarily on a tactical level, that neither event nor tracking data can provide independently. For example, shot events are enriched with information about defensive and keeper positioning to create better expected goals models [@Anzer2021], passes are evaluated by making risk reward assessments of all possible passing options [@Goes2021], determinants of successful 1v1 actions are modelled from spatiotemporal features [@Oonk2025a], and the spatiotemporal context of events is used to predict dangerousity of a game state [@Fernandez2021]. A contextual analysis requires a proper synchronisation of event and tracking data, and a convenient data structure for further analysis. Current packages either have a separation between event and tracking data with limited options to combine them [@Raabe2022], or focus only on the synchronistation approach, limiting the convenient data structure to start your analysis after merging the data streams [@VanRoy2024; @Kim2025] +The currently available packages allow for parsing ([Kloppy](https://kloppy.pysport.org)) and analysis [@Raabe2022] of either data stream independently. However, there has been a growing interest in combining event and tracking data to enrich event information with spatiotemporal context. This added context provides insights and nuances, primarily on a tactical level, that neither event nor tracking data can provide independently. For example, shot events are enriched with information about defensive and keeper positioning to create better expected goals models [@Anzer2021], passes are evaluated by making risk reward assessments of all possible passing options [@Goes2021], determinants of successful 1v1 actions are modelled from spatiotemporal features [@Oonk2025a], and the spatiotemporal context of events is used to predict dangerousity of a game state [@Fernandez2021]. 
A contextual analysis requires a proper synchronisation of event and tracking data, and a convenient data structure for further analysis. Current packages either have a separation between event and tracking data with limited options to combine them [@Raabe2022], or focus only on the synchronisation approach, limiting the convenient data structure to start your analysis after merging the data streams [@VanRoy2024; @Kim2025]. `DataBallPy` addresses this gap by combining all game-related data in a standardized `Game` object. The `Game` object includes event, tracking, and metadata. The primary feature of `DataBallPy` is the robust and efficient synchronistation between event and tracking data. Although event and tracking data often both provide timestamps, their alignment has shown to be extremely poor with reported errors of 1.82 (+-4.06) seconds [@Anzer2021]. Especially, the random error is concerning since it does not allow for easy correction, and within 4 seconds, the game might have evolved to an entirely different situation. Although specific approaches have been introduced to solve this problem, they can take between 3 and 10 minutes per game of runtime, may skip certain events, and potentially shuffle the order of events [@VanRoy2024; @Kim2025]. `DataBallPy` allows for a state of the art synchronisation algorithm that ensures the synchronisation of all events in the right order within a few seconds [@Oonk2025b] in just one line of code. They showed that the expected goals model decreased in Brier loss from 0.096 to 0.082 (lower is better) when using the synchronisation in `DataBallPy` compared to a naive timestamp synchronisation. Similarly, the feature importance of features that relied on combined tracking and event data information was close to 0 in the timestamp synchronisation model, which was not the case for the `DataBallPy` synchronisation model [@Oonk2025b]. 
@@ -57,11 +57,11 @@ The features and functionalities in `DataBallPy` can be categorised into five ca ## Parsing Data -The core goal of parsing data in `DataBallPy` is obtaining a `Game` object. `DataBallPy` allows for parsing data from different commercial data providers such as Tracab, Metrica, Inmotio, Opta, Instat, SciSports, Sportec, and Statsbomb internally using the `get_game` function. The `Game` object contains the event and tracking data internally as Pandas dataframes, making them intuitive to work with [@reback2020pandas]. Alternatively, one can use [Kloppy](https://kloppy.pysport.org/) to parse data from differnt providers and use the `get_game_from_kloppy` function to transform the Kloppy event and tracking datasets into a `Game` object. Last, `DataBallPy` has included a function to load openly available data directly in a `Game` object using `get_open_game`, which allows users who do not have access to data to still work with soccer data in `DataBallPy` [@Bassek2025]. Since parsing and the analytical pipeline of soccer data takes time and resources, `DataBallPy` can also save your processed `Game` object. Normally, raw tracking and event data together can take up to 400 MB per game. 'DataBallPy' downscales this to less than 20 MB per game. The games saved in this format can be reloaded by using the `get_saved_game` function. +The core goal of parsing data in `DataBallPy` is obtaining a `Game` object. `DataBallPy` allows for parsing data from different commercial data providers such as Tracab, Metrica, Inmotio, Opta, Instat, SciSports, Sportec, and Statsbomb internally using the `get_game` function. The `Game` object contains the event and tracking data internally as Pandas dataframes, making them intuitive to work with [@reback2020pandas]. 
Alternatively, one can use [Kloppy](https://kloppy.pysport.org/) to parse data from different providers and use the `get_game_from_kloppy` function to transform the Kloppy event and tracking datasets into a `Game` object. Last, `DataBallPy` has included a function to load openly available data directly in a `Game` object using `get_open_game`, which allows users who do not have access to data to still work with soccer data in `DataBallPy` [@Bassek2025]. Since the combination of parsing and (pre)processing a single game of data can take anywhere between 30 seconds and a few minutes on a standard device (which is similar to other packages), `DataBallPy` also allows one to efficiently save the preprocessed `Game` object as parquet and JSON files. This has two main benefits. First, using the `get_saved_game` function, you can now obtain a preprocessed game object in milliseconds instead of minutes, and second, raw tracking data files can be up to 400 MB per game, while the saved `DataBallPy` Game objects that include both event and tracking data are generally between 20 and 100 MB of memory. ## Preprocessing -Tracking data is often captured via video footage using computer vision. Depending on the quality and number of cameras, some noise is present in both the athlete and ball positions [@Linke2020]. `DataBallPy` allows for filtering of the tracking data, differentiation of positions to compute velocity and acceleration. Furthermore, the tracking data allows for computation of individual athlete possession [@Vidal-Codina2022], and together with the event data, team-level possession can be estimated. Since the combination of parsing and preprocessing a single game of data can take anywhere between 30 seconds and a few minutes on a standard device (which is similar to other packages), `DataBallPy` also allows one to efficiently save the preprocessed `Game` object as parquet and JSON files. This has two main benefits. 
First, using the `get_saved_game` function, you can now obtain a preprocessed game object in milliseconds instead of minutes, and second, raw tracking data files can be up to 400 MB per game, while the saved `DataBallPy` Game objects that include both event and tracking data are generally between 20 and 100 MB of memory. +Tracking data is often captured via video footage using computer vision. Depending on the quality and number of cameras, some noise is present in both the athlete and ball positions [@Linke2020]. `DataBallPy` allows for filtering of the tracking data and differentiation of positions to compute velocity and acceleration. Furthermore, the tracking data allows for computation of individual athlete possession [@Vidal-Codina2022], and together with the event data, team-level possession can be estimated. ## Synchronisation