Scraping xG Data for Almost Any League in the World

It’s all about knowing where to look, not about writing super complex crawlers.

While russia continues to destroy my country, and since I have already done a lot today to help our people fight back, it is time to switch off a little and have some fun with data.

The xG metric isn’t new in the world of football anymore. You can find it on almost every football web portal, but normally only for the top championships, so it is hard to gather enough data to build a dataset and run your own analysis on it.

After some research I found a nice database with xG data for 40 different competitions, starting from the 2016 season (though not every league goes back that far). This dataset is free and available to anyone. All you have to do is download it.

The dataset is prepared and maintained by the FiveThirtyEight website, which collects a lot of different data from a lot of different sources in a lot of different fields. Luckily for us, they also collect football data (or, as they call it, soccer… oh, those Americans…) and use it to predict match outcomes. A friend of mine in the betting business says those predictions are quite useless. Either way, they developed their own SPI rating, use it to rank clubs based on a set of metrics, and then use those ratings to predict the outcomes of future games. More about their method here.

So, to download the dataset you have to run the following code:
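A minimal sketch of the download step, assuming pandas is installed and the file is still served at the URL below. Both the URL and the helper name `load_spi_matches` are assumptions for illustration, so check FiveThirtyEight’s data page if the request fails.

```python
import pandas as pd

# Assumed location of the file; it may have moved since this was written.
SPI_MATCHES_URL = (
    "https://projects.fivethirtyeight.com/soccer-api/club/spi_matches.csv"
)


def load_spi_matches(source=SPI_MATCHES_URL):
    """Read spi_matches.csv from a URL, a local path, or a file-like object."""
    # The "date" column name is an assumption about the CSV header.
    return pd.read_csv(source, parse_dates=["date"])
```

Calling `load_spi_matches()` with no arguments fetches the file over the network; passing a local path works the same way if you prefer to save a copy first.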

The file spi_matches.csv contains match-by-match SPI ratings and forecasts by FiveThirtyEight back to 2016. There are 3 more datasets available:

The beauty of this dataset is that it has the xG number for every team in each match, which means we can do whatever we want with these data points.

For example, I was looking for the average xG_Against (the xG a team allowed its opponents to generate) for Championship teams.

First, I filter the data to Championship games only and strip all the unnecessary columns.
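A sketch of that filtering step. The league label `"English League Championship"` and the column names (`league`, `season`, `date`, `team1`, `team2`, `xg1`, `xg2`) are assumptions about the CSV, so verify them against the header of your copy of the file.

```python
import pandas as pd

# Assumed league label and column names; check them against the real file.
CHAMPIONSHIP = "English League Championship"
KEEP_COLUMNS = ["season", "date", "team1", "team2", "xg1", "xg2"]


def championship_matches(matches: pd.DataFrame) -> pd.DataFrame:
    """Keep only Championship rows and the columns needed for xG analysis."""
    is_championship = matches["league"] == CHAMPIONSHIP
    return matches.loc[is_championship, KEEP_COLUMNS].reset_index(drop=True)
```

Swapping `CHAMPIONSHIP` for another value from the `league` column should give you the same slice for any other competition in the file.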

One team’s xG_Against is the other team’s xG and vice versa, so I add two xG_Against columns simply by swapping the values, and create home and away sub-datasets while we are at it.
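The swap can be sketched like this, assuming `team1`/`xg1` describe the home side and `team2`/`xg2` the away side. The output column names `team`, `xG_For` and `xG_Against` are illustrative choices, not names taken from the dataset.

```python
import pandas as pd


def split_home_away(matches: pd.DataFrame):
    """Return (home, away) frames with one row per team per match."""
    home = pd.DataFrame({
        "team": matches["team1"],
        "xG_For": matches["xg1"],
        # What the visitors created is what the hosts conceded.
        "xG_Against": matches["xg2"],
    })
    away = pd.DataFrame({
        "team": matches["team2"],
        "xG_For": matches["xg2"],
        # What the hosts created is what the visitors conceded.
        "xG_Against": matches["xg1"],
    })
    return home, away
```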

Depending on what you are looking for, you can go in any direction from here. I was looking for average xG numbers per game for each team. You might want the total xG for a season, a historical average, or a comparison of one team’s performance across seasons. That’s up to you; I am going for the means.
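A sketch of the averaging step, assuming a frame with one row per team per match and the hypothetical columns `team`, `xG_For` and `xG_Against`.

```python
import pandas as pd


def mean_xg_per_team(team_rows: pd.DataFrame) -> pd.DataFrame:
    """Average xG created and conceded per game, indexed by team."""
    return team_rows.groupby("team")[["xG_For", "xG_Against"]].mean()
```

Replacing `.mean()` with `.sum()` would give totals instead, and grouping by both a season column and `team` would let you compare one team across seasons.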

Now we have two datasets with every Championship team’s home and away performance, but it would be much nicer to have this data in one place, wouldn’t it? So let’s merge.
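One way to sketch the merge, assuming both frames are indexed by team and share the same column names (here the hypothetical `xG_For` and `xG_Against`). The `_Home` and `_Away` suffixes are illustrative.

```python
import pandas as pd


def merge_home_away(home_means: pd.DataFrame,
                    away_means: pd.DataFrame) -> pd.DataFrame:
    """Join per-team home and away averages on the team index."""
    # An outer join keeps teams that appear in only one of the two frames.
    return home_means.join(
        away_means, how="outer", lsuffix="_Home", rsuffix="_Away"
    )
```

The suffixes keep the overlapping column names apart, so each team ends up on a single row with its home numbers next to its away numbers.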

After the merge I perform a bit of clean-up and export the dataset to .csv to be used in my other football projects.
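What the clean-up involves depends on your data. As a sketch, rounding to two decimals and sorting by team are purely illustrative choices, and `export_team_xg` is a hypothetical helper name.

```python
import pandas as pd


def export_team_xg(team_xg: pd.DataFrame, path) -> pd.DataFrame:
    """Tidy a team-indexed xG table and write it to CSV."""
    tidy = (
        team_xg.round(2)        # two decimals are plenty for xG averages
        .sort_index()           # alphabetical by team
        .rename_axis("team")    # make sure the index has a column name
        .reset_index()
    )
    tidy.to_csv(path, index=False)
    return tidy
```

`path` can be a file name such as `"championship_xg.csv"` (a made-up example) or any writable file-like object.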

The full code can be found in notebooks-for-articles, my repo with scripts for articles, in the file 538_xG_data.py.

Again, the data in this file can be used for much more thorough and extensive analysis. I didn’t even check more than half of the columns! So give it a try, and I am pretty sure you will find something interesting 😉


Photo by Steven Wright on Unsplash
