Data Science in Action: Analyzing Air Pollution (CO) in Madrid

In recent years, high pollution levels during certain dry periods in Madrid have forced the authorities to take measures against the use of cars in the city center, and have been used as a reason to propose drastic changes to the city’s urban planning. Thanks to Madrid City Council’s Open Data website, the air quality data is publicly available. There are several data sets, including daily and hourly historical data on the pollution levels registered from 2001 to 2018, and the list of stations used to analyze pollution and other particles in the city.

The dataset is huge, so I decided to focus my analysis on a single pollutant: carbon monoxide (CO). The data is hourly, for each of 24 different stations, for each day from 2001 to 2018, although the data on this pollutant is missing for 2002 and for 2006-2010.

To get a general picture, I computed the mean value for each day, averaged over all 24 stations. This process took about an hour on my laptop. Plotting the resulting data already lets us draw some conclusions.
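For reference, here is a minimal sketch of that daily averaging step using pandas. It is not the code behind the plot: the column names `date`, `station` and `CO`, the function name `daily_mean_co` and the tiny inline sample are all assumptions made for illustration. A vectorized groupby like this usually runs much faster than looping over rows.

```python
import pandas as pd

# Hypothetical sample: hourly CO readings for two stations.
# Column names and values are placeholders, not the real Madrid data.
readings = pd.DataFrame({
    "date": pd.to_datetime([
        "2001-01-01 01:00", "2001-01-01 02:00",
        "2001-01-01 01:00", "2001-01-02 01:00",
    ]),
    "station": [1, 1, 2, 2],
    "CO": [1.0, 3.0, 2.0, 4.0],
})

def daily_mean_co(df):
    """Average CO over all stations and hours for each calendar day."""
    day = df["date"].dt.normalize()  # drop the time of day, keep the date
    # mean() skips missing (NaN) readings by default
    return df.groupby(day)["CO"].mean()

daily = daily_mean_co(readings)
```

The result is a Series indexed by date, with one mean CO value per day, which can be passed straight to a plotting call.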

As we can see from the above plot, the measures taken have had a positive impact on carbon monoxide levels in the city. Over the years, the mean level of this pollutant has decreased drastically.

As I don’t usually work with air quality data and don’t understand its mechanics, I wondered why the plot has this shape of repeating triangles. The graph makes it clear that pollution is higher in some months and lower in others. I wanted to identify those periods.

To find out, I located the max and min values for each year and put them into separate tables. I also put those values on a plot. Honestly, I didn’t get any valuable information from it, but it helps to explain a little trick. When I first plotted those values, the graph looked like this:

You see, because of the huge difference between the max and min values, we cannot see any trend in the min values: they look like an almost straight line. We can improve this by changing the “Y” scale to logarithmic.

plt.yscale('log')
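In context, the call might be used as in the sketch below. This is an assumption about how such a plot could be set up, not the original plotting code, and the yearly values are placeholders chosen only to show the difference in scale, not measured Madrid figures.

```python
import matplotlib.pyplot as plt

# Placeholder yearly extremes (NOT real measurements).
years = [2001, 2003, 2004, 2005]
max_values = [5.0, 4.0, 3.5, 3.0]
min_values = [0.20, 0.15, 0.12, 0.10]

fig, ax = plt.subplots()
ax.plot(years, max_values, marker="o", label="yearly max")
ax.plot(years, min_values, marker="o", label="yearly min")

# Switch the y-axis of the current axes to a logarithmic scale so the
# small min values are no longer squashed against the bottom of the plot.
plt.yscale("log")

ax.set_xlabel("year")
ax.set_ylabel("mean CO")
ax.legend()
# Call plt.show() here in an interactive session to display the figure.
```

On a log axis, equal vertical distances correspond to equal ratios, so a halving of the min values looks as large as a halving of the max values.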

So, back to our min and max values for each year. Below you will see the tables that summarize them. (Months are given as numbers; I hope you understand that 1 is January and 8 is August, although next time I will write a function that translates those numbers into human language :D)
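Tables like these, with the months already translated, could be built along the following lines. This is a minimal sketch under stated assumptions: the input is a pandas Series of daily means indexed by date, the names `month_name` and `yearly_extremes` are hypothetical, and the sample values are placeholders rather than the real results.

```python
import calendar
import pandas as pd

def month_name(number):
    """Translate a month number (1-12) into its name."""
    if not 1 <= number <= 12:
        raise ValueError(f"month must be in 1..12, got {number}")
    # calendar.month_name follows the current locale (English by default)
    return calendar.month_name[int(number)]

def yearly_extremes(series):
    """Return one row per year with the max and min value and their months."""
    by_year = series.groupby(series.index.year)
    max_dates = by_year.idxmax()  # date of the highest value in each year
    min_dates = by_year.idxmin()  # date of the lowest value in each year
    return pd.DataFrame({
        "max": by_year.max(),
        "max_month": max_dates.dt.month.map(month_name),
        "min": by_year.min(),
        "min_month": min_dates.dt.month.map(month_name),
    })

# Hypothetical daily mean CO values indexed by date (placeholder numbers).
daily = pd.Series(
    [2.0, 0.5, 0.3, 1.5, 0.4, 0.2],
    index=pd.to_datetime([
        "2001-01-15", "2001-04-10", "2001-08-20",
        "2003-12-05", "2003-03-01", "2003-07-30",
    ]),
)

extremes = yearly_extremes(daily)
```

Grouping by `series.index.year` keeps years with missing data out of the table automatically, since they simply have no group.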

I was actually surprised that the maximum pollution falls in the winter months and the minimum in the summer. So I used my problem-solving algorithm to find an answer, and here is what I found.

Some sources of pollution, like industrial emissions, stay fairly constant throughout the year, no matter what the season. But roaring fireplaces and wood stoves and idling vehicles in the winter all add up to higher levels of particulate matter (the particles that make up smoke) and carbon monoxide (from vehicle emissions).

On top of this, cold temperatures and stagnant air have a way of creating a build-up of these substances near the ground, particularly during a weather phenomenon called a temperature inversion. In other seasons or weather conditions, warm air sits near the ground and the air can rise easily and carry away pollutants. In a temperature inversion, cold air is trapped near the ground by a layer of warm air. The warm air acts like a lid, holding these substances down. During a temperature inversion, smoke can’t rise and carbon monoxide can reach unhealthy levels. From an air quality perspective, storms are a welcome weather event. Wind, rain and snow storms are sometimes called scrubbers because they help clear out and disperse substances of concern.

More detailed info can be found here – “Why Is Air Pollution Worse During Winter?” and here – “How Cold Weather Affects Air Quality”

As someone who lives in the city, I thought that air pollution was worse in summer, but the data says the complete opposite and I cannot argue with it. That’s why I love Data Science, that’s why I love what I do: because data never lies. Yes, you, as a human, can make an incorrect interpretation or an error in the code that distorts the results, but nonetheless, data. never. lies.