{"id":543,"date":"2018-10-28T19:08:50","date_gmt":"2018-10-28T18:08:50","guid":{"rendered":"http:\/\/35.180.88.53\/?p=543"},"modified":"2018-10-28T19:48:41","modified_gmt":"2018-10-28T18:48:41","slug":"tips-on-working-with-datetime-index-in-pandas","status":"publish","type":"post","link":"https:\/\/www.sergilehkyi.com\/es\/2018\/10\/tips-on-working-with-datetime-index-in-pandas\/","title":{"rendered":"Tips on Working with Datetime Index in pandas"},"content":{"rendered":"\n<p>As you may understand from the title it is not a complete guide on Time Series or Datetime data type in Python. So if you expect to get in-depth explanation from A to Z it&#8217;s a wrong place. Seriously. There is a fantastic <a href=\"https:\/\/jakevdp.github.io\/PythonDataScienceHandbook\/03.11-working-with-time-series.html\">article<\/a> on this topic, well explained, detailed and quite straightforward. Don&#8217;t waste your time on this one.<\/p>\n\n\n\n<p>For those who have reached this part I will tell that you will find something useful here for sure. Again, seriously. I found my notes on Time Series and decided to organize it into a little article with general tips, which are aplicable, I guess, in 80 to 90% of times you work with dates. So it&#8217;s worth sharing, isn&#8217;t it?<\/p>\n\n\n\n<p>I have a dataset with air pollutants measurements for every hour since 2016 in Madrid, so I will use it as an example.\u00a0<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Importing data<\/h3>\n\n\n\n<p>By default pandas will use the first column as index while importing csv file with <code>read_csv()<\/code>, so if your datetime column isn&#8217;t first you will need to specify it explicitly <code>index_col='date'.<\/code><\/p>\n\n\n\n<p>The beauty of pandas is that it can preprocess your datetime data during import. By\u00a0specifying <code>parse_dates=True<\/code> pandas will try parsing the index, if we pass list of ints or names e.g. 
if [1, 2, 3] &#8211; it will try parsing columns 1, 2, 3 each as a separate date column,\u00a0list of lists e.g. if [[1, 3]] &#8211; combine columns 1 and 3 and parse as a single date column,\u00a0dict, e.g. {\u2018foo\u2019 : [1, 3]} &#8211; parse columns 1, 3 as date and call result \u2018foo\u2019. If you are using other method to import data you can always use\u00a0<code>pd.to_datetime<\/code>\u00a0after it.<\/p>\n\n\n\n<p>I have imported my data using the following code:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import pandas as pd<br>import glob<br><br>pattern = 'data\/madrid*.csv'<br>csv_files = glob.glob(pattern)<br><br>frames = []<br><br>for csv in csv_files:<br>    df = pd.read_csv(csv, index_col='date', parse_dates=True)<br>    frames.append(df)<br><br>df = pd.concat(frames)<br>df.head()<br><br>Out[4]: <br>                     BEN  CH4   CO  EBE  NMHC     NO   NO_2  NOx  O_3  PM10  <br>date                                                                          <br>2016-11-01 01:00:00  NaN  NaN  0.7  NaN   NaN  153.0   77.0  NaN  NaN   NaN   <br>2016-11-01 01:00:00  3.1  NaN  1.1  2.0  0.53  260.0  144.0  NaN  4.0  46.0   <br>2016-11-01 01:00:00  5.9  NaN  NaN  7.5   NaN  297.0  139.0  NaN  NaN   NaN   <br>2016-11-01 01:00:00  NaN  NaN  1.0  NaN   NaN  154.0  113.0  NaN  2.0   NaN   <br>2016-11-01 01:00:00  NaN  NaN  NaN  NaN   NaN  275.0  127.0  NaN  2.0   NaN   <\/pre>\n\n\n\n<p>The data is gathered from 24 different stations about 14 different pollutants. We are not going to analyze this data, and to make it little bit simpler we will choose only one station, two pollutants and remove all NaN values (DANGER! 
please, do not repeat it at home).<\/p>\n\n\n\n<ul class=\"wp-block-gallery columns-1 is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex\"><li class=\"blocks-gallery-item\"><figure><img loading=\"lazy\" decoding=\"async\" width=\"568\" height=\"321\" src=\"http:\/\/35.180.88.53\/wp-content\/uploads\/2018\/10\/one-does-not-simply-1.jpg\" alt=\"\" data-id=\"545\" data-link=\"http:\/\/35.180.88.53\/?attachment_id=545\" class=\"wp-image-545\" srcset=\"https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2018\/10\/one-does-not-simply-1.jpg 568w, https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2018\/10\/one-does-not-simply-1-300x170.jpg 300w\" sizes=\"(max-width: 568px) 100vw, 568px\" \/><\/figure><\/li><\/ul>\n\n\n\n<pre class=\"wp-block-preformatted\">df_time = df[['O_3', 'PM10']][df['station'] == 28079008].dropna()\ndf_time.head()\nOut[9]: \n                     O_3  PM10\ndate                          \n2016-11-01 01:00:00  4.0  46.0\n2016-11-01 02:00:00  4.0  37.0\n2016-11-01 03:00:00  4.0  31.0\n2016-11-01 04:00:00  5.0  31.0\n2016-11-01 05:00:00  6.0  27.0<\/pre>\n\n\n\n<p>Now that we have our data prepared, we can play with the Datetime Index.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Selecting values<\/h3>\n\n\n\n<p>Although the default pandas datetime format is ISO8601 (&#8220;yyyy-mm-dd hh:mm:ss&#8221;), when selecting data using partial string indexing it understands a lot of other formats. 
For example:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><strong>df_time.loc['2016-11-01'].head()<\/strong><br>Out[17]: <br>                     O_3  PM10<br>date                          <br>2016-11-01 01:00:00  4.0  46.0<br>2016-11-01 02:00:00  4.0  37.0<br>2016-11-01 03:00:00  4.0  31.0<br>2016-11-01 04:00:00  5.0  31.0<br>2016-11-01 05:00:00  6.0  27.0<br><br><strong>df_time.loc['November 1, 2016'].head()<\/strong><br>Out[18]: <br>                     O_3  PM10<br>date                          <br>2016-11-01 01:00:00  4.0  46.0<br>2016-11-01 02:00:00  4.0  37.0<br>2016-11-01 03:00:00  4.0  31.0<br>2016-11-01 04:00:00  5.0  31.0<br>2016-11-01 05:00:00  6.0  27.0<br><br><strong>df_time.loc['2016-Nov-1'].head()<\/strong><br>Out[19]: <br>                     O_3  PM10<br>date                          <br>2016-11-01 01:00:00  4.0  46.0<br>2016-11-01 02:00:00  4.0  37.0<br>2016-11-01 03:00:00  4.0  31.0<br>2016-11-01 04:00:00  5.0  31.0<br>2016-11-01 05:00:00  6.0  27.0<\/pre>\n\n\n\n<p>All produce the same output. So we are free to use whatever is more comfortable for us. 
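<\/p>\n\n\n\n<p>To see that the different spellings really select the same rows, here is a minimal, self-contained sketch on synthetic data (a toy hourly frame, not the Madrid dataset):<\/p>

```python
import pandas as pd

# Toy frame with an hourly DatetimeIndex (synthetic data, for illustration only)
idx = pd.date_range('2016-11-01', periods=48, freq='h')
toy = pd.DataFrame({'O_3': range(48)}, index=idx)

# All three partial-string formats select the same day
a = toy.loc['2016-11-01']
b = toy.loc['November 1, 2016']
c = toy.loc['2016-Nov-1']
assert a.equals(b) and b.equals(c)
print(len(a))  # 24 rows, one per hour of November 1st
```

<p>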
We can also select data for an entire month:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><strong>df_time.loc['2016-11'].head()<\/strong><br>Out[23]: <br>                     O_3  PM10<br>date                          <br>2016-11-01 01:00:00  4.0  46.0<br>2016-11-01 02:00:00  4.0  37.0<br>2016-11-01 03:00:00  4.0  31.0<br>2016-11-01 04:00:00  5.0  31.0<br>2016-11-01 05:00:00  6.0  27.0<br><br><strong>df_time.loc['2016-11'].count()<\/strong><br>Out[24]: <br>O_3     715<br>PM10    715<br>dtype: int64<\/pre>\n\n\n\n<p>The same works if we want to select an entire year:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><strong>df_time.loc['2016'].head()<\/strong><br>Out[31]: <br>                     O_3  PM10<br>date                          <br>2016-11-01 01:00:00  4.0  46.0<br>2016-11-01 02:00:00  4.0  37.0<br>2016-11-01 03:00:00  4.0  31.0<br>2016-11-01 04:00:00  5.0  31.0<br>2016-11-01 05:00:00  6.0  27.0<br><br><strong>df_time.loc['2016'].count()<\/strong><br>Out[32]: <br>O_3     8720<br>PM10    8720<br>dtype: int64<\/pre>\n\n\n\n<p>If we want to slice the data and find records for some specific period of time, we continue to use the <code>loc<\/code> accessor; all the rules are the same as for a regular index:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><strong>df_time.loc['2017-11-02 23:00' : '2017-12-01'].head()<\/strong><br>Out[34]: <br>                     O_3  PM10<br>date                          <br>2017-11-02 23:00:00  5.0  30.0<br>2017-11-03 00:00:00  5.0  25.0<br>2017-11-03 01:00:00  5.0  12.0<br>2017-11-03 02:00:00  6.0   8.0<br>2017-11-03 03:00:00  7.0  14.0<br><br><strong>df_time.loc['2017-11-02 23:00' : '2017-12-01'].count()<\/strong><br>Out[35]: <br>O_3     690<br>PM10    690<br>dtype: int64<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Resampling<\/h2>\n\n\n\n<p>Pandas has a simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). 
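<\/p>\n\n\n\n<p>As a quick illustration of that frequency conversion, here is a minimal sketch on synthetic data (one hour of fake per-second observations, nothing to do with the Madrid dataset):<\/p>

```python
import numpy as np
import pandas as pd

# One hour of synthetic per-second observations
idx = pd.date_range('2016-01-01', periods=3600, freq='s')
ticks = pd.Series(np.arange(3600.0), index=idx)

# Downsample to 5-minute bins, reducing each bin with its mean
five_min = ticks.resample('5min').mean()
print(len(five_min))     # 12 bins: 3600 s / 300 s
print(five_min.iloc[0])  # 149.5, the mean of the first 300 values
```

<p>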
This is extremely common in, but not limited to, financial applications.<\/p>\n\n\n\n<p><code>resample()<\/code> is a time-based groupby, followed by a reduction method on each of its groups.<\/p>\n\n\n\n<p>The <code>resample<\/code> function is very flexible and allows us to specify many different parameters to control the frequency conversion and resampling operation. <code>sum<\/code>, <code>mean<\/code>, <code>std<\/code>, <code>sem<\/code>, <code>max<\/code>, <code>min<\/code>, <code>median<\/code>, <code>first<\/code>, <code>last<\/code> and <code>ohlc<\/code> are available as methods of the object returned by <code>resample()<\/code>.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Converting hourly data into monthly data<br><strong>df_time.resample('M').mean().head()<\/strong><br>Out[46]: <br>                  O_3       PM10<br>date                            <br>2016-01-31  21.871622  19.990541<br>2016-02-29  32.241679  25.853835<br>2016-03-31  44.234014  16.952381<br>2016-04-30  46.845938  12.189076<br>2016-05-31  53.136671  14.671177<\/pre>\n\n\n\n<p>When upsampling, we can specify how to fill or interpolate over the gaps that are created:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Converting hourly data into 10-minutely data<br><strong>df_time.resample('10Min').mean().head()<\/strong><br>Out[46]: <br>                     O_3  PM10<br>date                          <br>2016-01-01 01:00:00  8.0  17.0<br>2016-01-01 01:10:00  NaN   NaN<br>2016-01-01 01:20:00  NaN   NaN<br>2016-01-01 01:30:00  NaN   NaN<br>2016-01-01 01:40:00  NaN   NaN<br><br><strong>df_time.resample('10Min').mean().ffill().head()<\/strong><br>Out[47]: <br>                     O_3  PM10<br>date                          <br>2016-01-01 01:00:00  8.0  17.0<br>2016-01-01 01:10:00  8.0  17.0<br>2016-01-01 01:20:00  8.0  17.0<br>2016-01-01 01:30:00  8.0  17.0<br>2016-01-01 01:40:00  8.0  17.0<\/pre>\n\n\n\n<p>We 
can use the following methods to fill the NaN values: <em>\u2018pad\u2019, \u2018backfill\u2019, \u2018ffill\u2019, \u2018bfill\u2019, \u2018nearest\u2019<\/em>. More details on this can be found in the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/generated\/pandas.core.resample.Resampler.fillna.html\">documentation<\/a>. Or we can do it using interpolation with the following methods: <em>\u2018linear\u2019, \u2018time\u2019, \u2018index\u2019, \u2018values\u2019, \u2018nearest\u2019, \u2018zero\u2019, \u2018slinear\u2019, \u2018quadratic\u2019, \u2018cubic\u2019, \u2018barycentric\u2019, \u2018krogh\u2019, \u2018polynomial\u2019, \u2018spline\u2019, \u2018piecewise_polynomial\u2019, \u2018from_derivatives\u2019, \u2018pchip\u2019, \u2018akima\u2019<\/em>. And again, a deeper explanation of this can be found in the <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/generated\/pandas.core.resample.Resampler.interpolate.html#pandas.core.resample.Resampler.interpolate\">pandas docs.<\/a><\/p>\n\n\n\n<p>And here is a table of resampling frequencies:<\/p>\n\n\n\n<table class=\"wp-block-table aligncenter is-style-stripes\"><tbody><tr><td><strong>Input<\/strong><\/td><td><strong>Description<\/strong><\/td><\/tr><tr><td>min, T<\/td><td>minute<\/td><\/tr><tr><td>H<\/td><td>hour<\/td><\/tr><tr><td>D<\/td><td>day<\/td><\/tr><tr><td>B<\/td><td>business day<\/td><\/tr><tr><td>W<\/td><td>week<\/td><\/tr><tr><td>M<\/td><td>month<\/td><\/tr><tr><td>Q<\/td><td>quarter<\/td><\/tr><tr><td>A<\/td><td>year<\/td><\/tr><\/tbody><\/table>\n\n\n\n<h3 class=\"wp-block-heading\">Visualization<\/h3>\n\n\n\n<p>Another awesome feature of the Datetime Index is its simplicity in plotting: matplotlib will automatically treat it as the x axis, so we don&#8217;t need to specify anything explicitly.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import matplotlib.pyplot as plt<br>import seaborn as sns<br>sns.set()<br>df_plot = df_time.resample('M').mean()<br><strong>plt.plot(df_plot)<\/strong><br>plt.title('Air pollution 
by O3 and PM10')<br>plt.ylabel('micrograms per cubic meter (\u00b5g\/m3)')<br>plt.xticks(rotation=45)<br>plt.show()<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" src=\"http:\/\/35.180.88.53\/wp-content\/uploads\/2018\/10\/o3-and-pm10-pollution.png\" alt=\"\" class=\"wp-image-547\" srcset=\"https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2018\/10\/o3-and-pm10-pollution.png 640w, https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2018\/10\/o3-and-pm10-pollution-300x225.png 300w\" sizes=\"(max-width: 640px) 100vw, 640px\" \/><\/figure>\n\n\n\n<p>As promised in the beginning: a few tips that help in the majority of situations when working with datetime data. For me it was one more refresher and organizer of thoughts that converts into knowledge. Everybody wins. Some will find it useful, some might not (I gave a warning in the first paragraph :D), but actually I expect everyone reading this to find something useful.<\/p>\n\n\n\n<p>This is the most exciting feature of knowledge: when you share it, you don&#8217;t lose anything, you only gain. Writing an article requires some research, some verification, some learning; basically, you end up with even more knowledge.<\/p>\n\n\n\n<p>Knowledge is just a tool. And it&#8217;s your responsibility to apply it or not. At the end of the day it doesn&#8217;t matter how much you know, it&#8217;s about how you use that knowledge. 
But that&#8217;s already another story&#8230;<\/p>\n\n\n\n<p>Thank you for reading, have an incredible week, learn, spread the knowledge, use it wisely and use it for good deeds \ud83d\ude42<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As you may understand from the title it is not a complete guide on Time Series or Datetime data type&hellip;<\/p>\n","protected":false},"author":1,"featured_media":548,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,5],"tags":[],"translation":{"provider":"WPGlobus","version":"3.0.0","language":"es","enabled_languages":["gb","es","uk"],"languages":{"gb":{"title":true,"content":true,"excerpt":false},"es":{"title":false,"content":false,"excerpt":false},"uk":{"title":false,"content":false,"excerpt":false}}},"_links":{"self":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts\/543"}],"collection":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/comments?post=543"}],"version-history":[{"count":4,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts\/543\/revisions"}],"predecessor-version":[{"id":553,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts\/543\/revisions\/553"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/media\/548"}],"wp:attachment":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/media?parent=543"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/categories?post=543"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/tag
s?post=543"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}