{"id":1445,"date":"2022-07-13T22:33:20","date_gmt":"2022-07-13T20:33:20","guid":{"rendered":"https:\/\/www.sergilehkyi.com\/?p=1445"},"modified":"2022-07-13T22:46:07","modified_gmt":"2022-07-13T20:46:07","slug":"data-manipulation-tips-with-python-standard-library","status":"publish","type":"post","link":"https:\/\/www.sergilehkyi.com\/uk\/2022\/07\/data-manipulation-tips-with-python-standard-library\/","title":{"rendered":"Data Manipulation Tips With Python Standard Library"},"content":{"rendered":"\n<p>Pandas is cool and awesome, but what if you just need a small transformation of data, a one-off change, and you don&#8217;t want to install heavy things on your server? Or maybe you are just launching Lambdas (or any other serverless functions) that have limited resources? In such scenarios, it is much better to stick to Python&#8217;s standard library and not load the environment with unnecessary stuff.<\/p>\n\n\n\n<p>Also, knowing a few tricks will make your code much more readable and help you avoid redundancy. So, in this article, I will show you a few data manipulation tricks using only the standard library that might save you time and resources.<\/p>\n\n\n\n<p>We will work with the following <code>.csv<\/code> file, since CSV is one of the most common formats &#8211; a dummy shopping cart in a clothing store (you can just copy-paste it):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>name,color,category,price,quantity\nt-shirt,black,top,20,1\npants,white,bottom,50,1\nblazer,yellow,top,100,1\nt-shirt,red,top,15,2\nt-shirt,orange,top,25,1\nsneakers,white,footwear,100,1\nbracelet,green,accessories,5,3<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">1. Function enumerate()<\/h2>\n\n\n\n<p>This one is not about data manipulation yet, just a cool thing to consider. <code>enumerate()<\/code> wraps any iterable and yields an index along with every item, so you do not need a separate counter variable. It comes in handy especially when reading from files, as it lets you track the line numbers. 
Try this out:<\/p>\n\n\n\n<script src=\"https:\/\/gist.github.com\/slehkyi\/0efe6c9988eff2e0b09f602643ced218.js\"><\/script>\n\n\n\n<p>In the end, those 2 lines do not make a huge difference in performance, but the code looks cleaner and you avoid loose variables here and there.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. Reading .csv files without pandas<\/h2>\n\n\n\n<p>Reading .csv files with the standard file reader in Python might get complicated in some cases, as it returns the content of the file as a string and we need additional steps to parse that huge string. For these cases it is quite convenient to use the <code>csv<\/code> module. Its reader lets you choose a delimiter, quotechar, escapechar and a few other options that make .csv import much easier. Try this:<\/p>\n\n\n\n<script src=\"https:\/\/gist.github.com\/slehkyi\/481f5294bdd7d9f2daed953dabc3b9b5.js\"><\/script>\n\n\n\n<p>And you will get the following output:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2022\/07\/image-9.png\"><img loading=\"lazy\" decoding=\"async\" width=\"432\" height=\"167\" src=\"https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2022\/07\/image-9.png\" alt=\"\" class=\"wp-image-1448\" srcset=\"https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2022\/07\/image-9.png 432w, https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2022\/07\/image-9-300x116.png 300w\" sizes=\"(max-width: 432px) 100vw, 432px\" \/><\/a><\/figure>\n\n\n\n<p>One thing to take into account from this output is that every value is read as a string, so if you want to perform operations on those values you will have to convert the types.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. Collections<\/h2>\n\n\n\n<p>For the data manipulation part, we will use the <code>collections<\/code> module. 
From the documentation: &#8220;This module implements specialized container datatypes providing alternatives to Python\u2019s general-purpose built-in containers, <a href=\"https:\/\/docs.python.org\/3\/library\/stdtypes.html#dict\"><code>dict<\/code><\/a>, <a href=\"https:\/\/docs.python.org\/3\/library\/stdtypes.html#list\"><code>list<\/code><\/a>, <a href=\"https:\/\/docs.python.org\/3\/library\/stdtypes.html#set\"><code>set<\/code><\/a>, and <a href=\"https:\/\/docs.python.org\/3\/library\/stdtypes.html#tuple\"><code>tuple<\/code><\/a>.&#8221;<\/p>\n\n\n\n<p>If you have a dataset with clothing items and want to find the total price of t-shirts or count how many different t-shirts you have there, how would you do that? The <code>Counter()<\/code> class will help.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.1. Counter<\/h3>\n\n\n\n<p>Check out the snippet below and see the difference. The 1st version looks much better than the 3rd one, even though both are valid and do the same thing.<\/p>\n\n\n\n<script src=\"https:\/\/gist.github.com\/slehkyi\/dc7671a3af170d1e50f0a8061814bd7e.js\"><\/script>\n\n\n\n<p>Also, <code>Counter()<\/code> has a few handy methods like <code>most_common()<\/code> and <code>total()<\/code> (<code>total()<\/code> is available only in Python 3.10 and later).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&gt;&gt;&gt; total_clothes.most_common(3)\n&#91;('jacket', 100), ('t-shirt', 55), ('pants', 50)]\n&gt;&gt;&gt; total_clothes.total()\n220<\/code><\/pre>\n\n\n\n<p>Another cool thing that can be done using <code>Counter()<\/code> is finding the most common words in a text or the most common characters in a word. Let&#8217;s start with an easy one:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&gt;&gt;&gt; Counter('abracadabra').most_common(3) # that's an example from Python docs actually\n&#91;('a', 5), ('b', 2), ('r', 2)]<\/code><\/pre>\n\n\n\n<p>Now let&#8217;s use a common dataset for word counting &#8211; Shakespeare&#8217;s &#8216;King Lear&#8217;. 
This dataset is available in my <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/slehkyi\/notebooks-for-articles\" target=\"_blank\">repo for articles<\/a>, in the <code>\/data<\/code> folder. Check the following little program:<\/p>\n\n\n\n<script src=\"https:\/\/gist.github.com\/slehkyi\/3ed0b898f3dfe31f1cfb7e098a656d34.js\"><\/script>\n\n\n\n<p>And here we have all our words counted.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2022\/07\/image-10.png\"><img loading=\"lazy\" decoding=\"async\" width=\"162\" height=\"457\" src=\"https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2022\/07\/image-10.png\" alt=\"\" class=\"wp-image-1449\" srcset=\"https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2022\/07\/image-10.png 162w, https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2022\/07\/image-10-106x300.png 106w\" sizes=\"(max-width: 162px) 100vw, 162px\" \/><\/a><\/figure>\n\n\n\n<p>What have we done here? We read the file line by line, removed all the punctuation from each line (more about this later), split that line into words, created a <code>Counter()<\/code> for the words in each line and, in the end, added each per-line counter to a total counter. We can also check how many words there are in total:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&gt;&gt;&gt; count_words.total() # only if you have Python 3.10\n27762<\/code><\/pre>\n\n\n\n<p>Obviously, NLP libraries will tackle this more effectively, but hey, it is pretty awesome to know you can do that with the standard library and just a few lines of code.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.1.1. Removing all punctuation from the string<\/h3>\n\n\n\n<p>I found this one while writing the program to count words XD. Honestly. I was really looking for a regex pattern to apply for this transformation, but apparently the standard library, with its <code>string<\/code> module, already has quite a simple solution. 
<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>line = line.translate(str.maketrans('', '', string.punctuation))<\/code><\/pre>\n\n\n\n<p>What do we do here?<\/p>\n\n\n\n<ul><li><code>str.translate()<\/code> &#8211; returns a copy of a given string in which each character has been replaced using a mapping (translation) table<\/li><li><code>str.maketrans()<\/code> &#8211; creates a translation table. You can read in more detail how exactly it works in the Python <a rel=\"noreferrer noopener\" href=\"https:\/\/docs.python.org\/3\/library\/stdtypes.html#str.maketrans\" target=\"_blank\">docs<\/a>, but for this particular example we only need to know that every character in the third argument is mapped to <code>None<\/code>, or, for those like me who need plain language to understand things &#8211; it removes every character specified in the third argument.<\/li><li><code>string.punctuation<\/code> &#8211; a predefined string of all ASCII punctuation characters. You can check it in the docs or simply by running: <code>print(string.punctuation)<\/code><\/li><\/ul>\n\n\n\n<p>So basically we get a new line, but without punctuation and without regex. Now let&#8217;s get back to <code>collections<\/code>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.2. defaultdict<\/h3>\n\n\n\n<p>Now, if you don&#8217;t want to add up different values for the same keys but just group them to see what is going on there &#8211; <code>defaultdict<\/code> will take care of it:<\/p>\n\n\n\n<script src=\"https:\/\/gist.github.com\/slehkyi\/619f29dc5185373ecd5132532c4305fd.js\"><\/script>\n\n\n\n<p>And again, as you can see, that won&#8217;t work with a regular dict, unless you write that ugly workaround and check whether a key exists.<\/p>\n\n\n\n<p>The first argument of <code>defaultdict<\/code>, <code>default_factory<\/code>, is a callable that provides the default value for missing keys. In our first example it was <code>list<\/code>. But if we pass <code>int<\/code> there, we will be able to use it as a counter too. 
Try the following:<\/p>\n\n\n\n<script src=\"https:\/\/gist.github.com\/slehkyi\/ca4367d0e11a7c9fe23c2363efdbe24a.js\"><\/script>\n\n\n\n<p>But the usage of <code>Counter()<\/code> is much wider, so bear this in mind. For example, you cannot add two <code>defaultdict<\/code> objects together, like we did with the counters when counting words.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.3. deque<\/h3>\n\n\n\n<p>Imagine you have to keep a history of the last N elements or the last N lines of a file: <code>deque<\/code> is a perfect solution for this. Now you can run window functions without turning to complex SQL queries. It is a rather ugly example, but still food for thought. Imagine you have a dataset with the results of a given football team and you want to calculate their form in recent matches. A dummy dataset is available in the <code>data<\/code> folder. Check this out:<\/p>\n\n\n\n<script src=\"https:\/\/gist.github.com\/slehkyi\/4e308761ffc8eea8ab0a6e7c26d6a97d.js\"><\/script>\n\n\n\n<p>And the result:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2022\/07\/image-11.png\"><img loading=\"lazy\" decoding=\"async\" width=\"549\" height=\"494\" src=\"https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2022\/07\/image-11.png\" alt=\"\" class=\"wp-image-1452\" srcset=\"https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2022\/07\/image-11.png 549w, https:\/\/www.sergilehkyi.com\/wp-content\/uploads\/2022\/07\/image-11-300x270.png 300w\" sizes=\"(max-width: 549px) 100vw, 549px\" \/><\/a><\/figure>\n\n\n\n<p>Again, this one lacks some tuning, but I hope you get the general idea.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Last words<\/h2>\n\n\n\n<p>We&#8217;ve got so used to all these fancy packages that we forget the standard library completely. And the standard library still has some aces up its sleeve. 
So the next time you need to perform a quick data manipulation with 0 dependencies on external packages, try some of these tips.<\/p>\n\n\n\n<p>Hope you find it useful! Cheers!<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"has-text-align-center\">Photo by <a href=\"https:\/\/unsplash.com\/@andriyko?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Andriyko Podilnyk<\/a> on <a href=\"https:\/\/unsplash.com\/s\/photos\/cat?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Unsplash<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Pandas is cool and awesome, but what if you just need a small transformation of data, a punctual change and&hellip;<\/p>\n","protected":false},"author":1,"featured_media":1455,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13,5,17],"tags":[],"translation":{"provider":"WPGlobus","version":"3.0.0","language":"uk","enabled_languages":["gb","es","uk"],"languages":{"gb":{"title":true,"content":true,"excerpt":false},"es":{"title":false,"content":false,"excerpt":false},"uk":{"title":false,"content":false,"excerpt":false}}},"_links":{"self":[{"href":"https:\/\/www.sergilehkyi.com\/uk\/wp-json\/wp\/v2\/posts\/1445"}],"collection":[{"href":"https:\/\/www.sergilehkyi.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.sergilehkyi.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/uk\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/uk\/wp-json\/wp\/v2\/comments?post=1445"}],"version-history":[{"count":9,"href":"https:\/\/www.sergilehkyi.com\/uk\/wp-json\/wp\/v2\/posts\/1445\/revisions"}],"predecessor-version":[{"id":1458,"href":"https:\/\/www.sergilehkyi.com\/uk\/wp-json\/wp\/v2\/posts\/1445\/revisions\/1458"}],"wp:featuredmedia":[{"embeddable":
true,"href":"https:\/\/www.sergilehkyi.com\/uk\/wp-json\/wp\/v2\/media\/1455"}],"wp:attachment":[{"href":"https:\/\/www.sergilehkyi.com\/uk\/wp-json\/wp\/v2\/media?parent=1445"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/uk\/wp-json\/wp\/v2\/categories?post=1445"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/uk\/wp-json\/wp\/v2\/tags?post=1445"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}