{"id":510,"date":"2018-10-01T01:11:51","date_gmt":"2018-09-30T23:11:51","guid":{"rendered":"http:\/\/35.180.88.53\/?p=510"},"modified":"2018-10-01T01:11:52","modified_gmt":"2018-09-30T23:11:52","slug":"cleaning-and-preparing-data-in-python","status":"publish","type":"post","link":"https:\/\/www.sergilehkyi.com\/es\/2018\/10\/cleaning-and-preparing-data-in-python\/","title":{"rendered":"Cleaning and Preparing Data in Python"},"content":{"rendered":"\n<p>Data Science sounds like something cool and awesome. It&#8217;s pictured as something cool and awesome. It is the sexiest job of the 21st century, as we all know (I won&#8217;t even add the link to that article :D). All the cool terms belong to this field &#8211; Machine Learning, Deep Learning, AI, Neural Networks, algorithms, models&#8230; <\/p>\n\n\n\n<p>But all this is just the tip of the iceberg. 70-80% of our work is data preprocessing, data cleaning, data transformation, data reprocessing &#8211; all the boring steps that make our data suitable for the model that will perform some modern magic.<\/p>\n\n\n\n<p>And today I would like to list the methods and functions that can help us clean and prepare the data.<\/p>\n\n\n\n<p>So what can be wrong with our data? 
A lot of things, actually:<\/p>\n\n\n\n<ul><li>Irrelevant column names<\/li><li>Outliers<\/li><li>Duplicates<\/li><li>Missing data<\/li><li>Columns that have to be processed<\/li><li>Unexpected data values<\/li><\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Exploratory analysis<\/h4>\n\n\n\n<p>For a quick overview we can use the following methods and attributes of a DataFrame:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df.head() # show first 5 rows<br\/>df.tail() # last 5 rows<br\/>df.columns # list all column names<br\/>df.shape # get number of rows and columns<br\/>df.info() # additional info about the dataframe<br\/>df.describe() # statistical description, only for numeric columns<br\/>df['col_name'].value_counts(dropna=False) # count unique values in a column<br\/><\/pre>\n\n\n\n<p>The output of at least one of these will give us the first clues about where to start our cleaning.<\/p>\n\n\n\n<p>Another way to quickly check the data is to visualize it. We use bar plots for discrete data counts and histograms for continuous data.\u00a0<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df['col_name'].plot(kind='hist')<br\/>df.boxplot(column='col_name1', by='col_name2')<\/pre>\n\n\n\n<p>Histograms and box plots help to spot outliers visually. A scatter plot shows the relationship between two numeric variables.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df.plot(kind='scatter', x='col1', y='col2')<br\/><\/pre>\n\n\n\n<p>Visualizing data can bring some unexpected results, as it gives you a perspective of what&#8217;s going on in your dataset &#8211; a kind of view from the top.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Tidy data<\/h4>\n\n\n\n<p>Tidy data is data obtained as a result of a process called data tidying. It is one of the important cleaning processes in big data work and a recognized step in the practice of data science. Tidy data sets have structure and are easy to work with; they\u2019re easy to manipulate, model and visualize. 
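As a minimal sketch of those exploratory calls, here they are on a toy DataFrame (the frame and its column names are made up for illustration):

```python
import pandas as pd

# Toy DataFrame standing in for a real dataset; all names are invented
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", None],
    "score": [10.0, 12.5, 10.0, None],
})

print(df.head())        # first rows of the frame
print(df.shape)         # (4, 2) -> 4 rows, 2 columns
print(df.describe())    # summary stats for the numeric 'score' column
print(df["name"].value_counts(dropna=False))  # 'Ann' twice, 'Bob' once, NaN once
```

Run on a fresh dataset, these usually point straight at suspicious columns: unexpected counts, impossible minimums or maximums, missing values.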
The main concept of tidy data is to arrange data so that each variable is a column and each observation (or case) is a row.<\/p>\n\n\n\n<p>Tidy data provides standards and concepts for data cleaning, and with tidy data there\u2019s no need to start from scratch and reinvent methods for data cleaning.<\/p>\n\n\n\n<p>Characteristics:<\/p>\n\n\n\n<ul><li>Each variable you measure should be in one column.<\/li><li>Each different observation of that variable should be in a different row.<\/li><li>There should be one table for each &#8220;kind&#8221; of variable.<\/li><li>If you have multiple tables, they should include a column that allows them to be linked.<\/li><li><em>(all these taken from Wikipedia)<\/em><\/li><\/ul>\n\n\n\n<p>To transform our data and make it tidy we can use melting.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">pd.melt() # transform columns to rows<\/pre>\n\n\n\n<p>There are two parameters you should be aware of: <em>id_vars<\/em> and <em>value_vars<\/em>. The <em>id_vars<\/em> are the columns you do not want to melt (i.e., they keep their current shape), while the <em>value_vars<\/em> are the columns you wish to melt into rows. By default, if no <em>value_vars<\/em> are provided, all columns not listed in <em>id_vars<\/em> will be melted.<br\/><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">new_df = pd.melt(df, id_vars='do not melt')<\/pre>\n\n\n\n<p>The opposite operation to melting is pivoting: here we turn unique values into separate columns. We use it when we want to transform our data from analysis shape to reporting shape, from an easy-machine-readable to an easy-human-readable form.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df.pivot(index='date', columns='element', values='value')<\/pre>\n\n\n\n<p>However, this method cannot handle duplicate values. 
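A quick illustration of that failure, with made-up duplicate readings for the same date and element:

```python
import pandas as pd

# Two values for the ('2018-01-01', 'tmax') pair -> .pivot() cannot decide which to keep
df = pd.DataFrame({
    "date": ["2018-01-01", "2018-01-01", "2018-01-02"],
    "element": ["tmax", "tmax", "tmax"],
    "value": [30.1, 30.3, 29.8],
})

try:
    df.pivot(index="date", columns="element", values="value")
except ValueError as err:
    print("pivot failed:", err)  # the index contains duplicate entries
```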
When that is the case we should use <em>.pivot_table()<\/em>, which has an additional parameter &#8211; <em>aggfunc<\/em> &#8211; that handles those duplicates with a function we provide (sum, count, mean, min, max or a user-defined function).\u00a0<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df.pivot_table(index='date', columns='element', values='value', aggfunc='mean')<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Parsing data<\/h4>\n\n\n\n<p>Sometimes data is stored inconveniently. For example, gender and age groups stored as &#8216;m014&#8217;, &#8216;f014&#8217;, &#8216;m1528&#8217;, &#8216;f1528&#8217;. You cannot say it&#8217;s completely wrong, but it would be better to split these values into &#8216;gender&#8217; and &#8216;age_group&#8217; columns. To do this we use Python slicing syntax, accessing the .str attribute of a column of object type.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df['gender'] = df.variable.str[0]\ndf['age_group'] = df.variable.str[1:]<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Concatenating data<\/h4>\n\n\n\n<p>Data might also not come in one huge file but be split into a few different chunks, so we have to be able to concatenate all that data and clean it, or clean the first sample and then apply the same process to the remaining parts. To do this we can use the pandas <em>.concat<\/em> function, which, given a list of dataframes, will concatenate them all. By default it keeps the original indexes, which results in duplicate index values. To prevent this we reset the index of the new dataframe by passing the additional parameter <em>ignore_index=True<\/em>.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">concatenated = pd.concat([df1, df2], ignore_index=True)<\/pre>\n\n\n\n<p>But what if we have thousands of files? It would be dumb to import them one by one, clean them and repeat it all again. And we are not stupid, we know loops in Python. The only missing part is finding all those files to import. 
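Coming back to the concatenation step for a moment, the duplicate-index behaviour is easy to see on two tiny made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({"value": [1, 2]})
df2 = pd.DataFrame({"value": [3, 4]})

# Default: original indexes are kept, so labels 0 and 1 appear twice
dup = pd.concat([df1, df2])
print(dup.index.tolist())    # [0, 1, 0, 1]

# ignore_index=True rebuilds a clean 0..n-1 index
clean = pd.concat([df1, df2], ignore_index=True)
print(clean.index.tolist())  # [0, 1, 2, 3]
```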
We can do this with the glob module. The process is the following: write a pattern, save all matching files into a list, iterate over the csv files, import each file and concatenate the dataframes into one. It doesn&#8217;t seem that difficult, but with a code sample it&#8217;s much clearer:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Import necessary modules<br\/>import glob<br\/>import pandas as pd<br\/><br\/># Write the pattern: pattern<br\/>pattern = '*.csv'<br\/><br\/># Save all file matches: csv_files<br\/>csv_files = glob.glob(pattern)<br\/><br\/># Create an empty list: frames<br\/>frames = []<br\/><br\/># Iterate over csv_files<br\/>for csv in csv_files:<br\/>    # Read csv into a DataFrame: df<br\/>    df = pd.read_csv(csv)<br\/>    # Append df to frames<br\/>    frames.append(df)<br\/><br\/># Concatenate frames into a single DataFrame: final_df<br\/>final_df = pd.concat(frames, ignore_index=True)<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Merging data<\/h4>\n\n\n\n<p>Merging is the same as a SQL join operation: you combine two or more tables into one on a key that is present in each table. Joins can be one-to-one, one-to-many or many-to-many. In SQL this process is explicit, with a lot of options and modifications, where you specify exactly what and how you want to join. Here, one function does it all; by default it performs an inner join, and whether it behaves as one-to-one or one-to-many depends on the data in the dataframes. If the key column has the same name in both dataframes, the <em>on<\/em> parameter is used; if the names differ, use <em>left_on<\/em> and <em>right_on<\/em>.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">merged = pd.merge(left=df1, right=df2, on='key')<br\/>merged = pd.merge(left=df1, right=df2, left_on='col1', right_on='col2')<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Converting data types<\/h4>\n\n\n\n<p>It is very important to have data in the correct data types, as later it may play a bad joke on you and your analysis. Remember how once I didn&#8217;t convert one column to the correct data type and spent an hour trying to subtract a float from a string :D. So be careful :D. 
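That mistake can be reproduced and fixed in a few lines; the column name and values below are made up:

```python
import pandas as pd

df = pd.DataFrame({"price": ["10.5", "7.2", "bad_value"]})

# df["price"] - 1.0 at this point raises TypeError: the column holds strings
df["price"] = pd.to_numeric(df["price"], errors="coerce")  # 'bad_value' becomes NaN

print(df["price"] - 1.0)  # arithmetic works now; NaN stays NaN
```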
Mixed types like these are a typical error in data, which can be fixed with <em>pd.to_numeric()<\/em>; with <em>errors='coerce'<\/em> it will convert all erroneous values into NaNs.<\/p>\n\n\n\n<p>To convert data we can use the <em>.astype()<\/em> method on a series. Also keep in mind the &#8216;category&#8217; type &#8211; it reduces the size of a dataframe and makes computations faster. We can convert any column whose values work as categories &#8211; days of the week, gender, continent abbreviations &#8211; depending on the context.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df['column1'] = df['column1'].astype(str)\ndf['column1'] = df['column1'].astype('category')<br\/>df['column1'] = pd.to_numeric(df['column1'], errors='coerce')<br\/><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Duplicates and missing values<\/h4>\n\n\n\n<p>Our favourite part, isn&#8217;t it? To drop duplicates we can use (druuuuuuuuuums&#8230;) the drop_duplicates() method.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df = df.drop_duplicates()<\/pre>\n\n\n\n<p>With missing values it is a little bit more complicated. In general there are three ways to deal with missing data:<\/p>\n\n\n\n<ul><li>leave as-is<\/li><li>drop them<\/li><li>fill missing values<\/li><\/ul>\n\n\n\n<p>To drop missing values we can use .dropna(), but be careful with it &#8211; it may delete up to 50% of your data, which is not really good. But again, it depends on the context.<\/p>\n\n\n\n<p>To fill missing values we use .fillna(); be careful with it as well &#8211; the values we fill in have to be reasonable and make sense.<\/p>\n\n\n\n<p>There is a wonderful article on handling missing data and I really have nothing to add on this part. 
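Still, the two basic moves can be sketched on toy data (the column and the fill strategy are only an example):

```python
import pandas as pd

df = pd.DataFrame({"temp": [21.0, None, 23.0, None]})

dropped = df.dropna()                   # keep only complete rows
filled = df.fillna(df["temp"].mean())   # fill gaps with the column mean (22.0)

print(len(dropped))             # 2
print(filled["temp"].tolist())  # [21.0, 22.0, 23.0, 22.0]
```

Whether a mean, a median, a constant, or nothing at all is "reasonable" depends entirely on what the column means.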
Here is the <a href=\"https:\/\/towardsdatascience.com\/how-to-handle-missing-data-8646b18db0d4\">link<\/a>, and don&#8217;t forget to give the author a round of applause, because that work is just amazing.\u00a0<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Asserts<\/h4>\n\n\n\n<p>We can also check our data programmatically using assert statements. An assert returns nothing if the condition is True and raises an AssertionError otherwise.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">assert 1 == 1 # passes silently\nassert 1 == 2 # raises AssertionError<br\/>assert df.notnull().all().all() # raises an error if any column has at least one missing value<br\/><\/pre>\n\n\n\n<p>I skipped regular expressions as they deserve a separate article, but otherwise this summarizes the tools I use for data preprocessing. Please let me know if something else can be added, have a great day and use data science for good \ud83d\ude42<\/p>\n\n\n\n<p>Featured Photo by\u00a0<a href=\"https:\/\/unsplash.com\/photos\/9ZRbXlx9zQ4?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Artem Bali<\/a>\u00a0on\u00a0<a href=\"https:\/\/unsplash.com\/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Unsplash<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data Science sounds like something cool and awesome. It&#8217;s pictured as something cool and awesome. 
It is the sexiest job&hellip;<\/p>\n","protected":false},"author":1,"featured_media":511,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,5],"tags":[],"translation":{"provider":"WPGlobus","version":"3.0.0","language":"es","enabled_languages":["gb","es","uk"],"languages":{"gb":{"title":true,"content":true,"excerpt":false},"es":{"title":false,"content":false,"excerpt":false},"uk":{"title":false,"content":false,"excerpt":false}}},"_links":{"self":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts\/510"}],"collection":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/comments?post=510"}],"version-history":[{"count":1,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts\/510\/revisions"}],"predecessor-version":[{"id":512,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts\/510\/revisions\/512"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/media\/511"}],"wp:attachment":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/media?parent=510"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/categories?post=510"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/tags?post=510"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}