{"id":651,"date":"2018-11-25T18:15:11","date_gmt":"2018-11-25T17:15:11","guid":{"rendered":"http:\/\/35.180.88.53\/?p=651"},"modified":"2018-11-25T18:17:31","modified_gmt":"2018-11-25T17:17:31","slug":"translating-sql-to-pandas","status":"publish","type":"post","link":"https:\/\/www.sergilehkyi.com\/es\/2018\/11\/translating-sql-to-pandas\/","title":{"rendered":"Translating SQL to pandas"},"content":{"rendered":"\n<p>Like a person with SQL background and a person that works a lot with SQL, first steps with pandas were little bit difficult for me. I would always think in terms of SQL and then wonder why pandas is so not-intuitive. But with the time I got used to a syntax and found my own associations between these two. So I decide to create kind of cheat sheet of basic SQL commands translated into pandas language.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Select<\/h4>\n\n\n\n<p>The most simple SQL statement will look like:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"># Select everything from table1<br>SELECT *<br>FROM table1<br><\/pre>\n\n\n\n<p>In pandas will look like:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df.head()<\/pre>\n\n\n\n<p>Although, by default, pandas will show only 5 first rows, what in SQL will be equivalent to LIMIT 5 or TOP (5) (depends on DBMS). Also just putting <code>df<\/code> in terminal will show you first and last 15 rows of the DataFrame.<\/p>\n\n\n\n<p>To select specific columns in SQL we will write the next query:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">SELECT column1, column2<br>FROM table1<br><\/pre>\n\n\n\n<p>In pandas that will be written like this:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df[['column1', 'column2']]<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Where<\/h4>\n\n\n\n<p>Normally we want to select data with a specific criteria and in SQL it will be done using WHERE clause:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">SELECT column1, column2<br>FROM table1<br>WHERE column1 = 2<\/pre>\n\n\n\n<p>In pandas this same query looks slightly different:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df[['column1', 'column2']].loc[df['column1'] == 2]<\/pre>\n\n\n\n<p>WHERE clause accepts logical operators &#8211; pandas too. So this SQL query:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">SELECT *<br>FROM table1<br>WHERE column1 &gt; 1 AND column2 &lt; 25<\/pre>\n\n\n\n<p>In pandas takes the following form:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df.loc[(df['column1'] &gt; 1) &amp; (df['column2'] &lt; 25)]<\/pre>\n\n\n\n<p>Operator for OR in pandas is &#8216;|&#8217;, NOT &#8211; &#8216;~&#8217;.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df.loc[(df['column1'] &gt; 1) | ~(df['column2'] &lt; 25)]<\/pre>\n\n\n\n<p>SQL WHERE clause also accepts more complex comparisons, for example: LIKE, IN and BETWEEN; pandas is also capable of doing it &#8211; just in a different way. Lets create a query where all these operators are used:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">SELECT *<br>FROM table1<br>WHERE column1 BETWEEN 1 and 5 AND column2 IN (20,30,40,50) AND column3 LIKE '%arcelona%'<\/pre>\n\n\n\n<p>This same query in pandas:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df.loc[(df['colum1'].between(1,5)) &amp; (df['column2'].isin([20,30,40,50])) &amp; (df['column3'].str.contains('arcelona'))]<\/pre>\n\n\n\n<p>More details on these methods can be found in documentation: <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/generated\/pandas.DataFrame.isin.html\">IN<\/a>, <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/generated\/pandas.Series.between.html\">BETWEEN<\/a>, <a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/generated\/pandas.Series.str.contains.html\">LIKE.<\/a><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Join<\/h4>\n\n\n\n<p>Joining tables is a common practice in data scientist&#8217;s routine and in SQL we do that using 4 different types of JOINS: INNER, LEFT, RIGHT and FULL. The query looks like the following:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">SELECT t1.column1, t2.column1<br>FROM table1 t1<br>INNER JOIN table2 t2 ON t1.column_id = t2.column_id<\/pre>\n\n\n\n<p>For all joins the syntax is the same, so we just change INNER to any of 3 types that left. In pandas this operation is better to do in 2 steps &#8211; first perform join and then select the data we need (although it can be done in one line):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df_joined = df1.join(df2, on='column_id', how='inner')<br>df_joined.loc[['column1_df1', 'column1_df2']]<\/pre>\n\n\n\n<p>If parameters <code>on<\/code> and <code>how<\/code> are not specified, pandas will perform LEFT join using indexes as key columns.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Group By<\/h4>\n\n\n\n<p>Grouping allows us to get some aggregation insights about our data: COUNT, SUM, AVG, MIN, MAX and so on:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">SELECT column1, count(*)<br>FROM table1<br>GROUP BY column1<\/pre>\n\n\n\n<p>In pandas we also have this flexibility:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df.groupby('column1')['column1'].count()<\/pre>\n\n\n\n<p>We have to specify column name in square brackets to include into results only that column otherwise we will get counts per each column in DataFrame.<\/p>\n\n\n\n<p>If we want to get a sum of all values in column2 grouped by column1 (total sales in every store, for example) and show only those that have reached certain level, lets say more than 1000 units, in SQL we would do the following:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">SELECT store, sum(sales)<br>FROM table1<br>GROUP BY store<br>HAVING sum(sales) &gt; 1000<\/pre>\n\n\n\n<p>In pandas we don&#8217;t have this privilege of HAVING clause, but still we can do that, in two steps: first to group data and then to filter it:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df_grouped = df.groupby('store')['sales'].sum()<br>df_grouped.loc[df_grouped &gt; 1000]<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Order By<\/h4>\n\n\n\n<p>To sort results in SQL we have ORDER BY clause, which is always the last in fetching results from the database. To select all values for the table and sort them by column1 in descending order we write the next query:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">SELECT *<br>FROM table1<br>ORDER BY column1 DESC<br><\/pre>\n\n\n\n<p>In pandas this same query will look like:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df.sort_values(by=['column1'], ascending=False)<\/pre>\n\n\n\n<p>This one for me is the most intuitive and clear. Although, when you practice more all these syntax differences stop bothering you.<\/p>\n\n\n\n<p>We can write much more complex queries in SQL and pandas, but this basic aligning has helped me a lot and I believe might help other SQLers. Pandas is a very powerful library and lets you extract, transform, clean your data in a very efficient way. More about it my articles <a href=\"https:\/\/towardsdatascience.com\/extracting-and-transforming-data-in-python-63291f63d350\">&#8220;Extracting and Transforming Data in Python&#8221;<\/a> and <a href=\"https:\/\/towardsdatascience.com\/cleaning-and-preparing-data-in-python-494a9d51a878\">&#8220;Cleaning and Preparing Data in Python&#8221;<\/a><\/p>\n\n\n\n<p>Thank you for reading and have a great day!<\/p>\n\n\n\n<hr class=\"wp-block-separator\"\/>\n\n\n\n<p style=\"text-align:center\">\n\nPhoto by&nbsp;<a href=\"https:\/\/unsplash.com\/photos\/Qy-CBKUg_X8?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Aaron Burden<\/a>&nbsp;on&nbsp;<a href=\"https:\/\/unsplash.com\/search\/photos\/autumn-aerial-landscape?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\">Unsplash<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Like a person with SQL background and a person that works a lot with SQL, first steps with pandas were&hellip;<\/p>\n","protected":false},"author":1,"featured_media":652,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[],"translation":{"provider":"WPGlobus","version":"3.0.0","language":"es","enabled_languages":["gb","es","uk"],"languages":{"gb":{"title":true,"content":true,"excerpt":false},"es":{"title":false,"content":false,"excerpt":false},"uk":{"title":false,"content":false,"excerpt":false}}},"_links":{"self":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts\/651"}],"collection":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/comments?post=651"}],"version-history":[{"count":3,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts\/651\/revisions"}],"predecessor-version":[{"id":655,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts\/651\/revisions\/655"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/media\/652"}],"wp:attachment":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/media?parent=651"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/categories?post=651"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/tags?post=651"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}