{"id":499,"date":"2018-09-12T08:12:04","date_gmt":"2018-09-12T06:12:04","guid":{"rendered":"http:\/\/35.180.88.53\/?p=499"},"modified":"2018-09-12T08:20:44","modified_gmt":"2018-09-12T06:20:44","slug":"what-i-learned-yesterday-18-importing-data-in-python","status":"publish","type":"post","link":"https:\/\/www.sergilehkyi.com\/es\/2018\/09\/what-i-learned-yesterday-18-importing-data-in-python\/","title":{"rendered":"What I Learned Yesterday #18 (Importing Data in Python)"},"content":{"rendered":"\n<p>Recently I finished two courses on data import in Python at DataCamp, and I was really surprised by the number of sources that can be used to get data. Here I would like to summarize all those methods and, at the same time, consolidate my knowledge. I think it will be useful for others as well. So, let&#8217;s begin.<\/p>\n\n\n\n<p>There is a huge variety of files that can be used as the data source:\u00a0<\/p>\n\n\n\n<ul><li>flat files &#8211; CSV, TXT, TSV, etc.<\/li><li>pickled files<\/li><li>Excel spreadsheets<\/li><li>SAS and Stata files<\/li><li>HDF5<\/li><li>MATLAB<\/li><li>SQL databases<\/li><li>web pages<\/li><li>APIs<\/li><\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Flat files<\/h4>\n\n\n\n<p>Flat files &#8211; TXT, CSV &#8211; are the easiest to work with, and there are a few ways to import them using numpy or pandas.<\/p>\n\n\n\n<p><em>numpy.recfromcsv<\/em> &#8211; Load ASCII data stored in a comma-separated file. The returned array is a record array (if usemask=False, see recarray) or a masked record array (if usemask=True, see ma.mrecords.MaskedRecords).<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">data = np.recfromcsv(file)<\/pre>\n\n\n\n<p><em>numpy.loadtxt<\/em> &#8211;\u00a0This function aims to be a fast reader for simply formatted files. 
The <em>genfromtxt<\/em>\u00a0function provides more sophisticated handling of, e.g., lines with missing values.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">data = np.loadtxt('file.csv', delimiter=',', skiprows=1, usecols=[0,2])\n<\/pre>\n\n\n\n<p><em>numpy.genfromtxt<\/em> &#8211; Load data from a text file, with missing values handled as specified. A much more sophisticated function with a lot of parameters to control your import.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)\n<\/pre>\n\n\n\n<p>With pandas it&#8217;s even easier &#8211; one line and your file is in a DataFrame, ready to use. It also supports iterating over the file or reading it in chunks.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">data = pd.read_csv(file, nrows=5, header=None, sep='\\t', comment='#',\nna_values='Nothing')<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Pickle<\/h4>\n\n\n\n<p>What the hell is pickle? It is used for serializing and de-serializing a Python object structure. Any object in Python can be pickled so that it can be saved on disk. Pickling \u201cserializes\u201d the object before writing it to file &#8211; it converts a Python object (list, dict, etc.) into a character stream that contains all the information necessary to reconstruct the object in another Python script. The code below will print a dictionary that was created somewhere else and stored in a file &#8211; pretty cool, isn&#8217;t it?<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import pickle\nwith open('data.pkl', 'rb') as file:\n    d = pickle.load(file)\nprint(d)<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Excel<\/h4>\n\n\n\n<p>With <em>pandas.read_excel<\/em>, which reads an Excel table into a pandas DataFrame and offers a lot of customization, importing data has never been more pleasant (sounds like a TV commercial :D). 
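<\/p>\n\n\n\n<p>For example, a single sheet can be selected by name or index, and passing sheet_name=None reads every sheet at once, returning a dictionary that maps sheet names to DataFrames (a small sketch &#8211; the file name 'report.xlsx' is a made-up placeholder):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import pandas as pd\n# sheet_name=None loads all sheets into a dict: {sheet_name: DataFrame}\n# 'report.xlsx' is a placeholder file name\nsheets = pd.read_excel('report.xlsx', sheet_name=None)\nfor name, df in sheets.items():\n    print(name, df.shape)<\/pre>\n\n\n\n<p>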
But it is really true &#8211; the documentation for this function is clear, and you can actually do whatever you want with that Excel file.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df = pd.read_excel('file.xlsx', sheet_name='sheet1')<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">SAS and Stata<\/h4>\n\n\n\n<p>SAS\u00a0stands for Statistical Analysis System.\u00a0A SAS data set contains data values that are organized as a table of observations (rows) and variables (columns). To open this type of file and import data from it, the code sample below will help:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">from sas7bdat import SAS7BDAT\nwith SAS7BDAT('some_data.sas7bdat') as file:\n    df_sas = file.to_data_frame()<\/pre>\n\n\n\n<p>Stata is a powerful statistical software package that enables users to analyze, manage, and produce graphical visualizations of data. It is primarily used by researchers in the fields of economics, biomedicine, and political science to examine data patterns. Data is stored in .dta files, and the best way to import them is <em>pandas.read_stata<\/em>:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df = pd.read_stata('file.dta')<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">HDF5<\/h4>\n\n\n\n<p>Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data.\u00a0HDF5 is a unique technology suite that makes possible the management of extremely large and complex data collections.\u00a0It simplifies the file structure to include only two major types of object:<\/p>\n\n\n\n<ul><li>Datasets, which are multidimensional arrays of a homogeneous type<\/li><li>Groups, which are container structures that can hold datasets and other groups<\/li><\/ul>\n\n\n\n<p>This results in a truly hierarchical, filesystem-like data format. In fact, resources in an HDF5 file are even accessed using the POSIX-like syntax <em>\/path\/to\/resource<\/em>. 
Metadata is stored in the form of user-defined, named attributes attached to groups and datasets. More complex storage APIs representing images and tables can then be built up using datasets, groups and attributes.<\/p>\n\n\n\n<p>To import an HDF5 file we&#8217;ll need the <em>h5py<\/em> library. The code sample below made everything easier and totally understandable for me.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import h5py\n# Load file: data\ndata = h5py.File('file.hdf5', 'r')\n# Print the keys of the file\nfor key in data.keys():\n    print(key)\n\n# Now that we know the keys we can get the HDF5 group: group\ngroup = data['group_name']\n# Going one level deeper, check out the keys of the group\nfor key in group.keys():\n    print(key)\n\n# And so on and so on<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">MATLAB<\/h4>\n\n\n\n<p>A lot of people work with MATLAB and store data in .mat files. So what are those files? They contain the variables and the objects assigned to them in the MATLAB workspace. It&#8217;s not surprising that such a file is imported in Python as a dictionary whose keys are the MATLAB variable names and whose values are the objects assigned to those variables. To read and write MATLAB files the <em>scipy.io<\/em> package is used.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import scipy.io\nmat = scipy.io.loadmat('some_project.mat')\nprint(mat.keys())<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Relational Databases<\/h4>\n\n\n\n<p>Using drivers to connect to a database, we can grab data directly from it. Usually this means: create a connection, connect, run the query, fetch the data, and close the connection. It is possible to do all that step by step, but pandas has an awesome function that does it for us, so why bother? It only requires a connection, which can be created with the <em>sqlalchemy<\/em> package. 
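<\/p>\n\n\n\n<p>For the curious, the step-by-step version looks roughly like this (a sketch &#8211; the database file and the table name some_table are placeholders):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">from sqlalchemy import create_engine, text\n# Create engine and open a connection\nengine = create_engine('sqlite:\/\/\/localdb.sqlite')\ncon = engine.connect()\n# Run the query and fetch the data ('some_table' is a placeholder)\nresult = con.execute(text('SELECT * FROM some_table'))\nrows = result.fetchall()\n# Close the connection\ncon.close()<\/pre>\n\n\n\n<p>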
Below is an example of connecting to a SQLite database engine and getting data from it:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">from sqlalchemy import create_engine\nimport pandas as pd\n# Create engine\nengine = create_engine('sqlite:\/\/\/localdb.sqlite')\n# Execute query and store records in DataFrame\ndf = pd.read_sql_query(\"SELECT * FROM some_table\", engine)<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Data from Web<\/h4>\n\n\n\n<p>A separate article could be written on this, but I will highlight a few things so you at least know where to start. First of all, if we have a direct URL to a file, we can just use the standard <em>pandas.read_csv<\/em>\/<em>pandas.read_excel<\/em> functions, passing the URL as the file path:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">df = pd.read_csv('https:\/\/www.example.com\/data.csv', sep=';')<\/pre>\n\n\n\n<p>Apart from this, to get data from the web we need to use the HTTP protocol, in particular the GET method (there are many HTTP methods, but for importing data we don&#8217;t need more). The <em>requests<\/em> package does an incredible job at this. To access the text of the response received from <em>requests.get<\/em>, we just need to use the <em>.text<\/em> attribute.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import requests\nr = requests.get('http:\/\/www.example.com\/some_html_page')\nprint(r.text)<\/pre>\n\n\n\n<p>r.text gives us the web page with all the HTML tags on it &#8211; not very useful, is it? But here is where the fun begins. 
We have the <em>BeautifulSoup<\/em> package, which can parse that HTML and extract the information we need &#8211; in this case, all hyperlinks (continuing the previous example):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">from bs4 import BeautifulSoup\nhtml_doc = r.text\n# Create a BeautifulSoup object from the HTML\nsoup = BeautifulSoup(html_doc, 'html.parser')\n\n# Find all 'a' tags (which define hyperlinks)\na_tags = soup.find_all('a')\n# Print the URLs to the shell\nfor link in a_tags:\n    print(link.get('href'))<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">API<\/h4>\n\n\n\n<p>In computer programming, an application programming interface (API) is a set of subroutine definitions, communication protocols, and tools for building software. In general terms, it is a set of clearly defined methods of communication between various components. There are a lot of different APIs, and the first thing to do is check the documentation, but the truth is that almost all APIs return data in JSON format. We have to be able to catch that result, and once again the <em>requests<\/em> package will help us with it (we have to send an HTTP GET request to get data from the API).<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import requests\nr = requests.get('https:\/\/www.example.com\/some_endpoint')\n# Decode the JSON data into a dictionary: json_data\njson_data = r.json()\n# Print each key-value pair in json_data\nfor k in json_data.keys():\n    print(k + ': ', json_data[k])<\/pre>\n\n\n\n<p>At this point my little summary comes to an end. 
Hopefully it will be useful not only for me.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Recently I finished two courses on data import in Python at DataCamp and I was really surprised by the number&hellip;<\/p>\n","protected":false},"author":1,"featured_media":508,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,5,7],"tags":[],"translation":{"provider":"WPGlobus","version":"3.0.0","language":"es","enabled_languages":["gb","es","uk"],"languages":{"gb":{"title":true,"content":true,"excerpt":false},"es":{"title":false,"content":false,"excerpt":false},"uk":{"title":false,"content":false,"excerpt":false}}},"_links":{"self":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts\/499"}],"collection":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/comments?post=499"}],"version-history":[{"count":5,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts\/499\/revisions"}],"predecessor-version":[{"id":507,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/posts\/499\/revisions\/507"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/media\/508"}],"wp:attachment":[{"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/media?parent=499"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/categories?post=499"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.sergilehkyi.com\/es\/wp-json\/wp\/v2\/tags?post=499"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}