Data Science Made Easy using Python

Introduction to Python

Python is a general-purpose programming language that offers rich functionality for data science. You can use either Python 2 or Python 3 for data science work, although Python 2 reached end of life in January 2020, so Python 3 is the sensible default for new projects. Choose the version based on the availability of packages and any requirement to support legacy code.

How to set up a data science environment?

Because individual packages have many dependencies that must be satisfied at install time, it is advisable to use a Python distribution instead of building a data science environment from scratch. Available Python distributions include ‘Anaconda’, ‘Enthought Canopy Express’, ‘Python(x,y)’ and ‘WinPython’. The latter two are Windows-only solutions.

Introduction to Conda

‘Conda’ is a package management tool available in Anaconda that simplifies searching for, installing and managing packages together with their dependencies. Conda is also an environment manager that you can use to create separate environments running different Python versions. In addition, Conda can be used together with continuous integration systems such as Travis CI and AppVeyor to support automated code testing. Conda is bundled with all versions of Anaconda, Miniconda and Anaconda Repository.
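
Because each Conda environment carries its own interpreter, it can help to confirm from inside Python which one is currently active. The following is a minimal sketch using only the standard library. The helper name describe_interpreter is hypothetical, and the sketch assumes that Conda sets the CONDA_DEFAULT_ENV environment variable when an environment is activated.

```python
import os
import sys


def describe_interpreter():
    """Return basic facts about the interpreter running this code."""
    return {
        "executable": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        # Assumption: conda exports CONDA_DEFAULT_ENV on activation.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "(not set)"),
    }


info = describe_interpreter()
for key, value in info.items():
    print("{}: {}".format(key, value))
```

Running this in two different environments should show a different executable path in each, and possibly a different Python version.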

How to install a package?

To install a package, pass its name to Conda at the Anaconda prompt using the construct conda install PACKAGENAME.

Introduction to Pip

When the package you want is not available through Conda or on Anaconda.org, you can use pip to install it. Pip is bundled with Anaconda and Miniconda, so a separate install is not necessary. Installing packages via pip works in much the same way as installing via Conda: just pass the package name using the construct pip install PACKAGENAME.
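
Before reaching for conda install or pip install, it can be useful to check which packages the current environment already provides. The sketch below is one possible way to do that with the standard library. The helper name missing_packages and the made-up package name not_a_real_package_xyz are assumptions used for illustration only.

```python
import importlib.util


def missing_packages(import_names):
    """Return the top-level names from import_names that cannot be imported here."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# 'json' ships with Python; 'not_a_real_package_xyz' is a made-up name
# that is expected to be reported as missing.
wanted = ["json", "not_a_real_package_xyz"]
missing = missing_packages(wanted)
print("Missing:", missing)

for name in missing:
    # The install name can differ from the import name (for example,
    # scikit-learn is imported as sklearn), so treat this as a hint only.
    print("Try: conda install {0}  (or: pip install {0})".format(name))
```

The check only looks at top-level import names, so it says nothing about which version of a package is installed.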

Introduction to Jupyter notebooks

Although there are many IDEs that support Python, Jupyter notebooks are often a more convenient way of working with data science code. Jupyter Notebook is an open source web application that supports the creation and sharing of code. It allows you to write code and documentation, run the code and see the results in one place. Because the environment is so complete, you can carry out an entire workflow in it: data acquisition, exploration, cleaning and modelling. Jupyter is bundled with Anaconda, so there is no extra installation step. To start Jupyter, issue the command jupyter notebook at the Anaconda prompt. A web page will open in your default browser, giving you access to your notebooks.

Data Science workflow

The previous sections introduced data science workflows and demonstrated how to set up a data science environment. After completing those sections, you should have a working environment. The following sections introduce data acquisition, exploration, cleaning and modelling.

Data Exploration in detail

After reading in data, the next step in a workflow is data exploration. The objective here is to understand your data and identify anomalies. Some of the things to examine are missing values, the range of observations and relationships among variables. In Pandas, missing values can be either standard or non-standard. Standard missing values, such as NA, are detected automatically by Pandas as missing. To check for standard missing values, use the isnull function. To get the count of missing values in each column of the TB data, use the command print(tb_burden.isnull().sum())
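
The difference between standard and non-standard missing values can be sketched on a small made-up table. The data below is not the TB data set. The column names and the "--" marker are assumptions chosen purely for illustration.

```python
import io

import pandas as pd

# Made-up CSV text for illustration; it is not the TB data set.
raw = io.StringIO(
    "country,cases,deaths\n"
    "A,120,NA\n"
    "B,,15\n"
    "C,--,7\n"
)

# By default, read_csv treats empty cells and the string NA as missing.
df = pd.read_csv(raw)
standard_counts = df.isnull().sum()
print(standard_counts)

# The "--" marker is not recognised by default, so name it explicitly.
raw.seek(0)
df_clean = pd.read_csv(raw, na_values=["--"])
all_counts = df_clean.isnull().sum()
print(all_counts)
```

In the first read, the "--" entry survives as ordinary text and is not counted by isnull. Passing it through na_values in the second read makes Pandas count it along with the standard missing values.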

Building a Model

After getting a good understanding of your data and resolving data-quality problems, the next step is building a model. The scikit-learn package provides an environment for model building and evaluation. Both supervised and unsupervised models can be developed.
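
As a rough illustration of the two kinds of model, the sketch below fits one supervised and one unsupervised estimator from scikit-learn. It uses synthetic data generated on the spot rather than any data set from this tutorial, and the choice of logistic regression and k-means is an assumption made for the example, not a recommendation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 200 points in two well-separated groups (illustration only).
X, y = make_blobs(
    n_samples=200, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=0
)

# Supervised: fit on labelled training data, then score on held-out data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = LogisticRegression()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print("Held-out accuracy:", accuracy)

# Unsupervised: group the same points without using the labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X)
sizes = sorted(int((cluster_ids == k).sum()) for k in range(2))
print("Cluster sizes:", sizes)
```

The supervised model is evaluated on data it did not see during fitting, which is the usual way to estimate how well it generalises. The clustering step never sees the labels at all.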

Bottom Line

In this tutorial, data science workflows were introduced. The use of Python for data science was introduced, and setting up a Python environment was demonstrated. Importing data from flat files and relational databases was discussed. Statistical and graphical techniques for data exploration were briefly discussed. Finally, the available models were discussed.
