Data Wrangling Pt. 2

Continuing where I left off from Pt. 1.

I demonstrated how you can use Python to add and drop columns. Why this matters is clear: you want to work only with data that is relevant to your model. There is no point in wrangling data if you do not have a clear goal or purpose. If you are just playing with data to see what you can get, there are plenty of tools for that. If you are using this series to learn Python and how to work with data in Python, then my advice is simple: have a project in mind. Your purpose does not need to be a grand scheme. It can be as simple as a dashboard that conveys meaningful information to your target audience.

An end goal tells you what you will need to do next with your dataset. For the dataset I am using, the end goal was to build a model that determines in which counties for-profit and non-profit businesses are viable in the food industry. With that in mind, I knew the process I had to go through when wrangling my data:

  1. Identify the columns of data I need and the columns I do not need. Drop the ones I don't need and rename the ones I do.
  2. Some rows are missing values in the numeric columns, so the next step is imputation.
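As a quick illustration of step 1, here is a minimal sketch of dropping and renaming columns with pandas. The column names here are made up for the example, not taken from my dataset:

```python
import pandas as pd

# Toy frame standing in for a raw dataset (hypothetical columns)
df = pd.DataFrame({
    'FIPS': [1001, 1003],
    'POP_EST_2014': [55290, 200000],
    'UNUSED_CODE': ['A', 'B'],
})

# Drop the column I don't need, rename the one I do
df = df.drop(columns=['UNUSED_CODE'])
df = df.rename(columns={'POP_EST_2014': 'Population Estimate, 2014'})

print(df.columns.tolist())  # ['FIPS', 'Population Estimate, 2014']
```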

Here we are at Pt. 2, using imputation to fill in missing data in your dataset.

If you haven’t done so, you should first import the libraries needed and load your dataset.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
d1 = pd.read_excel('final.xls')

Now, in Pt. 1 I had datasets d1 through d5. Essentially I took those datasets and merged them using the following code:

#This is what I used to merge my datasets together.
#Note: the join_axes argument was removed in pandas 1.0; reindexing on d2.index does the same thing.

final = pd.concat([d2, d1], axis=1).reindex(d2.index)

#From here I can take the merged dataset and now save it into an xls file like this

final.to_excel('final.xls', index=False)

It is important that you are able to merge your datasets before continuing on.
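To see how this column-wise merge behaves, here is a small self-contained sketch with toy data (the values are made up, not from my datasets). `pd.concat` with `axis=1` lines the frames up on their index, and the `reindex` keeps the result aligned to d2's rows:

```python
import pandas as pd

# Two toy frames that share an index, standing in for d1 and d2
d1 = pd.DataFrame({'Fast Food 2014': [12, 30]}, index=[0, 1])
d2 = pd.DataFrame({'County': ['Autauga', 'Baldwin'],
                   'State': ['AL', 'AL']}, index=[0, 1])

# Column-wise merge aligned on d2's index
final = pd.concat([d2, d1], axis=1).reindex(d2.index)
print(final.shape)  # (2, 3)
```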

When that was done, I had decisions to make. First, I needed to see whether I had any missing data. I used the following line of code to check:

d1.isnull().sum()

My output was this:

State                                          0
County                                         0
2010 Census Population                         0
Population Estimate, 2014                      0
Food & Retail Tax 2014                         0
Fast Food 2009                                 0
Fast Food 2014                                 0
Fast Food % Change 09-14                      51
Full Service 2009                              0
Full Service 2014                              0
Full Service % Change 09-14                   36
Grocery Store 09                               0
Grocery Store 2014                             0
Grocery Store % Change 09-14                  20
Supercenters & Club Stores 2009                0
Supercenters & Club Stores 2014                0
Supercenters & Club Stores % Change 09-14    140
Convenience Stores 2009                        0
Convenience Stores 2014                        0
Convenience Stores % Change 09-14             24
Specialized Food Stores 2009                   0
Specialized Food Stores 2014                   0
Specialized Food Stores % Change 09-14       192
dtype: int64

I have missing data: any number higher than 0 is the count of missing values in that column. Because all of my missing data is numeric, filling it in is easy. I chose to fill in the missing values with the mean of each column. I fill the missing data and save the result to an xls file using the following lines of code:

#numeric_only=True skips text columns like State and County when computing the means
d1.fillna(d1.mean(numeric_only=True), inplace=True)
d1.to_excel('final1.xls', index=False)
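Here is a small runnable sketch of the same mean-imputation step on toy data (the county names and values are made up), so you can see exactly what `fillna` does with a column mean:

```python
import numpy as np
import pandas as pd

# Toy frame with a gap in a numeric column (values are made up)
df = pd.DataFrame({
    'County': ['Autauga', 'Baldwin', 'Barbour'],
    'Fast Food % Change 09-14': [10.0, np.nan, 20.0],
})

# Fill each numeric NaN with its column's mean; numeric_only=True skips the text column
df = df.fillna(df.mean(numeric_only=True))
print(df['Fast Food % Change 09-14'].tolist())  # [10.0, 15.0, 20.0]
```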

Once saved, you can load the new dataset. I will call it d2, and check again for missing data.

#d2 now represents the new xls file/dataset. There should not be any missing data in this when loaded
d2 = pd.read_excel('final1.xls')
d2.isnull().sum()

Your output for d2.isnull().sum() should be:

State                                        0
County                                       0
2010 Census Population                       0
Population Estimate, 2014                    0
Food & Retail Tax 2014                       0
Fast Food 2009                               0
Fast Food 2014                               0
Fast Food % Change 09-14                     0
Full Service 2009                            0
Full Service 2014                            0
Full Service % Change 09-14                  0
Grocery Store 09                             0
Grocery Store 2014                           0
Grocery Store % Change 09-14                 0
Supercenters & Club Stores 2009              0
Supercenters & Club Stores 2014              0
Supercenters & Club Stores % Change 09-14    0
Convenience Stores 2009                      0
Convenience Stores 2014                      0
Convenience Stores % Change 09-14            0
Specialized Food Stores 2009                 0
Specialized Food Stores 2014                 0
Specialized Food Stores % Change 09-14       0
dtype: int64

With this done, most of my data wrangling is complete. I have a clean dataset that I can now work with. The following comes next:

  1. Feature Engineering – deciding upon features/columns to add to my dataset.
  2. Application of Machine Learning Algorithms

Those two steps are part of the Data Modeling phase. This is typically where the job ends for a Data Analyst, and where the real work begins for Data Scientists.

For my Business Viability Project, the Modeling phase included using logistic regression and creating two features that automate the process of determining viability for non-profit and for-profit businesses.
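As a preview of that phase, here is a minimal sketch of fitting a logistic regression for a viability label with scikit-learn. The features and data below are invented for illustration; they are not my actual project code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up county features: [population, grocery store count] with a 0/1 viability label
X = np.array([[5000, 2], [80000, 15], [1200, 1],
              [150000, 30], [3000, 1], [60000, 12]])
y = np.array([0, 1, 0, 1, 0, 1])

# Fit the classifier and score a new, unseen county
model = LogisticRegression()
model.fit(X, y)
print(model.predict([[100000, 20]]))  # 0 or 1 viability prediction
```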
