Continuing where I left off from Pt. 1.
I demonstrated how you can use Python to add and drop columns. The importance of being able to do this is clear: you want to work only with data that is relevant to your model. There is no point in wrangling data if you do not have a clear goal or purpose. If you are just playing with data to see what you can get, there are plenty of tools you can use. If you are using this series to learn Python and how to work with data in Python, my advice is simple: have a project in mind. Your purpose does not need to be a grand scheme. You can simply create a dashboard that conveys meaningful information to your target audience.
An end goal lets you know what you need to do next with your dataset. For the dataset I am using, the end goal was to create a model that determines in which counties for-profit and non-profit businesses are viable in the food industry. With that in mind, I knew and understood the process I had to go through when wrangling my data:
- Identify the columns of data I need and the columns I do not need. Drop the ones I don't need and rename the ones I do (a short recap sketch follows this list).
- Some rows are missing values in the numeric columns, so the next step is imputation.
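As a quick recap of Pt. 1, dropping and renaming columns in pandas looks roughly like the sketch below. The column names here are placeholders I made up for illustration, not the ones from my actual dataset.
# Recap sketch: drop a column you do not need, then rename one you keep.
# 'Unwanted Column', 'Old Name', and 'New Name' are hypothetical examples.
d1 = d1.drop(['Unwanted Column'], axis=1)
d1 = d1.rename(columns={'Old Name': 'New Name'})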
Here we are at Pt. 2, using imputation to fill in missing data in your dataset.
If you haven’t done so, you should first import the libraries needed and load your dataset.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

d1 = pd.read_excel('final.xls')
Now, in Pt. 1 I had d1-d5. Essentially, I took those datasets and merged them using the following code:
# This is what I used to merge my datasets together
final = pd.concat([d2, d1], axis=1, join_axes=[d2.index])
# Note: join_axes was removed in pandas 1.0; the modern equivalent is
# final = pd.concat([d2, d1], axis=1).reindex(d2.index)

# From here I can take the merged dataset and save it to an xls file like this
final.to_excel('final.xls', index=False)
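If you have more than two frames to combine (d1 through d5 in my case), you can pass them all to pd.concat in one call. This is only a sketch: the frame order and the index you align on are assumptions about your data, not the exact call I used.
# Hypothetical sketch: column-wise concatenation of several frames at once,
# aligning everything on d1's index. Adjust the list and index to your data.
final = pd.concat([d1, d2, d3, d4, d5], axis=1).reindex(d1.index)
final.to_excel('final.xls', index=False)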
It is important that you are able to merge your datasets before continuing on.
When that was done, I had decisions to make. First, I needed to see whether I had any missing data. I used the following line of code to check:
d1.isnull().sum()
My output was this:
State                                           0
County                                          0
2010 Census Population                          0
Population Estimate, 2014                       0
Food & Retail Tax 2014                          0
Fast Food 2009                                  0
Fast Food 2014                                  0
Fast Food % Change 09-14                       51
Full Service 2009                               0
Full Service 2014                               0
Full Service % Change 09-14                    36
Grocery Store 09                                0
Grocery Store 2014                              0
Grocery Store % Change 09-14                   20
Supercenters & Club Stores 2009                 0
Supercenters & Club Stores 2014                 0
Supercenters & Club Stores % Change 09-14     140
Convenience Stores 2009                         0
Convenience Stores 2014                         0
Convenience Stores % Change 09-14              24
Specialized Food Stores 2009                    0
Specialized Food Stores 2014                    0
Specialized Food Stores % Change 09-14        192
dtype: int64
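If the list of columns is long, a quick way to see only the columns that actually have missing values is a small filter like the one below. This is just a convenience sketch, not part of my original workflow.
# Optional sketch: show only the columns that still have missing values.
missing = d1.isnull().sum()
print(missing[missing > 0])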
I have missing data. Any number higher than 0 is the count of missing values in that column. Because the missing values are all in numeric columns, filling them in is straightforward. I chose to fill in the missing data with the mean of each column, then save the result to an xls file, using the following lines of code.
# Fill each missing value with the mean of its column, then save the result.
# In newer pandas you may need d1.mean(numeric_only=True) to skip the text columns.
d1.fillna(d1.mean(), inplace=True)
d1.to_excel('final1.xls', index=False)
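Mean imputation is not the only option. If some of your columns are heavily skewed, filling with the median can be a safer choice. The line below is an alternative sketch, not what I used for my dataset.
# Alternative sketch: fill missing values with each column's median instead.
# Useful when a column has outliers that would pull the mean.
d1.fillna(d1.median(numeric_only=True), inplace=True)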
Once saved, you can load the new dataset. I will load it as d2 and check for missing data again.
# d2 now represents the new xls file/dataset.
# There should not be any missing data when it is loaded.
d2 = pd.read_excel('final1.xls')
d2.isnull().sum()
Your output for d2.isnull().sum() should be:
State                                          0
County                                         0
2010 Census Population                         0
Population Estimate, 2014                      0
Food & Retail Tax 2014                         0
Fast Food 2009                                 0
Fast Food 2014                                 0
Fast Food % Change 09-14                       0
Full Service 2009                              0
Full Service 2014                              0
Full Service % Change 09-14                    0
Grocery Store 09                               0
Grocery Store 2014                             0
Grocery Store % Change 09-14                   0
Supercenters & Club Stores 2009                0
Supercenters & Club Stores 2014                0
Supercenters & Club Stores % Change 09-14      0
Convenience Stores 2009                        0
Convenience Stores 2014                        0
Convenience Stores % Change 09-14              0
Specialized Food Stores 2009                   0
Specialized Food Stores 2014                   0
Specialized Food Stores % Change 09-14         0
dtype: int64
With this done, most of my data wrangling is complete. I have a clean dataset that I can now work with. The next steps are:
- Feature Engineering – deciding on features/columns to add to my dataset (see the sketch after this list).
- Application of Machine Learning Algorithms
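To make the feature engineering bullet concrete, here is a minimal sketch of the kind of derived column you might add. The per-capita idea is only an illustration, not one of the two features I ultimately built for my project; the column names it uses come from the dataset output above.
# Illustrative feature: fast-food restaurants per 1,000 residents in 2014.
d2['Fast Food per 1k, 2014'] = d2['Fast Food 2014'] / d2['Population Estimate, 2014'] * 1000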
Feature engineering and applying machine learning algorithms are part of the Data Modeling phase. This is typically where the job ends for a Data Analyst, and where the real work begins for Data Scientists.
For my Business Viability Project, the Modeling Phase included using logistic regression and creating two features that automate the process of determining viability for non-profit and for-profit businesses.
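As a preview of that phase, a minimal logistic regression sketch with scikit-learn might look like the following. The feature columns chosen here and the 'Viable' target label are placeholders for illustration, not the actual features or labels from my project.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical feature columns and target; substitute your own.
X = d2[['Fast Food 2014', 'Grocery Store 2014', 'Population Estimate, 2014']]
y = d2['Viable']  # placeholder 0/1 label, not a real column in my dataset

# Hold out part of the data to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))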