Data Wrangling Pt. 1

Data Wrangling, everybody hates it, but it has to be done. With that being said, here is an easy guide to wrangling data.

For me to load my data, I import pandas first. Then I loaded my data. It is the same excel file, but different sheet within the file. d1-d5 has the data that I needed specifically for this project. This dataset comes from the Census Bureau, and is the data I used for my Business Viability Project.

The secret to data wrangling is simply understanding that you can do whatever you want. In this case, you can do what is needed to be done. Meaning, there isn’t one specific methodology to wrangling data.

import pandas as pd

d1 = pd.read_excel('DataDownload.xls', 'STORES')
d2 = pd.read_excel('DataDownload.xls', 'Supplemental Data - County')
d3 = pd.read_excel('DataDownload.xls', 'ASSISTANCE')
d4 = pd.read_excel('DataDownload.xls', 'PRICES_TAXES')
d5 = pd.read_excel('DataDownload.xls', 'RESTAURANTS') 




 

For example, for d1 I want to look at the dtypes within that specific dataset. Then I will type the following line:

d1.dtypes

The output should look like this:

FIPS                     int64
State                   object
County                  object
GROC09                   int64
GROC14                   int64
PCH_GROC_09_14         float64
GROCPTH09              float64
GROCPTH14              float64
PCH_GROCPTH_09_14      float64
SUPERC09                 int64
SUPERC14                 int64
PCH_SUPERC_09_14       float64
SUPERCPTH09            float64
SUPERCPTH14            float64
PCH_SUPERCPTH_09_14    float64
CONVS09                  int64
CONVS14                  int64
PCH_CONVS_09_14        float64
CONVSPTH09             float64
CONVSPTH14             float64
PCH_CONVSPTH_09_14     float64
SPECS09                  int64
SPECS14                  int64
PCH_SPECS_09_14        float64
SPECSPTH09             float64
SPECSPTH14             float64
PCH_SPECSPTH_09_14     float64
SNAPS12                float64
SNAPS16                  int64
PCH_SNAPS_12_16        float64
SNAPSPTH12             float64
SNAPSPTH16             float64
PCH_SNAPSPTH_12_16     float64
WICS08                   int64
WICS12                   int64
PCH_WICS_08_12         float64
WICSPTH08              float64
WICSPTH12              float64
PCH_WICSPTH_08_12      float64
dtype: object

For this example I will only specifically look at what I’ve done for  dataset d1. In this example of data wrangling, I want to change my column names. I also want to drop  a few columns.

d1.rename(columns={'GROC09': 'Grocery Store 2009', 'GROC14': 'Grocery Store 2014', 'PCH_GROC_09_14': 'Grocery Store % Change 09-14', 'SUPERC09': 'Supercenters & Club Stores 2009', 'SUPERC14': 'Supercenters & Club Stores 2014', 'PCH_SUPERC_09_14': 'Supercenters & Club Stores % Change 09-14', 'CONVS09': 'Convenience Stores 2009', 'CONVS14': 'Convenience Stores 2014', 'PCH_CONVS_09_14': 'Convenience Stores % Change 09-14', 'SPECS09': 'Specialized Food Stores 2009', 'SPECS14': 'Specialized Food Stores 2014', 'PCH_SPECS_09_14': 'Specialized Food Stores % Change 09-14' }, inplace=True)

The code above is what I did, to rename the desired columns

d1.rename(columns=['current name': 'new name']) 

So now if I wanted to drop a column I can type the following:

d1.drop('SNAPS12', axis=1, inplace=True)

or

del d1['FIPS']

The first method just drops the columns within the current kernal, while the latter will delete columns directly.

I will repeat this process for d2-d5.

This ends the first part of data wrangling. Next, I will focus on  filling missing data.

Continue to Pt.2

 

 

Leave a comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: