Data wrangling: everybody hates it, but it has to be done. With that in mind, here is an easy guide to wrangling data.
To load my data, I import pandas first and then read the data in. Each dataset comes from the same Excel file, just a different sheet within it. d1-d5 hold the data I needed specifically for this project. The dataset comes from the Census Bureau, and it is the data I used for my Business Viability Project.
The secret to data wrangling is understanding that there isn't one specific methodology: you can do whatever the data calls for. In this case, that means doing only what the project needs.
import pandas as pd

d1 = pd.read_excel('DataDownload.xls', 'STORES')
d2 = pd.read_excel('DataDownload.xls', 'Supplemental Data - County')
d3 = pd.read_excel('DataDownload.xls', 'ASSISTANCE')
d4 = pd.read_excel('DataDownload.xls', 'PRICES_TAXES')
d5 = pd.read_excel('DataDownload.xls', 'RESTAURANTS')
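Before changing anything, it can help to confirm each sheet loaded the way you expect. Here is a minimal check I might run (the sheet labels are only there to make the output readable):

# quick sanity check: print (rows, columns) for each sheet
for name, df in [('STORES', d1), ('Supplemental Data - County', d2),
                 ('ASSISTANCE', d3), ('PRICES_TAXES', d4), ('RESTAURANTS', d5)]:
    print(name, df.shape)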
For example, to look at the dtypes within d1, I type the following line:
d1.dtypes
The output should look like this:
FIPS                   int64
State                 object
County                object
GROC09                 int64
GROC14                 int64
PCH_GROC_09_14       float64
GROCPTH09            float64
GROCPTH14            float64
PCH_GROCPTH_09_14    float64
SUPERC09               int64
SUPERC14               int64
PCH_SUPERC_09_14     float64
SUPERCPTH09          float64
SUPERCPTH14          float64
PCH_SUPERCPTH_09_14  float64
CONVS09                int64
CONVS14                int64
PCH_CONVS_09_14      float64
CONVSPTH09           float64
CONVSPTH14           float64
PCH_CONVSPTH_09_14   float64
SPECS09                int64
SPECS14                int64
PCH_SPECS_09_14      float64
SPECSPTH09           float64
SPECSPTH14           float64
PCH_SPECSPTH_09_14   float64
SNAPS12              float64
SNAPS16                int64
PCH_SNAPS_12_16      float64
SNAPSPTH12           float64
SNAPSPTH16           float64
PCH_SNAPSPTH_12_16   float64
WICS08                 int64
WICS12                 int64
PCH_WICS_08_12       float64
WICSPTH08            float64
WICSPTH12            float64
PCH_WICSPTH_08_12    float64
dtype: object
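Checking dtypes is mostly about spotting columns that loaded as an unexpected type. If you want a quick split of text versus numeric columns, something like this works (this is just a convenience, not a required step):

# separate the text columns from the numeric ones for a quick look
text_cols = d1.select_dtypes(include='object').columns
numeric_cols = d1.select_dtypes(include='number').columns
print(list(text_cols))                  # should be State and County here
print(len(numeric_cols), 'numeric columns')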
For this walkthrough I will only look at what I did to dataset d1. Here, I want to change some column names and drop a few columns.
d1.rename(columns={'GROC09': 'Grocery Store 2009',
                   'GROC14': 'Grocery Store 2014',
                   'PCH_GROC_09_14': 'Grocery Store % Change 09-14',
                   'SUPERC09': 'Supercenters & Club Stores 2009',
                   'SUPERC14': 'Supercenters & Club Stores 2014',
                   'PCH_SUPERC_09_14': 'Supercenters & Club Stores % Change 09-14',
                   'CONVS09': 'Convenience Stores 2009',
                   'CONVS14': 'Convenience Stores 2014',
                   'PCH_CONVS_09_14': 'Convenience Stores % Change 09-14',
                   'SPECS09': 'Specialized Food Stores 2009',
                   'SPECS14': 'Specialized Food Stores 2014',
                   'PCH_SPECS_09_14': 'Specialized Food Stores % Change 09-14'},
          inplace=True)
The code above is what I did to rename the desired columns. The general pattern is:
d1.rename(columns={'current name': 'new name'}, inplace=True)
Now, if I want to drop a column, I can type either of the following:
d1.drop('SNAPS12', axis=1, inplace=True)
or
del d1['FIPS']
Both approaches only change the DataFrame in memory for the current kernel session; the original Excel file is untouched. The practical difference is that drop() returns a new DataFrame when you leave out inplace=True, while del always removes the column from the existing DataFrame directly.
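Here is a small sketch of that difference, using a couple of other columns from d1 as examples (d1_trimmed is just an illustrative name):

# drop() without inplace=True returns a new DataFrame and leaves d1 untouched
d1_trimmed = d1.drop('SNAPS16', axis=1)

# del (like drop with inplace=True) removes the column from d1 itself
del d1['WICS08']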
I will repeat this process for d2-d5.
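Since every sheet has its own column names, the rename dictionaries differ, but the shape of the work is the same. A rough sketch of how that repetition might look (the empty dictionaries and lists are placeholders, not the actual mappings I used):

# apply the same pattern sheet by sheet; fill in each sheet's
# old-name -> new-name pairs and the columns it should lose
for df, rename_map, cols_to_drop in [(d2, {}, []),
                                     (d3, {}, []),
                                     (d4, {}, []),
                                     (d5, {}, [])]:
    df.rename(columns=rename_map, inplace=True)
    df.drop(columns=cols_to_drop, inplace=True)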
This ends the first part of data wrangling. Next, I will focus on filling missing data.