Pandas Intermediate 3

Pandas Intermediate 3#

Description: This notebook discusses:

How to use the backend Plotly to make an interactive chart
The entire pipeline from data cleaning and manipulation, data summary to plotting in Pandas

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Intermediate

Knowledge Required:

Python Basics (Start Python Basics I)
Pandas Basics (Start Pandas Basics I)

Knowledge Recommended:

Completion Time: 90 minutes

Data Format: csv

Libraries Used: Pandas

Research Pipeline: None

# download plotly
%pip install plotly

# make sure that the plots are rendered properly
import plotly.io as pio
pio.renderers.default = "notebook" 

Requirement already satisfied: plotly in /opt/homebrew/lib/python3.11/site-packages (6.2.0)
Requirement already satisfied: narwhals>=1.15.1 in /opt/homebrew/lib/python3.11/site-packages (from plotly) (1.46.0)
Requirement already satisfied: packaging in /Users/mearacox/Library/Python/3.11/lib/python/site-packages (from plotly) (25.0)
Note: you may need to restart the kernel to use updated packages.

# import Pandas 
import pandas as pd

# import urllib
import urllib

# choose Plotly as the backend for plotting
pd.options.plotting.backend='plotly'

# import the graph_objects module from plotly
import plotly.graph_objects as go 

Interactive charts#

Interactive charts can effectively tell a story. They also allow the audience to explore the information in a gradual and interactive way. The process of exploration is also a knowledge-building process for the audience.

In this section, we are going to make an interactive line graph.

As discussed in Pandas intermediate 2, a line graph is usually used to show the change in a value of interest over time. Suppose we are interested in the change of the median of the rent of a 1-bedroom apartment from 2019 - 2023 in the different areas of the state of Massachusetts.

### load the data

# download the file
from pathlib import Path
import urllib.request

# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)

# Download the file
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/PandasIntermediate3_ma_rent_1b_median.csv'
file = '../data/' + url.split('/')[-1]
urllib.request.urlretrieve(url, file)

# Download success message
print('Sample file ready.')

# read data into a df
ma_rent = pd.read_csv(file)

# take a look at the df
ma_rent

Sample file ready.

	areaname	rent19	rent20	rent21	rent22	rent23
0	State median	1115	1133	1211	1307	1459
1	Barnstable Town	1237	1215	1334	1509	1655
2	Boston-Cambridge-Quincy	1904	2008	2034	2139	2368
3	Brockton	1215	1244	1270	1392	1516
4	Lawrence	1123	1184	1219	1307	1487
5	Lowell	1264	1270	1282	1450	1587
6	Pittsfield	885	862	962	1065	1161
7	Berkshire County	1014	961	961	1086	1148
8	Providence-Fall River	948	958	1030	1110	1275
9	Taunton-Mansfield-Norton	1024	1017	1073	1198	1339
10	Easton-Raynham	1181	1211	1297	1495	1759
11	New Bedford	826	834	875	958	1106
12	Springfield	874	922	954	940	1059
13	Worcester	1021	1178	1214	1243	1361
14	Eastern Worcester County	1012	1016	1043	1138	1372
15	Fitchburg-Leominster	881	859	865	933	1116
16	Western Worcester County	773	781	779	836	1001
17	Dukes County	1641	1650	1855	2068	2200
18	Franklin County	929	907	1043	982	1066
19	Nantucket County	1426	1447	1925	1983	2142

### explore the data
# what is the min rent over the years
print(ma_rent.min())

# what is the max rent over the years
print(ma_rent.max())

areaname    Barnstable Town
rent19                  773
rent20                  781
rent21                  779
rent22                  836
rent23                 1001
dtype: object
areaname    Worcester
rent19           1904
rent20           2008
rent21           2034
rent22           2139
rent23           2368
dtype: object

How would we want to make an interactive line graph? One possibility is that we can make a dropdown menu of the different areas of MA. Depending on which area in MA the user selects, a line graph showing the change in the median rent of a 1-bedroom apartment in that area will be displayed. We can also put a line representing the state median rent on the graph as a benchmark.

### make a figure with appropriate y axies range
fig = go.Figure(layout_yaxis_range=[700,2500])

### make the dropdown menu
buttons = []
x_val = ['2019', '2020', '2021', '2022', '2023'] # specify the x values
for row in ma_rent.index[1:]: # loop through all rows except the state median row
    buttons.append(dict(
                        label=ma_rent.loc[row,'areaname'], # get the area name
                        method='update', # specify how we will modify the chart when clicking on a button 
                        visible=True, # specify that data points will be shown
                        args=[{'y': [ma_rent.iloc[0,1:], ma_rent.iloc[row,1:]], # y values for the two lines
                               'x': [x_val,x_val], # x values for the two lines
                               'name':['State Median', ma_rent.loc[row,'areaname']] # names for the two lines
                               }
                             ]
                        )
                  )

### Draw the line graph users see initially

# draw the line for state median
fig.add_trace(go.Scatter(x=x_val,
                         y=ma_rent.iloc[0,1:],
                         name='State Median',
                         line=dict(color="darkgreen", dash="dash"))
             )


# draw the line for Barstable Town
fig.add_trace(go.Scatter(x=x_val,
                         y=ma_rent.iloc[1,1:],
                         name='Barnstable Town',
                         line=dict(color="blue", dash="dashdot"))
             )

This is the line graph the users see when they have not selected any area from the dropdown buttons. We haven’t put the dropdown buttons on the graph yet. Let’s do that.

# add the dropdown menu to the graph
fig.update_layout(
    updatemenus=[
        dict(
            active=0, #button with index 0 is active
            buttons=buttons, # add the buttons
            direction="down", # specify that it is a dropdown menu
            x=0.3,  # specify the position of the buttons along the x axis
            xanchor="left", 
            y=1.2, # specify the position of the buttons along the y axis
            yanchor="top",
            )            
            ],
    legend=
        dict(
            yanchor="top", 
            y=-0.1, # specify the position of the legend along the y axis
            xanchor="left",
            x=0.2, # specify the position of the legend along the x axis
            orientation='h'
            )
            )

Put everything together in a mock project#

From Pandas basics 1 to Pandas intermediate 3, you have learned how to do data cleaning and manipulation, how to summarize data and how to plot the data. In this section, you’ll do a small mock project that puts everything together.

Suppose you are assigned the following task: for the 10 non-US countries with the most runners in the 2019 Boston Marathon, you need to make an interactive bar chart showing the number of female and male runners from these 10 countries from 2017 Boston Marathon to 2019 Boston Marathon.

### get the data

# download the files
# Get the urls to the files and download the files
urls = ['https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2017.csv',
       'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2018.csv',
       'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2019.csv']
for url in urls:
    urllib.request.urlretrieve(url, '../data/'+url.rsplit('/')[-1][9:])
    
# Success message
print('Sample files ready.')

Sample files ready.

To make the desired interactive bar chart, we’ll need to get the number of female and the male runners from the 10 non-US countries with the most runners in 2019 Boston Marathon by year. This means, we will need to extract the relevant data and summarize it in the following way.

Data cleaning and manipulation#

Let’s first get all the data from the files that are relevant for us in this mock project.

### read the data into dfs (code example in pandas basics 1)

### explore the data to get a general idea (code example in pandas basics 1)

### get the 10 non-US countries with the most runners 
### in bm_19 (code example in pandas basics 2)


### reduce bm_19 to columns and rows of interest, 
### create a new column 'year' (code example in pandas basics 1 and 2, intermediate 2)

### reduce bm_17 and bm_18 to columns and rows of interest
### create a new column 'year'(code example in pandas basics 1 and 2, intermediate 2)

### concatenate the three reduced dfs(code example in pandas basics 3)

Data summary#

For each of the 10 non-US countries we would like to make a bar chart to show the number of female and male runners from 2017 to 2019. To achieve this purpose, we need to get the number of female and male runners from these countries by year.

### get the number of female and male runners from 
### each country by year (code example in pandas intermediate 1)

Plotting#

### make a figure with an initial bar chart for Australia
### which is alphabetically the first among the 10 countries 
### (code example in pandas intermediate 3)

### create the buttons in dropdown menu
### (code example in pandas intermediate 3)

### Add the dropdown menu to the chart
### specify how you would want to update the chart with different buttons
### (code example in pandas intermediate 3)

Lesson Complete#

Congratulations! You have completed Pandas Intermediate 3.

Exercise Solutions#

Here are a few solutions for exercises in this lesson.

Data cleaning and manipulation#

### make an interactive bar chart for boston marathon 17-19
### showing the number of female and male runners from the 
### 10 non-US countries with the most runners in 2019 

# download the files

# Get the urls to the files and download the files
urls = ['https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2017.csv',
       'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2018.csv',
       'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2019.csv']
for url in urls:
    urllib.request.urlretrieve(url, '../data/'+url.rsplit('/')[-1][9:])
    
# Success message
print('Sample files ready.')

# read the data into dfs
bm_17 = pd.read_csv('../data/BostonMarathon2017.csv')
bm_18 = pd.read_csv('../data/BostonMarathon2018.csv')
bm_19 = pd.read_csv('../data/BostonMarathon2019.csv')

# explore the data to get a general idea

Sample files ready.

# get the 10 non-US countries with the most runners in bm_19
ctry = bm_19.groupby('CountryOfResName').size().sort_values(ascending=0).iloc[1:11].sort_index().index

#reduce bm_19 to columns and rows of interest, 
#create a new column 'year'
bm_19 = bm_19.loc[bm_19['CountryOfResName'].isin(ctry),['CountryOfResName','Gender']].reset_index(drop=True).copy()
bm_19['Year']='2019'

# reduce bm_17 and bm_18 to columns and rows of interest,
# create a new column 'year' for them respectively
bm_17 = bm_17.loc[bm_17['CountryOfResName'].isin(ctry),['CountryOfResName','Gender']].reset_index(drop=True).copy()
bm_17['Year']='2017'

bm_18 = bm_18.loc[bm_18['CountryOfResName'].isin(ctry),['CountryOfResName','Gender']].reset_index(drop=True).copy()
bm_18['Year']='2018'

# concatenate the three reduced dfs
bm_17to19 = pd.concat([bm_17, bm_18, bm_19]).reset_index(drop=True)

Data summary#

### get the number of female and male runners from each country by year
bm_17to19 = bm_17to19.groupby(['CountryOfResName', 'Gender', 'Year']).size()

Plotting#

### make a figure with an initial bar chart for Australia
### which is alphabetically the first among the 10 countries
fig = go.Figure(
    data=[go.Bar(x=list(range(2017, 2020)),
            y=bm_17to19.loc[('Australia','F')],
               name='Female'
                     ),
    go.Bar(x=list(range(2017, 2020)),
            y=bm_17to19.loc[('Australia','M')],
               name='Male'
                     )
         ])

fig.show()

### create the buttons in the dropdown menu
buttons = []
x = list(range(2017,2020))
for c in ctry:
    buttons.append(dict(
                        label=c,
                        method='update',
                        visible=True,
                        args=[{'y': [bm_17to19.loc[(c,'F')],bm_17to19.loc[(c,'M')]],
                               'x': [x,x]
                               }
                             ]
                        )
                  )

### Add the dropdown menu to the chart
### specify how you would want to update the chart with different buttons
fig.update_layout(
    updatemenus=[
        dict(
            active=0, #button with index 0 is active
            buttons=buttons, # add the buttons
            direction="down", # specify it is a dropdown menu
            x=0.3, # specify where the button stands on the x axis
            xanchor="left",
            y=1.2, # specify where the button stands on the y axis
            yanchor="top",
            )            
            ],
    legend=
        dict(
            yanchor="top", 
            y=-0.1, # specify where the legend stands on the y axis
            xanchor="left",
            x=0.2, # specify where the legend stands on the x axis
            orientation='h'
            )
            )

Pandas Intermediate 3

Contents

Pandas Intermediate 3#

Interactive charts#

Put everything together in a mock project#

Data cleaning and manipulation#

Data summary#

Plotting#

Lesson Complete#

Exercise Solutions#

Data cleaning and manipulation#

Data summary#

Plotting#