Pandas Intermediate 3#

Description: This notebook discusses:

  • How to use the backend Plotly to make an interactive chart

  • The entire pipeline from data cleaning and manipulation, data summary to plotting in Pandas

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Intermediate

Knowledge Required:

Knowledge Recommended:

Completion Time: 90 minutes

Data Format: csv

Libraries Used: Pandas

Research Pipeline: None


# download plotly
%pip install plotly

# make sure that the plots are rendered properly
import plotly.io as pio
pio.renderers.default = "notebook" 
Requirement already satisfied: plotly in /opt/homebrew/lib/python3.11/site-packages (6.2.0)
Requirement already satisfied: narwhals>=1.15.1 in /opt/homebrew/lib/python3.11/site-packages (from plotly) (1.46.0)
Requirement already satisfied: packaging in /Users/mearacox/Library/Python/3.11/lib/python/site-packages (from plotly) (25.0)
Note: you may need to restart the kernel to use updated packages.
# import Pandas 
import pandas as pd

# import urllib
import urllib
# choose Plotly as the backend for plotting
pd.options.plotting.backend='plotly'
# import the graph_objects module from plotly
import plotly.graph_objects as go 

Interactive charts#

Interactive charts can effectively tell a story. They also allow the audience to explore the information in a gradual and interactive way. The process of exploration is also a knowledge-building process for the audience.

In this section, we are going to make an interactive line graph.

As discussed in Pandas intermediate 2, a line graph is usually used to show the change in a value of interest over time. Suppose we are interested in the change of the median of the rent of a 1-bedroom apartment from 2019 - 2023 in the different areas of the state of Massachusetts.

### load the data

# download the file
from pathlib import Path
import urllib.request

# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)

# Download the file
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/PandasIntermediate3_ma_rent_1b_median.csv'
file = '../data/' + url.split('/')[-1]
urllib.request.urlretrieve(url, file)

# Download success message
print('Sample file ready.')

# read data into a df
ma_rent = pd.read_csv(file)

# take a look at the df
ma_rent
Sample file ready.
areaname rent19 rent20 rent21 rent22 rent23
0 State median 1115 1133 1211 1307 1459
1 Barnstable Town 1237 1215 1334 1509 1655
2 Boston-Cambridge-Quincy 1904 2008 2034 2139 2368
3 Brockton 1215 1244 1270 1392 1516
4 Lawrence 1123 1184 1219 1307 1487
5 Lowell 1264 1270 1282 1450 1587
6 Pittsfield 885 862 962 1065 1161
7 Berkshire County 1014 961 961 1086 1148
8 Providence-Fall River 948 958 1030 1110 1275
9 Taunton-Mansfield-Norton 1024 1017 1073 1198 1339
10 Easton-Raynham 1181 1211 1297 1495 1759
11 New Bedford 826 834 875 958 1106
12 Springfield 874 922 954 940 1059
13 Worcester 1021 1178 1214 1243 1361
14 Eastern Worcester County 1012 1016 1043 1138 1372
15 Fitchburg-Leominster 881 859 865 933 1116
16 Western Worcester County 773 781 779 836 1001
17 Dukes County 1641 1650 1855 2068 2200
18 Franklin County 929 907 1043 982 1066
19 Nantucket County 1426 1447 1925 1983 2142
### explore the data
# what is the min rent over the years
print(ma_rent.min())

# what is the max rent over the years
print(ma_rent.max())
areaname    Barnstable Town
rent19                  773
rent20                  781
rent21                  779
rent22                  836
rent23                 1001
dtype: object
areaname    Worcester
rent19           1904
rent20           2008
rent21           2034
rent22           2139
rent23           2368
dtype: object

How would we want to make an interactive line graph? One possibility is that we can make a dropdown menu of the different areas of MA. Depending on which area in MA the user selects, a line graph showing the change in the median rent of a 1-bedroom apartment in that area will be displayed. We can also put a line representing the state median rent on the graph as a benchmark.

### make a figure with appropriate y axies range
fig = go.Figure(layout_yaxis_range=[700,2500])
### make the dropdown menu
buttons = []
x_val = ['2019', '2020', '2021', '2022', '2023'] # specify the x values
for row in ma_rent.index[1:]: # loop through all rows except the state median row
    buttons.append(dict(
                        label=ma_rent.loc[row,'areaname'], # get the area name
                        method='update', # specify how we will modify the chart when clicking on a button 
                        visible=True, # specify that data points will be shown
                        args=[{'y': [ma_rent.iloc[0,1:], ma_rent.iloc[row,1:]], # y values for the two lines
                               'x': [x_val,x_val], # x values for the two lines
                               'name':['State Median', ma_rent.loc[row,'areaname']] # names for the two lines
                               }
                             ]
                        )
                  )
### Draw the line graph users see initially

# draw the line for state median
fig.add_trace(go.Scatter(x=x_val,
                         y=ma_rent.iloc[0,1:],
                         name='State Median',
                         line=dict(color="darkgreen", dash="dash"))
             )


# draw the line for Barstable Town
fig.add_trace(go.Scatter(x=x_val,
                         y=ma_rent.iloc[1,1:],
                         name='Barnstable Town',
                         line=dict(color="blue", dash="dashdot"))
             )

This is the line graph the users see when they have not selected any area from the dropdown buttons. We haven’t put the dropdown buttons on the graph yet. Let’s do that.

# add the dropdown menu to the graph
fig.update_layout(
    updatemenus=[
        dict(
            active=0, #button with index 0 is active
            buttons=buttons, # add the buttons
            direction="down", # specify that it is a dropdown menu
            x=0.3,  # specify the position of the buttons along the x axis
            xanchor="left", 
            y=1.2, # specify the position of the buttons along the y axis
            yanchor="top",
            )            
            ],
    legend=
        dict(
            yanchor="top", 
            y=-0.1, # specify the position of the legend along the y axis
            xanchor="left",
            x=0.2, # specify the position of the legend along the x axis
            orientation='h'
            )
            )

Put everything together in a mock project#

From Pandas basics 1 to Pandas intermediate 3, you have learned how to do data cleaning and manipulation, how to summarize data and how to plot the data. In this section, you’ll do a small mock project that puts everything together.

Suppose you are assigned the following task: for the 10 non-US countries with the most runners in the 2019 Boston Marathon, you need to make an interactive bar chart showing the number of female and male runners from these 10 countries from 2017 Boston Marathon to 2019 Boston Marathon.

### get the data

# download the files
# Get the urls to the files and download the files
urls = ['https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2017.csv',
       'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2018.csv',
       'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2019.csv']
for url in urls:
    urllib.request.urlretrieve(url, '../data/'+url.rsplit('/')[-1][9:])
    
# Success message
print('Sample files ready.')
Sample files ready.

To make the desired interactive bar chart, we’ll need to get the number of female and the male runners from the 10 non-US countries with the most runners in 2019 Boston Marathon by year. This means, we will need to extract the relevant data and summarize it in the following way.

Data cleaning and manipulation#

Let’s first get all the data from the files that are relevant for us in this mock project.

### read the data into dfs (code example in pandas basics 1)
### explore the data to get a general idea (code example in pandas basics 1)
### get the 10 non-US countries with the most runners 
### in bm_19 (code example in pandas basics 2)


### reduce bm_19 to columns and rows of interest, 
### create a new column 'year' (code example in pandas basics 1 and 2, intermediate 2)
### reduce bm_17 and bm_18 to columns and rows of interest
### create a new column 'year'(code example in pandas basics 1 and 2, intermediate 2)
### concatenate the three reduced dfs(code example in pandas basics 3)

Data summary#

For each of the 10 non-US countries we would like to make a bar chart to show the number of female and male runners from 2017 to 2019. To achieve this purpose, we need to get the number of female and male runners from these countries by year.

### get the number of female and male runners from 
### each country by year (code example in pandas intermediate 1)

Plotting#

### make a figure with an initial bar chart for Australia
### which is alphabetically the first among the 10 countries 
### (code example in pandas intermediate 3)
### create the buttons in dropdown menu
### (code example in pandas intermediate 3)
### Add the dropdown menu to the chart
### specify how you would want to update the chart with different buttons
### (code example in pandas intermediate 3)

Lesson Complete#

Congratulations! You have completed Pandas Intermediate 3.

Exercise Solutions#

Here are a few solutions for exercises in this lesson.

Data cleaning and manipulation#

### make an interactive bar chart for boston marathon 17-19
### showing the number of female and male runners from the 
### 10 non-US countries with the most runners in 2019 

# download the files

# Get the urls to the files and download the files
urls = ['https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2017.csv',
       'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2018.csv',
       'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2019.csv']
for url in urls:
    urllib.request.urlretrieve(url, '../data/'+url.rsplit('/')[-1][9:])
    
# Success message
print('Sample files ready.')

# read the data into dfs
bm_17 = pd.read_csv('../data/BostonMarathon2017.csv')
bm_18 = pd.read_csv('../data/BostonMarathon2018.csv')
bm_19 = pd.read_csv('../data/BostonMarathon2019.csv')

# explore the data to get a general idea
Sample files ready.
# get the 10 non-US countries with the most runners in bm_19
ctry = bm_19.groupby('CountryOfResName').size().sort_values(ascending=0).iloc[1:11].sort_index().index

#reduce bm_19 to columns and rows of interest, 
#create a new column 'year'
bm_19 = bm_19.loc[bm_19['CountryOfResName'].isin(ctry),['CountryOfResName','Gender']].reset_index(drop=True).copy()
bm_19['Year']='2019'
# reduce bm_17 and bm_18 to columns and rows of interest,
# create a new column 'year' for them respectively
bm_17 = bm_17.loc[bm_17['CountryOfResName'].isin(ctry),['CountryOfResName','Gender']].reset_index(drop=True).copy()
bm_17['Year']='2017'

bm_18 = bm_18.loc[bm_18['CountryOfResName'].isin(ctry),['CountryOfResName','Gender']].reset_index(drop=True).copy()
bm_18['Year']='2018'
# concatenate the three reduced dfs
bm_17to19 = pd.concat([bm_17, bm_18, bm_19]).reset_index(drop=True)

Data summary#

### get the number of female and male runners from each country by year
bm_17to19 = bm_17to19.groupby(['CountryOfResName', 'Gender', 'Year']).size()

Plotting#

### make a figure with an initial bar chart for Australia
### which is alphabetically the first among the 10 countries
fig = go.Figure(
    data=[go.Bar(x=list(range(2017, 2020)),
            y=bm_17to19.loc[('Australia','F')],
               name='Female'
                     ),
    go.Bar(x=list(range(2017, 2020)),
            y=bm_17to19.loc[('Australia','M')],
               name='Male'
                     )
         ])

fig.show()
### create the buttons in the dropdown menu
buttons = []
x = list(range(2017,2020))
for c in ctry:
    buttons.append(dict(
                        label=c,
                        method='update',
                        visible=True,
                        args=[{'y': [bm_17to19.loc[(c,'F')],bm_17to19.loc[(c,'M')]],
                               'x': [x,x]
                               }
                             ]
                        )
                  )
### Add the dropdown menu to the chart
### specify how you would want to update the chart with different buttons
fig.update_layout(
    updatemenus=[
        dict(
            active=0, #button with index 0 is active
            buttons=buttons, # add the buttons
            direction="down", # specify it is a dropdown menu
            x=0.3, # specify where the button stands on the x axis
            xanchor="left",
            y=1.2, # specify where the button stands on the y axis
            yanchor="top",
            )            
            ],
    legend=
        dict(
            yanchor="top", 
            y=-0.1, # specify where the legend stands on the y axis
            xanchor="left",
            x=0.2, # specify where the legend stands on the x axis
            orientation='h'
            )
            )