This notebook is adapted by Zhuo Chen from the notebooks created by Nathan Kelber, William Mattingly and Melanie Walsh under Creative Commons CC BY License
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org.


Pandas Basics 2#

Description: This notebook describes how to:

  • Filter data in a dataframe

  • Work with missing values in a dataframe

  • Index a dataframe

  • Sort a dataframe

This is the second notebook in a series on learning to use Pandas.

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Beginner

Knowledge Required:

Knowledge Recommended:

Completion Time: 90 minutes

Data Format: CSV (.csv)

Libraries Used: Pandas

Research Pipeline: None


# Import pandas library, `as pd` allows us to shorten typing `pandas` to `pd` when we call pandas
import pandas as pd
# Change the display setting
pd.set_option('display.min_rows', 20) # set the minimum number of rows to display to 20

Filter data in a dataframe#

A common pipeline in data processing in Pandas is that you create a dataframe from a file and then reduce the dataframe only to the rows and columns that you are interested in.

We have learned how to use .loc and .iloc to select part of a dataframe in Pandas Basics 1. We will learn more ways to do data selection in this section.

# download the sample file
import urllib
from pathlib import Path

# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)

url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/Pandas1_failed_banks_since_2000.csv'
file = '../data/failed_banks.csv'
urllib.request.urlretrieve(url, file)
print('Sample file ready.')

# Read in the data
banks_df = pd.read_csv(file)
banks_df
Sample file ready.
Bank Name City State Cert Acquiring Institution Closing Date Fund
0 First Republic Bank San Francisco CA 59017 JPMorgan Chase Bank, N.A. 1-May-23 10543
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23 10540
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23 10539
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20 10538
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20 10537
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20 10536
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20 10535
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19 10534
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19 10533
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19 10532
... ... ... ... ... ... ... ...
556 NextBank, NA Phoenix AZ 22314 NaN 7-Feb-02 4653
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02 4652
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02 4651
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02 4650
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01 4649
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01 6004
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01 4648
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01 4647
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00 4646
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00 4645

566 rows × 7 columns

The drop() method#

After creating a dataframe from a file, oftentimes we will drop certain rows or columns because we will not use them in the analysis anyways. In this case, we will use the .drop() method to remove those rows or columns.

We can specify which column(s) to drop by using the ‘columns’ parameter.

# Drop a column by setting the columns parameter
banks_df.drop(columns='Fund')
Bank Name City State Cert Acquiring Institution Closing Date
0 First Republic Bank San Francisco CA 59017 JPMorgan Chase Bank, N.A. 1-May-23
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19
... ... ... ... ... ... ...
556 NextBank, NA Phoenix AZ 22314 NaN 7-Feb-02
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00

566 rows × 6 columns

We can also drop multiple columns at one time. In this case, we just put the names for the columns we want to drop in a list and set the value of the columns parameter to the list.

# Drop multiple columns by setting the columns parameter
banks_df.drop(columns=['Fund', 'Cert'])
Bank Name City State Acquiring Institution Closing Date
0 First Republic Bank San Francisco CA JPMorgan Chase Bank, N.A. 1-May-23
1 Signature Bank New York NY Flagstar Bank, N.A. 12-Mar-23
2 Silicon Valley Bank Santa Clara CA First Citizens Bank & Trust Company 10-Mar-23
3 Almena State Bank Almena KS Equity Bank 23-Oct-20
4 First City Bank of Florida Fort Walton Beach FL United Fidelity Bank, fsb 16-Oct-20
5 The First State Bank Barboursville WV MVB Bank, Inc. 3-Apr-20
6 Ericson State Bank Ericson NE Farmers and Merchants Bank 14-Feb-20
7 City National Bank of New Jersey Newark NJ Industrial Bank 1-Nov-19
8 Resolute Bank Maumee OH Buckeye State Bank 25-Oct-19
9 Louisa Community Bank Louisa KY Kentucky Farmers Bank Corporation 25-Oct-19
... ... ... ... ... ...
556 NextBank, NA Phoenix AZ NaN 7-Feb-02
557 Oakwood Deposit Bank Co. Oakwood OH The State Bank & Trust Company 1-Feb-02
558 Bank of Sierra Blanca Sierra Blanca TX The Security State Bank of Pecos 18-Jan-02
559 Hamilton Bank, NA Miami FL Israel Discount Bank of New York 11-Jan-02
560 Sinclair National Bank Gravette AR Delta Trust & Bank 7-Sep-01
561 Superior Bank, FSB Hinsdale IL Superior Federal, FSB 27-Jul-01
562 Malta National Bank Malta OH North Valley Bank 3-May-01
563 First Alliance Bank & Trust Co. Manchester NH Southern New Hampshire Bank & Trust 2-Feb-01
564 National State Bank of Metropolis Metropolis IL Banterra Bank of Marion 14-Dec-00
565 Bank of Honolulu Honolulu HI Bank of the Orient 13-Oct-00

566 rows × 5 columns

Another way to drop a column is to give the label of the column you want to drop and then set the axis parameter to 1.

# Drop a column by setting the axis parameter
banks_df.drop('Fund', axis=1)
Bank Name City State Cert Acquiring Institution Closing Date
0 First Republic Bank San Francisco CA 59017 JPMorgan Chase Bank, N.A. 1-May-23
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19
... ... ... ... ... ... ...
556 NextBank, NA Phoenix AZ 22314 NaN 7-Feb-02
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00

566 rows × 6 columns

It might not be intuitive to you that axis 1 refers to the columns. Luckily, Pandas also allows us to set the axis parameter to columns when dropping columns.

# Drop a column
banks_df.drop('Fund', axis='columns')
Bank Name City State Cert Acquiring Institution Closing Date
0 First Republic Bank San Francisco CA 59017 JPMorgan Chase Bank, N.A. 1-May-23
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19
... ... ... ... ... ... ...
556 NextBank, NA Phoenix AZ 22314 NaN 7-Feb-02
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00

566 rows × 6 columns

You can also drop multiple columns by setting the axis parameter.

# Drop multiple columns by setting the axis parameter
banks_df.drop(['Fund', 'Cert'], axis=1)
Bank Name City State Acquiring Institution Closing Date
0 First Republic Bank San Francisco CA JPMorgan Chase Bank, N.A. 1-May-23
1 Signature Bank New York NY Flagstar Bank, N.A. 12-Mar-23
2 Silicon Valley Bank Santa Clara CA First Citizens Bank & Trust Company 10-Mar-23
3 Almena State Bank Almena KS Equity Bank 23-Oct-20
4 First City Bank of Florida Fort Walton Beach FL United Fidelity Bank, fsb 16-Oct-20
5 The First State Bank Barboursville WV MVB Bank, Inc. 3-Apr-20
6 Ericson State Bank Ericson NE Farmers and Merchants Bank 14-Feb-20
7 City National Bank of New Jersey Newark NJ Industrial Bank 1-Nov-19
8 Resolute Bank Maumee OH Buckeye State Bank 25-Oct-19
9 Louisa Community Bank Louisa KY Kentucky Farmers Bank Corporation 25-Oct-19
... ... ... ... ... ...
556 NextBank, NA Phoenix AZ NaN 7-Feb-02
557 Oakwood Deposit Bank Co. Oakwood OH The State Bank & Trust Company 1-Feb-02
558 Bank of Sierra Blanca Sierra Blanca TX The Security State Bank of Pecos 18-Jan-02
559 Hamilton Bank, NA Miami FL Israel Discount Bank of New York 11-Jan-02
560 Sinclair National Bank Gravette AR Delta Trust & Bank 7-Sep-01
561 Superior Bank, FSB Hinsdale IL Superior Federal, FSB 27-Jul-01
562 Malta National Bank Malta OH North Valley Bank 3-May-01
563 First Alliance Bank & Trust Co. Manchester NH Southern New Hampshire Bank & Trust 2-Feb-01
564 National State Bank of Metropolis Metropolis IL Banterra Bank of Marion 14-Dec-00
565 Bank of Honolulu Honolulu HI Bank of the Orient 13-Oct-00

566 rows × 5 columns

To drop a row, you just pass the index number of the row to the .drop() method.

# Drop a row 
banks_df.drop(0)
Bank Name City State Cert Acquiring Institution Closing Date Fund
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23 10540
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23 10539
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20 10538
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20 10537
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20 10536
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20 10535
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19 10534
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19 10533
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19 10532
10 The Enloe State Bank Cooper TX 10716 Legend Bank, N. A. 31-May-19 10531
... ... ... ... ... ... ... ...
556 NextBank, NA Phoenix AZ 22314 NaN 7-Feb-02 4653
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02 4652
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02 4651
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02 4650
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01 4649
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01 6004
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01 4648
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01 4647
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00 4646
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00 4645

565 rows × 7 columns

If you assign the index number of the row to drop to the parameter index, you will get the same result.

# Drop a row by setting the index parameter
banks_df.drop(index=0)
Bank Name City State Cert Acquiring Institution Closing Date Fund
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23 10540
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23 10539
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20 10538
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20 10537
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20 10536
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20 10535
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19 10534
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19 10533
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19 10532
10 The Enloe State Bank Cooper TX 10716 Legend Bank, N. A. 31-May-19 10531
... ... ... ... ... ... ... ...
556 NextBank, NA Phoenix AZ 22314 NaN 7-Feb-02 4653
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02 4652
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02 4651
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02 4650
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01 4649
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01 6004
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01 4648
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01 4647
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00 4646
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00 4645

565 rows × 7 columns

To drop multiple rows, we just pass a list of index numbers to the .drop() method. Note that we don’t need to specify the axis parameter. By default, Pandas knows that the list of index numbers are for the rows.

# Drop multiple rows 
banks_df.drop([0,1])
Bank Name City State Cert Acquiring Institution Closing Date Fund
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23 10539
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20 10538
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20 10537
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20 10536
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20 10535
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19 10534
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19 10533
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19 10532
10 The Enloe State Bank Cooper TX 10716 Legend Bank, N. A. 31-May-19 10531
11 Washington Federal Bank for Savings Chicago IL 30570 Royal Savings Bank 15-Dec-17 10530
... ... ... ... ... ... ... ...
556 NextBank, NA Phoenix AZ 22314 NaN 7-Feb-02 4653
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02 4652
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02 4651
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02 4650
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01 4649
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01 6004
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01 4648
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01 4647
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00 4646
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00 4645

564 rows × 7 columns

Of course, you could specify the axis parameter to 0 or ‘rows’ to tell Pandas that you want to drop the rows with the given index numbers.

# Drop rows by setting the axis parameter
banks_df.drop([0,1], axis=0)
Bank Name City State Cert Acquiring Institution Closing Date Fund
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23 10539
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20 10538
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20 10537
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20 10536
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20 10535
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19 10534
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19 10533
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19 10532
10 The Enloe State Bank Cooper TX 10716 Legend Bank, N. A. 31-May-19 10531
11 Washington Federal Bank for Savings Chicago IL 30570 Royal Savings Bank 15-Dec-17 10530
... ... ... ... ... ... ... ...
556 NextBank, NA Phoenix AZ 22314 NaN 7-Feb-02 4653
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02 4652
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02 4651
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02 4650
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01 4649
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01 6004
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01 4648
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01 4647
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00 4646
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00 4645

564 rows × 7 columns

# Drop rows by setting the axis parameter 
banks_df.drop([0,1], axis='rows')
Bank Name City State Cert Acquiring Institution Closing Date Fund
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23 10539
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20 10538
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20 10537
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20 10536
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20 10535
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19 10534
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19 10533
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19 10532
10 The Enloe State Bank Cooper TX 10716 Legend Bank, N. A. 31-May-19 10531
11 Washington Federal Bank for Savings Chicago IL 30570 Royal Savings Bank 15-Dec-17 10530
... ... ... ... ... ... ... ...
556 NextBank, NA Phoenix AZ 22314 NaN 7-Feb-02 4653
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02 4652
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02 4651
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02 4650
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01 4649
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01 6004
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01 4648
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01 4647
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00 4646
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00 4645

564 rows × 7 columns

Note that the .drop() method only returns a copy. This means that any change you make using the .drop() method will not affect the original dataframe. To make the change permanent, you can assign the result to the variable where you store the original dataframe to update it.

Coding Challenge! < / >

In the exercises in this notebook, we’ll work on a dataset built from Constellate.

We’ll use the constellate client to automatically retrieve the metadata for a dataset. We can retrieve the metadata in a CSV file using the get_metadata method.

# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
# retrieve the metadata
import constellate
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"
metadata = constellate.get_metadata(dataset_id)
# Check the type of the value stored in metadata
type(metadata)
# print out the string stored in metadata
print(metadata)

The metadata is stored in a .csv file. In the following code cell, read in the data using Pandas. Give the dataframe a name. Then print out the dataframe to take a look.

# Convert the csv file to a dataframe
shake_df = pd.read_csv(metadata)
shake_df

Use a Pandas method to explore the dataframe. How many rows does it have? How many columns does it have? What is the data type of the data in each column?

# Use a Pandas method to explore the data

When you explore the Shakespeare dataframe, what did you find about the column doi? What did you find about the column placeOfPublication? Drop the two columns.

# Drop the 'doi' and 'placeOfPublication' columns

Work with missing values#

It is a common case that datasets have missing values. As you may have already noticed, blank cells in a csv file show up as NaN in a Pandas DataFrame. For example, in the dataset of failed banks, the Acquiring Institution column gives the name when a failed bank was acquired by another institution and has a value of NaN otherwise.

# Take a look at the missing values in banks_df
banks_df
Bank Name City State Cert Acquiring Institution Closing Date Fund
0 First Republic Bank San Francisco CA 59017 JPMorgan Chase Bank, N.A. 1-May-23 10543
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23 10540
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23 10539
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20 10538
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20 10537
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20 10536
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20 10535
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19 10534
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19 10533
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19 10532
... ... ... ... ... ... ... ...
556 NextBank, NA Phoenix AZ 22314 NaN 7-Feb-02 4653
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02 4652
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02 4651
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02 4650
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01 4649
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01 6004
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01 4648
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01 4647
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00 4646
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00 4645

566 rows × 7 columns

To quickly get an idea of whether a dataframe has missing values, we can always use the info() method introduced in Pandas Basics 1 to explore the data.

# Use info() to see whether there are missing values
banks_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 566 entries, 0 to 565
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Bank Name              566 non-null    object
 1   City                   566 non-null    object
 2   State                  566 non-null    object
 3   Cert                   566 non-null    int64 
 4   Acquiring Institution  535 non-null    object
 5   Closing Date           566 non-null    object
 6   Fund                   566 non-null    int64 
dtypes: int64(2), object(5)
memory usage: 31.1+ KB

In Pandas, we can also use the .isna() method to check whether a pandas series or a dataframe has missing values. What this method does is it creates a boolean mask over the data.

# Use isna() to check whether a dataframe has missing values
banks_df.isna()
Bank Name City State Cert Acquiring Institution Closing Date Fund
0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
5 False False False False False False False
6 False False False False False False False
7 False False False False False False False
8 False False False False False False False
9 False False False False False False False
... ... ... ... ... ... ... ...
556 False False False False True False False
557 False False False False False False False
558 False False False False False False False
559 False False False False False False False
560 False False False False False False False
561 False False False False False False False
562 False False False False False False False
563 False False False False False False False
564 False False False False False False False
565 False False False False False False False

566 rows × 7 columns

As we can see, the cells with a non-null value are masked with the boolean value of False. The cells with a null value are masked with the boolean value of True.

Since isna() can check whether a pandas series has missing values, we can use .isna() to see whether a specific column in a dataframe has missing values.

# Use isna() to check whether a column has missing values
banks_df['Acquiring Institution'].isna()
0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
       ...  
556     True
557    False
558    False
559    False
560    False
561    False
562    False
563    False
564    False
565    False
Name: Acquiring Institution, Length: 566, dtype: bool

Drop rows and columns with missing values#

If you want to exclude the rows and columns with missing values from your data analysis, you can use the .dropna() method to do that.

By default, the .dropna() method drops the rows with at least one missing value.

# Use .dropna() to remove all rows with at least one missing value
banks_df.dropna() # no argument passed in
Bank Name City State Cert Acquiring Institution Closing Date Fund
0 First Republic Bank San Francisco CA 59017 JPMorgan Chase Bank, N.A. 1-May-23 10543
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23 10540
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23 10539
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20 10538
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20 10537
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20 10536
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20 10535
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19 10534
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19 10533
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19 10532
... ... ... ... ... ... ... ...
555 Net 1st National Bank Boca Raton FL 26652 Bank Leumi USA 1-Mar-02 4654
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02 4652
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02 4651
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02 4650
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01 4649
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01 6004
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01 4648
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01 4647
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00 4646
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00 4645

535 rows × 7 columns

Based on what you have learned about the .drop() method, you now know that you can also set the axis parameter to 0 or ‘rows’ to drop the rows with missing values.

# Use .dropna() to remove all rows with at least one missing value
banks_df.dropna(axis=0) # Set the axis to 0
Bank Name City State Cert Acquiring Institution Closing Date Fund
0 First Republic Bank San Francisco CA 59017 JPMorgan Chase Bank, N.A. 1-May-23 10543
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23 10540
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23 10539
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20 10538
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20 10537
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20 10536
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20 10535
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19 10534
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19 10533
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19 10532
... ... ... ... ... ... ... ...
555 Net 1st National Bank Boca Raton FL 26652 Bank Leumi USA 1-Mar-02 4654
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02 4652
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02 4651
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02 4650
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01 4649
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01 6004
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01 4648
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01 4647
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00 4646
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00 4645

535 rows × 7 columns

# Use .dropna() to remove all rows with at least one missing value
banks_df.dropna(axis='rows') # Set the axis to 'rows'
Bank Name City State Cert Acquiring Institution Closing Date Fund
0 First Republic Bank San Francisco CA 59017 JPMorgan Chase Bank, N.A. 1-May-23 10543
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23 10540
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23 10539
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20 10538
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20 10537
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20 10536
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20 10535
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19 10534
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19 10533
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19 10532
... ... ... ... ... ... ... ...
555 Net 1st National Bank Boca Raton FL 26652 Bank Leumi USA 1-Mar-02 4654
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02 4652
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02 4651
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02 4650
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01 4649
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01 6004
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01 4648
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01 4647
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00 4646
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00 4645

535 rows × 7 columns

If you set the axis parameter to 1 or ‘columns’, you will drop the columns with missing values.

# Use .dropna() to remove all columns with at least one missing value
banks_df.dropna(axis=1) # set the axis to 1
Bank Name City State Cert Closing Date Fund
0 First Republic Bank San Francisco CA 59017 1-May-23 10543
1 Signature Bank New York NY 57053 12-Mar-23 10540
2 Silicon Valley Bank Santa Clara CA 24735 10-Mar-23 10539
3 Almena State Bank Almena KS 15426 23-Oct-20 10538
4 First City Bank of Florida Fort Walton Beach FL 16748 16-Oct-20 10537
5 The First State Bank Barboursville WV 14361 3-Apr-20 10536
6 Ericson State Bank Ericson NE 18265 14-Feb-20 10535
7 City National Bank of New Jersey Newark NJ 21111 1-Nov-19 10534
8 Resolute Bank Maumee OH 58317 25-Oct-19 10533
9 Louisa Community Bank Louisa KY 58112 25-Oct-19 10532
... ... ... ... ... ... ...
556 NextBank, NA Phoenix AZ 22314 7-Feb-02 4653
557 Oakwood Deposit Bank Co. Oakwood OH 8966 1-Feb-02 4652
558 Bank of Sierra Blanca Sierra Blanca TX 22002 18-Jan-02 4651
559 Hamilton Bank, NA Miami FL 24382 11-Jan-02 4650
560 Sinclair National Bank Gravette AR 34248 7-Sep-01 4649
561 Superior Bank, FSB Hinsdale IL 32646 27-Jul-01 6004
562 Malta National Bank Malta OH 6629 3-May-01 4648
563 First Alliance Bank & Trust Co. Manchester NH 34264 2-Feb-01 4647
564 National State Bank of Metropolis Metropolis IL 3815 14-Dec-00 4646
565 Bank of Honolulu Honolulu HI 21029 13-Oct-00 4645

566 rows × 6 columns

# Use .dropna() to remove all columns with at least one missing value
banks_df.dropna(axis='columns') # set the axis to 'columns'
Bank Name City State Cert Closing Date Fund
0 First Republic Bank San Francisco CA 59017 1-May-23 10543
1 Signature Bank New York NY 57053 12-Mar-23 10540
2 Silicon Valley Bank Santa Clara CA 24735 10-Mar-23 10539
3 Almena State Bank Almena KS 15426 23-Oct-20 10538
4 First City Bank of Florida Fort Walton Beach FL 16748 16-Oct-20 10537
5 The First State Bank Barboursville WV 14361 3-Apr-20 10536
6 Ericson State Bank Ericson NE 18265 14-Feb-20 10535
7 City National Bank of New Jersey Newark NJ 21111 1-Nov-19 10534
8 Resolute Bank Maumee OH 58317 25-Oct-19 10533
9 Louisa Community Bank Louisa KY 58112 25-Oct-19 10532
... ... ... ... ... ... ...
556 NextBank, NA Phoenix AZ 22314 7-Feb-02 4653
557 Oakwood Deposit Bank Co. Oakwood OH 8966 1-Feb-02 4652
558 Bank of Sierra Blanca Sierra Blanca TX 22002 18-Jan-02 4651
559 Hamilton Bank, NA Miami FL 24382 11-Jan-02 4650
560 Sinclair National Bank Gravette AR 34248 7-Sep-01 4649
561 Superior Bank, FSB Hinsdale IL 32646 27-Jul-01 6004
562 Malta National Bank Malta OH 6629 3-May-01 4648
563 First Alliance Bank & Trust Co. Manchester NH 34264 2-Feb-01 4647
564 National State Bank of Metropolis Metropolis IL 3815 14-Dec-00 4646
565 Bank of Honolulu Honolulu HI 21029 13-Oct-00 4645

566 rows × 6 columns

Sometimes we would want to drop a row only if that row has a missing value in a specific column. We can use the subset parameter to specify the column(s) to look for missing values.

# Specify the columns to look for missing values
banks_df.dropna(subset=['Acquiring Institution', 'City'])
Bank Name City State Cert Acquiring Institution Closing Date Fund
0 First Republic Bank San Francisco CA 59017 JPMorgan Chase Bank, N.A. 1-May-23 10543
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23 10540
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23 10539
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20 10538
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20 10537
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20 10536
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20 10535
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19 10534
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19 10533
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19 10532
... ... ... ... ... ... ... ...
555 Net 1st National Bank Boca Raton FL 26652 Bank Leumi USA 1-Mar-02 4654
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02 4652
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02 4651
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02 4650
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01 4649
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01 6004
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01 4648
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01 4647
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00 4646
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00 4645

535 rows × 7 columns

Note that the .dropna() method only returns a copy. This means that any change you make using the .dropna() method will not affect the original dataframe. To make the change permanent, you can assign the result to the variable where you store the original dataframe to update it.

Oftentimes, you would want to maintain the rows and columns that have missing values. However, you would want to fill the cells with NaN values with some values which are of the same data type as the other cells in the same column. In this way, when you apply a certain function to a column in a dataframe, you will not run into type error. A common practice to deal with this kind of problem is to use the .fillna() method.

# Fill the missing values
banks_df['Acquiring Institution'].fillna('No Acquirer')
0                JPMorgan Chase Bank, N.A.
1                      Flagstar Bank, N.A.
2      First Citizens Bank & Trust Company
3                              Equity Bank
4                United Fidelity Bank, fsb
5                           MVB Bank, Inc.
6               Farmers and Merchants Bank
7                          Industrial Bank
8                       Buckeye State Bank
9        Kentucky Farmers Bank Corporation
                      ...                 
556                            No Acquirer
557         The State Bank & Trust Company
558       The Security State Bank of Pecos
559       Israel Discount Bank of New York
560                     Delta Trust & Bank
561                  Superior Federal, FSB
562                      North Valley Bank
563    Southern New Hampshire Bank & Trust
564                Banterra Bank of Marion
565                     Bank of the Orient
Name: Acquiring Institution, Length: 566, dtype: object

Coding Challenge! < / >

Let’s grab the Shakespeare dataframe we have created and do some further filtering.

When you explore the Shakespeare dataframe, what did you find about the column pagestart? What did you find about the column pageEnd? Suppose you will need the page range information in your analysis. Therefore, you will want to drop any rows with missing values in these two columns. How do you do it?

# Drop any rows with missing values in 'pageStart' or 'pageEnd'

Filter data using conditionals#

Conditional selection using df.loc[] is a very common method to filter a dataframe.

You write a filtering condition to filter a target column. The condition then checks, for each cell in the target column, whether it fulfills the condition or not. The results will be returned as a Series of True/False values. The .loc indexer then uses this Series to select the rows that have True values.

Suppose you are interested in the banks that failed since 2000 in the state of Georgia. From the original dataframe, you would like to get all the rows of the failed banks in Georgia. How do you do it?

# Write a filtering condition
banks_df['State'] == 'GA' # Create a boolean mask over the column 'State'
0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
       ...  
556    False
557    False
558    False
559    False
560    False
561    False
562    False
563    False
564    False
565    False
Name: State, Length: 566, dtype: bool
# Assign the filtering condition to a variable
filt = (banks_df['State'] == 'GA') # Use parenthesis for better reading
# Put the Series returned by the filtering condition within the hard brackets of banks_df.loc[]
banks_df.loc[filt]
Bank Name City State Cert Acquiring Institution Closing Date Fund
20 The Woodbury Banking Company Woodbury GA 11297 United Bank 19-Aug-16 10521
25 The Bank of Georgia Peachtree City GA 35259 Fidelity Bank 2-Oct-15 10516
29 Capitol City Bank & Trust Company Atlanta GA 33938 First-Citizens Bank & Trust Company 13-Feb-15 10512
37 Eastside Commercial Bank Conyers GA 58125 Community & Southern Bank 18-Jul-14 10504
62 Sunrise Bank Valdosta GA 58185 Synovus Bank 10-May-13 10481
64 Douglas County Bank Douglasville GA 21649 Hamilton State Bank 26-Apr-13 10476
70 Frontier Bank LaGrange GA 16431 HeritageBank of the South 8-Mar-13 10471
75 Hometown Community Bank Braselton GA 57928 CertusBank, National Association 16-Nov-12 10466
86 Jasper Banking Company Jasper GA 16240 Stearns Bank N.A. 27-Jul-12 10455
89 First Cherokee State Bank Woodstock GA 32711 Community & Southern Bank 20-Jul-12 10450
... ... ... ... ... ... ... ...
496 FirstCity Bank Stockbridge GA 18243 NaN 20-Mar-09 10047
497 Freedom Bank of Georgia Commerce GA 57558 Northeast Georgia Bank 6-Mar-09 10044
507 FirstBank Financial Services McDonough GA 57017 Regions Bank 6-Feb-09 10036
515 Haven Trust Bank Duluth GA 35379 Branch Banking & Trust Company, (BB&T) 12-Dec-08 10027
516 First Georgia Community Bank Jackson GA 34301 United Bank 5-Dec-08 10025
519 Community Bank Loganville GA 16490 Bank of Essex 21-Nov-08 10022
523 Alpha Bank & Trust Alpharetta GA 58241 Stearns Bank, N.A. 24-Oct-08 10018
529 Integrity Bank Alpharetta GA 35469 Regions Bank 29-Aug-08 10012
540 NetBank Alpharetta GA 32575 ING DIRECT 28-Sep-07 10001
551 AmTrade International Bank Atlanta GA 33784 NaN 30-Sep-02 4657

93 rows × 7 columns

Out of the rows that fulfill the filtering condition, we can further specify which columns to be returned.

# Specify a single column to be returned
banks_df.loc[filt, 'Bank Name']
20          The Woodbury Banking Company
25                   The Bank of Georgia
29     Capitol City Bank & Trust Company
37              Eastside Commercial Bank
62                          Sunrise Bank
64                   Douglas County Bank
70                         Frontier Bank
75               Hometown Community Bank
86                Jasper Banking Company
89             First Cherokee State Bank
                     ...                
496                       FirstCity Bank
497              Freedom Bank of Georgia
507         FirstBank Financial Services
515                     Haven Trust Bank
516         First Georgia Community Bank
519                       Community Bank
523                   Alpha Bank & Trust
529                       Integrity Bank
540                              NetBank
551           AmTrade International Bank
Name: Bank Name, Length: 93, dtype: object

Of course, we can select muliple columns to be returned out of the filtered rows.

# Specify multiple columns to be returned
banks_df.loc[filt, ['Bank Name', 'Fund']]
Bank Name Fund
20 The Woodbury Banking Company 10521
25 The Bank of Georgia 10516
29 Capitol City Bank & Trust Company 10512
37 Eastside Commercial Bank 10504
62 Sunrise Bank 10481
64 Douglas County Bank 10476
70 Frontier Bank 10471
75 Hometown Community Bank 10466
86 Jasper Banking Company 10455
89 First Cherokee State Bank 10450
... ... ...
496 FirstCity Bank 10047
497 Freedom Bank of Georgia 10044
507 FirstBank Financial Services 10036
515 Haven Trust Bank 10027
516 First Georgia Community Bank 10025
519 Community Bank 10022
523 Alpha Bank & Trust 10018
529 Integrity Bank 10012
540 NetBank 10001
551 AmTrade International Bank 4657

93 rows × 2 columns

Now suppose you want to get all the failed banks whose name contains the word ‘community’.

# Get all the banks with the word 'community' in their name
filt = (banks_df['Bank Name'].str.contains('Community'))
banks_df.loc[filt, ['Bank Name']]
Bank Name
9 Louisa Community Bank
18 Harvest Community Bank
30 Highland Community Bank
50 Texas Community Bank, National Association
53 The Community's Bank
55 Community South Bank
57 First Community Bank of Southwest Florida (als...
63 Pisgah Community Bank
66 Chipola Community Bank
73 Westside Community Bank
... ...
442 Community First Bank
443 Community National Bank of Sarasota County
447 Peoples Community Bank
472 Neighborhood Community Bank
473 Community Bank of West Georgia
476 Southern Community Bank
483 Citizens Community Bank
499 Heritage Community Bank
516 First Georgia Community Bank
519 Community Bank

63 rows × 1 columns

Conjunction of multiple filtering conditions: &#

Oftentimes, you would want to filter a dataframe based on more complex conditions. For example, suppose you would like to get the banks in GA that were closed between 2008 and 2010. How do you use df.loc[ ] to achieve it?

The location of the failed banks is stored in the State column. The closing year of the banks is stored in the Closing Date column.

# Create the first filtering condition restricting the state
filt1 = (banks_df['State'] == 'GA')

How to get the closing year of the banks?

# Get the data type of the 'Closing Date' column
banks_df['Closing Date'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 566 entries, 0 to 565
Series name: Closing Date
Non-Null Count  Dtype 
--------------  ----- 
566 non-null    object
dtypes: object(1)
memory usage: 4.6+ KB
# Create a new column storing the closing year of the banks
banks_df['Closing Year'] = banks_df['Closing Date'].str[-2:].astype(int) + 2000
banks_df
Bank Name City State Cert Acquiring Institution Closing Date Fund Closing Year
0 First Republic Bank San Francisco CA 59017 JPMorgan Chase Bank, N.A. 1-May-23 10543 2023
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23 10540 2023
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23 10539 2023
3 Almena State Bank Almena KS 15426 Equity Bank 23-Oct-20 10538 2020
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20 10537 2020
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20 10536 2020
6 Ericson State Bank Ericson NE 18265 Farmers and Merchants Bank 14-Feb-20 10535 2020
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19 10534 2019
8 Resolute Bank Maumee OH 58317 Buckeye State Bank 25-Oct-19 10533 2019
9 Louisa Community Bank Louisa KY 58112 Kentucky Farmers Bank Corporation 25-Oct-19 10532 2019
... ... ... ... ... ... ... ... ...
556 NextBank, NA Phoenix AZ 22314 NaN 7-Feb-02 4653 2002
557 Oakwood Deposit Bank Co. Oakwood OH 8966 The State Bank & Trust Company 1-Feb-02 4652 2002
558 Bank of Sierra Blanca Sierra Blanca TX 22002 The Security State Bank of Pecos 18-Jan-02 4651 2002
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02 4650 2002
560 Sinclair National Bank Gravette AR 34248 Delta Trust & Bank 7-Sep-01 4649 2001
561 Superior Bank, FSB Hinsdale IL 32646 Superior Federal, FSB 27-Jul-01 6004 2001
562 Malta National Bank Malta OH 6629 North Valley Bank 3-May-01 4648 2001
563 First Alliance Bank & Trust Co. Manchester NH 34264 Southern New Hampshire Bank & Trust 2-Feb-01 4647 2001
564 National State Bank of Metropolis Metropolis IL 3815 Banterra Bank of Marion 14-Dec-00 4646 2000
565 Bank of Honolulu Honolulu HI 21029 Bank of the Orient 13-Oct-00 4645 2000

566 rows × 8 columns

# Create the second filtering condition restricting the closing year
filt2 = (banks_df['Closing Year'] > 2007) & (banks_df['Closing Year'] < 2011)

With the two filtering conditions, we are ready to extract the banks in GA that failed between 2008 and 2010.

# Use filt1 and filt2 to get the target rows
banks_df.loc[filt1 & filt2]
Bank Name City State Cert Acquiring Institution Closing Date Fund Closing Year
219 United Americas Bank, N.A. Atlanta GA 35065 State Bank and Trust Company 17-Dec-10 10323 2010
220 Appalachian Community Bank, FSB McCaysville GA 58495 Peoples Bank of East Tennessee 17-Dec-10 10319 2010
221 Chestatee State Bank Dawsonville GA 34578 Bank of the Ozarks 17-Dec-10 10320 2010
229 Darby Bank & Trust Co. Vidalia GA 14580 Ameris Bank 12-Nov-10 10312 2010
230 Tifton Banking Company Tifton GA 57831 Ameris Bank 12-Nov-10 10313 2010
238 The First National Bank of Barnesville Barnesville GA 2119 United Bank 22-Oct-10 10304 2010
239 The Gordon Bank Gordon GA 33904 Morris Bank 22-Oct-10 10305 2010
251 The Peoples Bank Winder GA 182 Community & Southern Bank 17-Sep-10 10292 2010
252 First Commerce Community Bank Douglasville GA 57448 Community & Southern Bank 17-Sep-10 10289 2010
253 Bank of Ellijay Ellijay GA 58197 Community & Southern Bank 17-Sep-10 10287 2010
270 Northwest Bank & Trust Acworth GA 57658 State Bank and Trust Company 30-Jul-10 10274 2010
276 Crescent Bank and Trust Company Jasper GA 27559 Renasant Bank 23-Jul-10 10265 2010
289 First National Bank Savannah GA 34152 The Savannah Bank, N.A. 25-Jun-10 10251 2010
305 Satilla Community Bank Saint Marys GA 35114 Ameris Bank 14-May-10 10238 2010
334 Unity National Bank Cartersville GA 34678 Bank of the Ozarks 26-Mar-10 10208 2010
336 McIntosh Commercial Bank Carrollton GA 57399 CharterBank 26-Mar-10 10207 2010
339 Bank of Hiawassee Hiawassee GA 10054 Citizens South Bank 19-Mar-10 10202 2010
340 Appalachian Community Bank Ellijay GA 33989 Community & Southern Bank 19-Mar-10 10199 2010
342 Century Security Bank Duluth GA 58104 Bank of Upson 19-Mar-10 10198 2010
361 Community Bank and Trust Cornelia GA 5702 SCBT National Association 29-Jan-10 10180 2010
364 First National Bank of Georgia Carrollton GA 16480 Community & Southern Bank 29-Jan-10 10179 2010
380 RockBridge Commercial Bank Atlanta GA 58315 NaN 18-Dec-09 10164 2009
387 The Tattnall Bank Reidsville GA 12080 Heritage Bank of the South 4-Dec-09 10153 2009
388 First Security National Bank Norcross GA 26290 State Bank and Trust Company 4-Dec-09 10157 2009
389 The Buckhead Community Bank Atlanta GA 34663 State Bank and Trust Company 4-Dec-09 10152 2009
398 United Security Bank Sparta GA 22286 Ameris Bank 6-Nov-09 10145 2009
413 American United Bank Lawrenceville GA 57794 Ameris Bank 23-Oct-09 10127 2009
419 Georgian Bank Atlanta GA 57151 First Citizens Bank and Trust Company, Inc. 25-Sep-09 10122 2009
435 First Coweta Bank Newnan GA 57702 United Bank 21-Aug-09 10108 2009
436 ebank Atlanta GA 34682 Stearns Bank, N.A. 21-Aug-09 10107 2009
450 Security Bank of Jones County Gray GA 8486 State Bank and Trust Company 24-Jul-09 10088 2009
451 Security Bank of Houston County Perry GA 27048 State Bank and Trust Company 24-Jul-09 10087 2009
452 Security Bank of Bibb County Macon GA 27367 State Bank and Trust Company 24-Jul-09 10085 2009
453 Security Bank of North Metro Woodstock GA 57105 State Bank and Trust Company 24-Jul-09 10090 2009
454 Security Bank of North Fulton Alpharetta GA 57430 State Bank and Trust Company 24-Jul-09 10089 2009
455 Security Bank of Gwinnett County Suwanee GA 57346 State Bank and Trust Company 24-Jul-09 10086 2009
460 First Piedmont Bank Winder GA 34594 First American Bank and Trust Company 17-Jul-09 10084 2009
472 Neighborhood Community Bank Newnan GA 35285 CharterBank 26-Jun-09 10069 2009
473 Community Bank of West Georgia Villa Rica GA 57436 NaN 26-Jun-09 10068 2009
476 Southern Community Bank Fayetteville GA 35251 United Community Bank 19-Jun-09 10067 2009
484 Silverton Bank, NA Atlanta GA 26535 NaN 1-May-09 10059 2009
488 American Southern Bank Kennesaw GA 57943 Bank of North Georgia 24-Apr-09 10053 2009
493 Omni National Bank Atlanta GA 22238 NaN 27-Mar-09 10048 2009
496 FirstCity Bank Stockbridge GA 18243 NaN 20-Mar-09 10047 2009
497 Freedom Bank of Georgia Commerce GA 57558 Northeast Georgia Bank 6-Mar-09 10044 2009
507 FirstBank Financial Services McDonough GA 57017 Regions Bank 6-Feb-09 10036 2009
515 Haven Trust Bank Duluth GA 35379 Branch Banking & Trust Company, (BB&T) 12-Dec-08 10027 2008
516 First Georgia Community Bank Jackson GA 34301 United Bank 5-Dec-08 10025 2008
519 Community Bank Loganville GA 16490 Bank of Essex 21-Nov-08 10022 2008
523 Alpha Bank & Trust Alpharetta GA 58241 Stearns Bank, N.A. 24-Oct-08 10018 2008
529 Integrity Bank Alpharetta GA 35469 Regions Bank 29-Aug-08 10012 2008

Note that when we extract rows that fulfill multiple conditions, we use & in Pandas, not and. If you replace & with and, you will get an error. This is different than what we have learned about boolean operators in Python basics 2. In Python, we use and, or and not. In Pandas, we use &, | and ~ intead.

Pandas Operator

Boolean

Requires

&

and

All required to True

|

or

If any are True

~

not

The opposite

Although we use different symbols for these boolean operators, the truth table for them stays the same. For a quick review of the truth table, see Python basics 2.

Disjunction of multiple filtering conditions: |#

Suppose you would like to take a look at all the failed banks in the state of Georgia or the state of New York. How do you use df.loc[] to get the target rows?

# Create the two filtering conditions restricting the state to GA and NY
filt1 = (banks_df['State'] == 'GA')
filt2 = (banks_df['State'] == 'NY')
# Use filt1 and filt2 to get the target rows
banks_df.loc[filt1|filt2]
Bank Name City State Cert Acquiring Institution Closing Date Fund Closing Year
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23 10540 2023
20 The Woodbury Banking Company Woodbury GA 11297 United Bank 19-Aug-16 10521 2016
25 The Bank of Georgia Peachtree City GA 35259 Fidelity Bank 2-Oct-15 10516 2015
29 Capitol City Bank & Trust Company Atlanta GA 33938 First-Citizens Bank & Trust Company 13-Feb-15 10512 2015
37 Eastside Commercial Bank Conyers GA 58125 Community & Southern Bank 18-Jul-14 10504 2014
62 Sunrise Bank Valdosta GA 58185 Synovus Bank 10-May-13 10481 2013
64 Douglas County Bank Douglasville GA 21649 Hamilton State Bank 26-Apr-13 10476 2013
70 Frontier Bank LaGrange GA 16431 HeritageBank of the South 8-Mar-13 10471 2013
75 Hometown Community Bank Braselton GA 57928 CertusBank, National Association 16-Nov-12 10466 2012
86 Jasper Banking Company Jasper GA 16240 Stearns Bank N.A. 27-Jul-12 10455 2012
... ... ... ... ... ... ... ... ...
497 Freedom Bank of Georgia Commerce GA 57558 Northeast Georgia Bank 6-Mar-09 10044 2009
507 FirstBank Financial Services McDonough GA 57017 Regions Bank 6-Feb-09 10036 2009
515 Haven Trust Bank Duluth GA 35379 Branch Banking & Trust Company, (BB&T) 12-Dec-08 10027 2008
516 First Georgia Community Bank Jackson GA 34301 United Bank 5-Dec-08 10025 2008
519 Community Bank Loganville GA 16490 Bank of Essex 21-Nov-08 10022 2008
523 Alpha Bank & Trust Alpharetta GA 58241 Stearns Bank, N.A. 24-Oct-08 10018 2008
529 Integrity Bank Alpharetta GA 35469 Regions Bank 29-Aug-08 10012 2008
540 NetBank Alpharetta GA 32575 ING DIRECT 28-Sep-07 10001 2007
543 Reliance Bank White Plains NY 26778 Union State Bank 19-Mar-04 4664 2004
551 AmTrade International Bank Atlanta GA 33784 NaN 30-Sep-02 4657 2002

99 rows × 8 columns

If you would like to get the data of the failed banks in the following six states — Georgia, New York, New Jersey, Florida, California and West Virginia, you will not want to write six filtering conditions and use the vertical bar | to connect all of them. That would be too repetitive. In this case, we can use the .isin() method to create a filtering condition.

# Create a list of the states
states = ['GA', 'NY', 'NJ', 'FL', 'CA', 'WV']
# Create a filtering condition
filt = (banks_df['State'].isin(states))
# Use filt to find all failed banks in the six states
banks_df.loc[filt]
Bank Name City State Cert Acquiring Institution Closing Date Fund Closing Year
0 First Republic Bank San Francisco CA 59017 JPMorgan Chase Bank, N.A. 1-May-23 10543 2023
1 Signature Bank New York NY 57053 Flagstar Bank, N.A. 12-Mar-23 10540 2023
2 Silicon Valley Bank Santa Clara CA 24735 First Citizens Bank & Trust Company 10-Mar-23 10539 2023
4 First City Bank of Florida Fort Walton Beach FL 16748 United Fidelity Bank, fsb 16-Oct-20 10537 2020
5 The First State Bank Barboursville WV 14361 MVB Bank, Inc. 3-Apr-20 10536 2020
7 City National Bank of New Jersey Newark NJ 21111 Industrial Bank 1-Nov-19 10534 2019
18 Harvest Community Bank Pennsville NJ 34951 First-Citizens Bank & Trust Company 13-Jan-17 10523 2017
20 The Woodbury Banking Company Woodbury GA 11297 United Bank 19-Aug-16 10521 2016
25 The Bank of Georgia Peachtree City GA 35259 Fidelity Bank 2-Oct-15 10516 2015
29 Capitol City Bank & Trust Company Atlanta GA 33938 First-Citizens Bank & Trust Company 13-Feb-15 10512 2015
... ... ... ... ... ... ... ... ...
532 First Heritage Bank, NA Newport Beach CA 57961 Mutual of Omaha Bank 25-Jul-08 10009 2008
534 IndyMac Bank Pasadena CA 29730 OneWest Bank, FSB 11-Jul-08 10007 2008
540 NetBank Alpharetta GA 32575 ING DIRECT 28-Sep-07 10001 2007
543 Reliance Bank White Plains NY 26778 Union State Bank 19-Mar-04 4664 2004
544 Guaranty National Bank of Tallahassee Tallahassee FL 26838 Hancock Bank of Florida 12-Mar-04 4663 2004
545 Dollar Savings Bank Newark NJ 31330 NaN 14-Feb-04 6006 2004
548 Southern Pacific Bank Torrance CA 27094 Beal Bank 7-Feb-03 4660 2003
551 AmTrade International Bank Atlanta GA 33784 NaN 30-Sep-02 4657 2002
555 Net 1st National Bank Boca Raton FL 26652 Bank Leumi USA 1-Mar-02 4654 2002
559 Hamilton Bank, NA Miami FL 24382 Israel Discount Bank of New York 11-Jan-02 4650 2002

228 rows × 8 columns

Negation of a certain condition:~#

Now, suppose you would like to get all the failed banks that were not closed in 2008. How do you do it?

# Create the filtering condition restricting the closing year to non-2008
filt = (~(banks_df['Closing Year'] == 2008))
# Use the filtering condition to get the target rows with specified columns
banks_df.loc[filt, ['Bank Name', 'City']]
Bank Name City
0 First Republic Bank San Francisco
1 Signature Bank New York
2 Silicon Valley Bank Santa Clara
3 Almena State Bank Almena
4 First City Bank of Florida Fort Walton Beach
5 The First State Bank Barboursville
6 Ericson State Bank Ericson
7 City National Bank of New Jersey Newark
8 Resolute Bank Maumee
9 Louisa Community Bank Louisa
... ... ...
556 NextBank, NA Phoenix
557 Oakwood Deposit Bank Co. Oakwood
558 Bank of Sierra Blanca Sierra Blanca
559 Hamilton Bank, NA Miami
560 Sinclair National Bank Gravette
561 Superior Bank, FSB Hinsdale
562 Malta National Bank Malta
563 First Alliance Bank & Trust Co. Manchester
564 National State Bank of Metropolis Metropolis
565 Bank of Honolulu Honolulu

541 rows × 2 columns

Coding Challenge! < / >

Let’s do some filtering!

From the Shakespeare dataframe, get the title and the creator of the documents published between 2000 and 2010.

# get the title and creator of docs published between 2000 and 2010

From the Shakespeare dataframe, get the creator of the documents shorter than 10 pages or longer than 50 pages.

# get the creator of the docs shorter than 10 pages or longer than 50 pages

From the Shakespeare dataframe, get the title of the documents whose publisher is not Folger Shakespeare Library.

# get the title of the docs whose publisher is NOT Folger Shakespeare Library

Index a dataframe#

In this section, we’ll continue to work with the dataframe we created in Pandas 1 storing data on the most recent 10 World Cup games.

# Create a dataframe with world cup data
wcup = pd.DataFrame({"Year": [2022, 
                              2018, 
                              2014, 
                              2010, 
                              2006, 
                              2002, 
                              1998, 
                              1994, 
                              1990,
                              1986], 
                     "Champion": ["Argentina", 
                                  "France", 
                                  "Germany", 
                                  "Spain", 
                                  "Italy", 
                                  "Brazil", 
                                  "France", 
                                  "Brazil", 
                                  "Germany", 
                                  "Argentina"], 
                     "Host": ["Qatar", 
                              "Russia", 
                              "Brazil", 
                              "South Africa", 
                              "Germany", 
                              "Korea/Japan", 
                              "France", 
                              "USA", 
                              "Italy", 
                              "Mexico"],
                     "Score": ["7-5", 
                               "4-2", 
                               "1-0", 
                               "1-0", 
                               "6-4", 
                               "2-0", 
                               "3-0", 
                               "3-2", 
                               "1-0", 
                               "3-2"]
                    })
wcup['Goals Scored'] = wcup['Score'].str[0]
wcup['Goals Conceded'] = wcup['Score'].str[-1]
wcup['Difference'] = wcup['Goals Scored'].astype(int) - wcup['Goals Conceded'].astype(int)
wcup
Year Champion Host Score Goals Scored Goals Conceded Difference
0 2022 Argentina Qatar 7-5 7 5 2
1 2018 France Russia 4-2 4 2 2
2 2014 Germany Brazil 1-0 1 0 1
3 2010 Spain South Africa 1-0 1 0 1
4 2006 Italy Germany 6-4 6 4 2
5 2002 Brazil Korea/Japan 2-0 2 0 2
6 1998 France France 3-0 3 0 3
7 1994 Brazil USA 3-2 3 2 1
8 1990 Germany Italy 1-0 1 0 1
9 1986 Argentina Mexico 3-2 3 2 1

Set, reset and use indexes#

We have seen that by default, the rows in a dataframe are numbered by integer indexes starting from 0. The indexes look like a column to the far left without a name.

We can set the index column to one of the columns in the dataframe. This is desirable because a range of integers is not descriptive but a column with a name is descriptive. When we want to locate specific data, descriptive labels are much more useful.

# set the index column to 'Year'
wcup.set_index('Year')
Champion Host Score Goals Scored Goals Conceded Difference
Year
2022 Argentina Qatar 7-5 7 5 2
2018 France Russia 4-2 4 2 2
2014 Germany Brazil 1-0 1 0 1
2010 Spain South Africa 1-0 1 0 1
2006 Italy Germany 6-4 6 4 2
2002 Brazil Korea/Japan 2-0 2 0 2
1998 France France 3-0 3 0 3
1994 Brazil USA 3-2 3 2 1
1990 Germany Italy 1-0 1 0 1
1986 Argentina Mexico 3-2 3 2 1

Take a look at the original dataframe, is it changed?

# Take a look at the original dataframe
wcup
Year Champion Host Score Goals Scored Goals Conceded Difference
0 2022 Argentina Qatar 7-5 7 5 2
1 2018 France Russia 4-2 4 2 2
2 2014 Germany Brazil 1-0 1 0 1
3 2010 Spain South Africa 1-0 1 0 1
4 2006 Italy Germany 6-4 6 4 2
5 2002 Brazil Korea/Japan 2-0 2 0 2
6 1998 France France 3-0 3 0 3
7 1994 Brazil USA 3-2 3 2 1
8 1990 Germany Italy 1-0 1 0 1
9 1986 Argentina Mexico 3-2 3 2 1

The original dataframe is NOT changed after we use the .set_index() method to change the index column. The .set_index() method returns a copy, this is why the original dataframe is not affected.

If you want to make the change permanent, you can assign the returned object to the variable where you store the original dataframe to update it.

# Change the index column and commit the change
wcup = wcup.set_index('Year')
wcup
Champion Host Score Goals Scored Goals Conceded Difference
Year
2022 Argentina Qatar 7-5 7 5 2
2018 France Russia 4-2 4 2 2
2014 Germany Brazil 1-0 1 0 1
2010 Spain South Africa 1-0 1 0 1
2006 Italy Germany 6-4 6 4 2
2002 Brazil Korea/Japan 2-0 2 0 2
1998 France France 3-0 3 0 3
1994 Brazil USA 3-2 3 2 1
1990 Germany Italy 1-0 1 0 1
1986 Argentina Mexico 3-2 3 2 1

You can change the index column back to the default integer index column using the reset_index() method.

# reset the indexes to the integer indexes
wcup.reset_index()
Year Champion Host Score Goals Scored Goals Conceded Difference
0 2022 Argentina Qatar 7-5 7 5 2
1 2018 France Russia 4-2 4 2 2
2 2014 Germany Brazil 1-0 1 0 1
3 2010 Spain South Africa 1-0 1 0 1
4 2006 Italy Germany 6-4 6 4 2
5 2002 Brazil Korea/Japan 2-0 2 0 2
6 1998 France France 3-0 3 0 3
7 1994 Brazil USA 3-2 3 2 1
8 1990 Germany Italy 1-0 1 0 1
9 1986 Argentina Mexico 3-2 3 2 1

Sort a dataframe#

A common use of the index column is to sort a dataframe. Here, we have ‘Year’, a numerical column as our index column. When we sort the indexes, by default, the dataframe will be sorted by the index column in an ascending order.

# Sort the dataframe by the index column
wcup.sort_index()
Champion Host Score Goals Scored Goals Conceded Difference
Year
1986 Argentina Mexico 3-2 3 2 1
1990 Germany Italy 1-0 1 0 1
1994 Brazil USA 3-2 3 2 1
1998 France France 3-0 3 0 3
2002 Brazil Korea/Japan 2-0 2 0 2
2006 Italy Germany 6-4 6 4 2
2010 Spain South Africa 1-0 1 0 1
2014 Germany Brazil 1-0 1 0 1
2018 France Russia 4-2 4 2 2
2022 Argentina Qatar 7-5 7 5 2

You could set the parameter ascending=False to sort the indexes in a descending order.

# Specify the ascending order
wcup.sort_index(ascending=False)
Champion Host Score Goals Scored Goals Conceded Difference
Year
2022 Argentina Qatar 7-5 7 5 2
2018 France Russia 4-2 4 2 2
2014 Germany Brazil 1-0 1 0 1
2010 Spain South Africa 1-0 1 0 1
2006 Italy Germany 6-4 6 4 2
2002 Brazil Korea/Japan 2-0 2 0 2
1998 France France 3-0 3 0 3
1994 Brazil USA 3-2 3 2 1
1990 Germany Italy 1-0 1 0 1
1986 Argentina Mexico 3-2 3 2 1

Sort by one column#

We can sort the entire dataframe by a column other than the index column. The .sort_values() method helps us do it.

# Sort the dataframe by the column 'Goals Scored'
wcup.sort_values(by=['Goals Scored'])
Champion Host Score Goals Scored Goals Conceded Difference
Year
2014 Germany Brazil 1-0 1 0 1
2010 Spain South Africa 1-0 1 0 1
1990 Germany Italy 1-0 1 0 1
2002 Brazil Korea/Japan 2-0 2 0 2
1998 France France 3-0 3 0 3
1994 Brazil USA 3-2 3 2 1
1986 Argentina Mexico 3-2 3 2 1
2018 France Russia 4-2 4 2 2
2006 Italy Germany 6-4 6 4 2
2022 Argentina Qatar 7-5 7 5 2

Sort by multiple columns#

It is a convention to sort the soccer results first by difference (i.e. how many more goals the champion scored than the runner-up) and then by goals conceded (i.e. how many goals the champion lost). Pandas can easily do that.

# Sort the dataframe by Difference column in descending order 
# then by Goals Conceded column in ascending order
wcup.sort_values(by=['Difference', 'Goals Conceded'], ascending=[False, True])
Champion Host Score Goals Scored Goals Conceded Difference
Year
1998 France France 3-0 3 0 3
2002 Brazil Korea/Japan 2-0 2 0 2
2018 France Russia 4-2 4 2 2
2006 Italy Germany 6-4 6 4 2
2022 Argentina Qatar 7-5 7 5 2
2014 Germany Brazil 1-0 1 0 1
2010 Spain South Africa 1-0 1 0 1
1990 Germany Italy 1-0 1 0 1
1994 Brazil USA 3-2 3 2 1
1986 Argentina Mexico 3-2 3 2 1
# Reset the index for later use
wcup = wcup.reset_index()

Hierarchical indexing#

Sorting by multiple columns helps us group the data in a certain way. For example, in the world cup dataframe, if we would like to group the data first by the champions and then by the years, we can sort the dataframe by these two columns and set the two columns as the a multi-level index of the dataframe. The Champion column will be the level 0 index and the Year column will be the level 1 index. The hierarchical indexing allows us to work with higher dimension data.

# First,sort the two columns we will use for multi-level indexing
# Then, set the index to the composite of Champion and Year
wcup = wcup.sort_values(by=['Champion', 'Year'], ascending=[1,0]).set_index(['Champion', 'Year'])
wcup
Host Score Goals Scored Goals Conceded Difference
Champion Year
Argentina 2022 Qatar 7-5 7 5 2
1986 Mexico 3-2 3 2 1
Brazil 2002 Korea/Japan 2-0 2 0 2
1994 USA 3-2 3 2 1
France 2018 Russia 4-2 4 2 2
1998 France 3-0 3 0 3
Germany 2014 Brazil 1-0 1 0 1
1990 Italy 1-0 1 0 1
Italy 2006 Germany 6-4 6 4 2
Spain 2010 South Africa 1-0 1 0 1

If a dataframe has a multi-level index, to access a certain row, we will need to provide a multi-level index in order to access it.

# Get the data on the 1986 world cup game won by Argentina
wcup.loc[('Argentina', 1986)]
Host              Mexico
Score                3-2
Goals Scored           3
Goals Conceded         2
Difference             1
Name: (Argentina, 1986), dtype: object

Coding Challenge! < / >

In this coding challenge, we’ll use the Shakespeare dataframe. Use what you have learned about sorting and indexing a dataframe to answer the following questions.

# Take a look at the Shakespeare df
shake_df

Which doc(s) are the longest in this dataset?

# Get the longest doc(s) in the df

Among the docs published in 1983, how many were published by Folger Shakespeare Library?

# Among the docs published in 1983, how many were published by Folger Shakespeare Library

Lesson Complete#

Congratulations! You have completed Pandas Basics 2.

Start Next Lesson: Pandas 3 ->#

Exercise Solutions#

Here are a few solutions for exercises in this lesson.

# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
# retrieve the metadata
import constellate
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"
metadata = constellate.get_metadata(dataset_id)

# Create a dataframe
shake_df = pd.read_csv(metadata)

# Explore the dataframe
shake_df.info()

# Drop the columns of doi and the placeofPublication, make the change permanent
shake_df = shake_df.drop(columns=['doi', 'placeOfPublication'])

# Drop the rows with missing values in 'pageStart' or 'pageEnd'
shake_df = shake_df.dropna(subset=['pageStart', 'pageEnd'])

shake_df
# get the title and the creator of the documents published between 2000 and 2010
filt = (shake_df['publicationYear']>1999) & (shake_df['publicationYear']<2011)
shake_df.loc[filt, ['title', 'creator']]
# get the creator of the documents shorter than 10 pages or longer than 50 pages
filt = (shake_df['pageCount']<10)|(shake_df['pageCount']>50)
shake_df.loc[filt, 'creator']
# get the title of the documents whose publisher is not Folger Shakespeare Library
filt = (shake_df['publisher']=='Folger Shakespeare Library')
shake_df.loc[~filt, 'title']
# get the longest doc(s) in the dataset
shake_df.loc[:, ['title', 'pageCount']].sort_values(by='pageCount', ascending=False)
# get the number of docs published by Folger Shakespeare Library in 1983
shake_df.set_index(['publicationYear', 'publisher']).sort_index().loc[(1983, 'Folger Shakespeare Library')].shape[0]