This notebook is adapted by Zhuo Chen from the notebooks created by Nathan Kelber, William Mattingly and Melanie Walsh under Creative Commons CC BY License
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org.
Pandas Basics 2#
Description: This notebook describes how to:
Filter data in a dataframe
Work with missing values in a dataframe
Index a dataframe
Sort a dataframe
This is the second notebook in a series on learning to use Pandas.
Use Case: For Learners (Detailed explanation, not ideal for researchers)
Difficulty: Beginner
Knowledge Required:
Python Basics (Start Python Basics I)
Knowledge Recommended:
Completion Time: 90 minutes
Data Format: CSV (.csv)
Libraries Used: Pandas
Research Pipeline: None
# Import pandas library, `as pd` allows us to shorten typing `pandas` to `pd` when we call pandas
import pandas as pd
# Change the display setting
pd.set_option('display.min_rows', 20) # set the minimum number of rows to display to 20
Filter data in a dataframe#
A common pipeline in data processing in Pandas is that you create a dataframe from a file and then reduce the dataframe only to the rows and columns that you are interested in.
We have learned how to use .loc
and .iloc
to select part of a dataframe in Pandas Basics 1. We will learn more ways to do data selection in this section.
# download the sample file
import urllib
from pathlib import Path
# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/Pandas1_failed_banks_since_2000.csv'
file = '../data/failed_banks.csv'
urllib.request.urlretrieve(url, file)
print('Sample file ready.')
# Read in the data
banks_df = pd.read_csv(file)
banks_df
Sample file ready.
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | |
---|---|---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | 59017 | JPMorgan Chase Bank, N.A. | 1-May-23 | 10543 |
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 | 10540 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 | 10539 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 | 10538 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 | 10537 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 | 10536 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 | 10535 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 | 10534 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 | 10533 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 | 10532 |
... | ... | ... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | 22314 | NaN | 7-Feb-02 | 4653 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 | 4652 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 | 4651 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 | 4650 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 | 4649 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 | 6004 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 | 4648 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 | 4647 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 | 4646 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 | 4645 |
566 rows × 7 columns
The drop() method#
After creating a dataframe from a file, oftentimes we will drop certain rows or columns because we will not use them in the analysis anyways. In this case, we will use the .drop()
method to remove those rows or columns.
We can specify which column(s) to drop by using the ‘columns’ parameter.
# Drop a column by setting the columns parameter
banks_df.drop(columns='Fund')
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | |
---|---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | 59017 | JPMorgan Chase Bank, N.A. | 1-May-23 |
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 |
... | ... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | 22314 | NaN | 7-Feb-02 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 |
566 rows × 6 columns
We can also drop multiple columns at one time. In this case, we just put the names for the columns we want to drop in a list and set the value of the columns
parameter to the list.
# Drop multiple columns by setting the columns parameter
banks_df.drop(columns=['Fund', 'Cert'])
Bank Name | City | State | Acquiring Institution | Closing Date | |
---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | JPMorgan Chase Bank, N.A. | 1-May-23 |
1 | Signature Bank | New York | NY | Flagstar Bank, N.A. | 12-Mar-23 |
2 | Silicon Valley Bank | Santa Clara | CA | First Citizens Bank & Trust Company | 10-Mar-23 |
3 | Almena State Bank | Almena | KS | Equity Bank | 23-Oct-20 |
4 | First City Bank of Florida | Fort Walton Beach | FL | United Fidelity Bank, fsb | 16-Oct-20 |
5 | The First State Bank | Barboursville | WV | MVB Bank, Inc. | 3-Apr-20 |
6 | Ericson State Bank | Ericson | NE | Farmers and Merchants Bank | 14-Feb-20 |
7 | City National Bank of New Jersey | Newark | NJ | Industrial Bank | 1-Nov-19 |
8 | Resolute Bank | Maumee | OH | Buckeye State Bank | 25-Oct-19 |
9 | Louisa Community Bank | Louisa | KY | Kentucky Farmers Bank Corporation | 25-Oct-19 |
... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | NaN | 7-Feb-02 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | The State Bank & Trust Company | 1-Feb-02 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | The Security State Bank of Pecos | 18-Jan-02 |
559 | Hamilton Bank, NA | Miami | FL | Israel Discount Bank of New York | 11-Jan-02 |
560 | Sinclair National Bank | Gravette | AR | Delta Trust & Bank | 7-Sep-01 |
561 | Superior Bank, FSB | Hinsdale | IL | Superior Federal, FSB | 27-Jul-01 |
562 | Malta National Bank | Malta | OH | North Valley Bank | 3-May-01 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | Southern New Hampshire Bank & Trust | 2-Feb-01 |
564 | National State Bank of Metropolis | Metropolis | IL | Banterra Bank of Marion | 14-Dec-00 |
565 | Bank of Honolulu | Honolulu | HI | Bank of the Orient | 13-Oct-00 |
566 rows × 5 columns
Another way to drop a column is to give the label of the column you want to drop and then set the axis parameter to 1.
# Drop a column by setting the axis parameter
banks_df.drop('Fund', axis=1)
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | |
---|---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | 59017 | JPMorgan Chase Bank, N.A. | 1-May-23 |
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 |
... | ... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | 22314 | NaN | 7-Feb-02 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 |
566 rows × 6 columns
It might not be intuitive to you that axis 1 refers to the columns. Luckily, Pandas also allows us to set the axis parameter to columns
when dropping columns.
# Drop a column
banks_df.drop('Fund', axis='columns')
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | |
---|---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | 59017 | JPMorgan Chase Bank, N.A. | 1-May-23 |
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 |
... | ... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | 22314 | NaN | 7-Feb-02 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 |
566 rows × 6 columns
You can also drop multiple columns by setting the axis parameter.
# Drop multiple columns by setting the axis parameter
banks_df.drop(['Fund', 'Cert'], axis=1)
Bank Name | City | State | Acquiring Institution | Closing Date | |
---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | JPMorgan Chase Bank, N.A. | 1-May-23 |
1 | Signature Bank | New York | NY | Flagstar Bank, N.A. | 12-Mar-23 |
2 | Silicon Valley Bank | Santa Clara | CA | First Citizens Bank & Trust Company | 10-Mar-23 |
3 | Almena State Bank | Almena | KS | Equity Bank | 23-Oct-20 |
4 | First City Bank of Florida | Fort Walton Beach | FL | United Fidelity Bank, fsb | 16-Oct-20 |
5 | The First State Bank | Barboursville | WV | MVB Bank, Inc. | 3-Apr-20 |
6 | Ericson State Bank | Ericson | NE | Farmers and Merchants Bank | 14-Feb-20 |
7 | City National Bank of New Jersey | Newark | NJ | Industrial Bank | 1-Nov-19 |
8 | Resolute Bank | Maumee | OH | Buckeye State Bank | 25-Oct-19 |
9 | Louisa Community Bank | Louisa | KY | Kentucky Farmers Bank Corporation | 25-Oct-19 |
... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | NaN | 7-Feb-02 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | The State Bank & Trust Company | 1-Feb-02 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | The Security State Bank of Pecos | 18-Jan-02 |
559 | Hamilton Bank, NA | Miami | FL | Israel Discount Bank of New York | 11-Jan-02 |
560 | Sinclair National Bank | Gravette | AR | Delta Trust & Bank | 7-Sep-01 |
561 | Superior Bank, FSB | Hinsdale | IL | Superior Federal, FSB | 27-Jul-01 |
562 | Malta National Bank | Malta | OH | North Valley Bank | 3-May-01 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | Southern New Hampshire Bank & Trust | 2-Feb-01 |
564 | National State Bank of Metropolis | Metropolis | IL | Banterra Bank of Marion | 14-Dec-00 |
565 | Bank of Honolulu | Honolulu | HI | Bank of the Orient | 13-Oct-00 |
566 rows × 5 columns
To drop a row, you just pass the index number of the row to the .drop()
method.
# Drop a row
banks_df.drop(0)
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | |
---|---|---|---|---|---|---|---|
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 | 10540 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 | 10539 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 | 10538 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 | 10537 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 | 10536 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 | 10535 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 | 10534 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 | 10533 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 | 10532 |
10 | The Enloe State Bank | Cooper | TX | 10716 | Legend Bank, N. A. | 31-May-19 | 10531 |
... | ... | ... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | 22314 | NaN | 7-Feb-02 | 4653 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 | 4652 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 | 4651 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 | 4650 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 | 4649 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 | 6004 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 | 4648 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 | 4647 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 | 4646 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 | 4645 |
565 rows × 7 columns
If you assign the index number of the row to drop to the parameter index
, you will get the same result.
# Drop a row by setting the index parameter
banks_df.drop(index=0)
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | |
---|---|---|---|---|---|---|---|
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 | 10540 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 | 10539 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 | 10538 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 | 10537 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 | 10536 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 | 10535 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 | 10534 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 | 10533 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 | 10532 |
10 | The Enloe State Bank | Cooper | TX | 10716 | Legend Bank, N. A. | 31-May-19 | 10531 |
... | ... | ... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | 22314 | NaN | 7-Feb-02 | 4653 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 | 4652 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 | 4651 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 | 4650 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 | 4649 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 | 6004 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 | 4648 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 | 4647 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 | 4646 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 | 4645 |
565 rows × 7 columns
To drop multiple rows, we just pass a list of index numbers to the .drop()
method. Note that we don’t need to specify the axis parameter. By default, Pandas knows that the list of index numbers are for the rows.
# Drop multiple rows
banks_df.drop([0,1])
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | |
---|---|---|---|---|---|---|---|
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 | 10539 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 | 10538 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 | 10537 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 | 10536 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 | 10535 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 | 10534 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 | 10533 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 | 10532 |
10 | The Enloe State Bank | Cooper | TX | 10716 | Legend Bank, N. A. | 31-May-19 | 10531 |
11 | Washington Federal Bank for Savings | Chicago | IL | 30570 | Royal Savings Bank | 15-Dec-17 | 10530 |
... | ... | ... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | 22314 | NaN | 7-Feb-02 | 4653 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 | 4652 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 | 4651 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 | 4650 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 | 4649 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 | 6004 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 | 4648 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 | 4647 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 | 4646 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 | 4645 |
564 rows × 7 columns
Of course, you could specify the axis parameter to 0 or ‘rows’ to tell Pandas that you want to drop the rows with the given index numbers.
# Drop rows by setting the axis parameter
banks_df.drop([0,1], axis=0)
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | |
---|---|---|---|---|---|---|---|
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 | 10539 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 | 10538 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 | 10537 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 | 10536 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 | 10535 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 | 10534 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 | 10533 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 | 10532 |
10 | The Enloe State Bank | Cooper | TX | 10716 | Legend Bank, N. A. | 31-May-19 | 10531 |
11 | Washington Federal Bank for Savings | Chicago | IL | 30570 | Royal Savings Bank | 15-Dec-17 | 10530 |
... | ... | ... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | 22314 | NaN | 7-Feb-02 | 4653 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 | 4652 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 | 4651 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 | 4650 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 | 4649 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 | 6004 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 | 4648 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 | 4647 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 | 4646 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 | 4645 |
564 rows × 7 columns
# Drop rows by setting the axis parameter
banks_df.drop([0,1], axis='rows')
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | |
---|---|---|---|---|---|---|---|
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 | 10539 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 | 10538 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 | 10537 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 | 10536 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 | 10535 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 | 10534 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 | 10533 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 | 10532 |
10 | The Enloe State Bank | Cooper | TX | 10716 | Legend Bank, N. A. | 31-May-19 | 10531 |
11 | Washington Federal Bank for Savings | Chicago | IL | 30570 | Royal Savings Bank | 15-Dec-17 | 10530 |
... | ... | ... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | 22314 | NaN | 7-Feb-02 | 4653 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 | 4652 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 | 4651 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 | 4650 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 | 4649 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 | 6004 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 | 4648 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 | 4647 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 | 4646 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 | 4645 |
564 rows × 7 columns
Note that the .drop()
method only returns a copy. This means that any change you make using the .drop()
method will not affect the original dataframe. To make the change permanent, you can assign the result to the variable where you store the original dataframe to update it.
Coding Challenge! < / >
In the exercises in this notebook, we’ll work on a dataset built from Constellate.
We’ll use the constellate
client to automatically retrieve the metadata for a dataset. We can retrieve the metadata in a CSV file using the get_metadata
method.
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
# retrieve the metadata
import constellate
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"
metadata = constellate.get_metadata(dataset_id)
# Check the type of the value stored in metadata
type(metadata)
# print out the string stored in metadata
print(metadata)
The metadata is stored in a .csv file. In the following code cell, read in the data using Pandas. Give the dataframe a name. Then print out the dataframe to take a look.
# Convert the csv file to a dataframe
shake_df = pd.read_csv(metadata)
shake_df
Use a Pandas method to explore the dataframe. How many rows does it have? How many columns does it have? What is the data type of the data in each column?
# Use a Pandas method to explore the data
When you explore the Shakespeare dataframe, what did you find about the column doi
? What did you find about the column placeOfPublication
? Drop the two columns.
# Drop the 'doi' and 'placeOfPublication' columns
Work with missing values#
It is a common case that datasets have missing values. As you may have already noticed, blank cells in a csv file show up as NaN in a Pandas DataFrame. For example, in the dataset of failed banks, the Acquiring Institution
column gives the name when a failed bank was acquired by another institution and has a value of NaN otherwise.
# Take a look at the missing values in banks_df
banks_df
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | |
---|---|---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | 59017 | JPMorgan Chase Bank, N.A. | 1-May-23 | 10543 |
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 | 10540 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 | 10539 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 | 10538 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 | 10537 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 | 10536 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 | 10535 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 | 10534 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 | 10533 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 | 10532 |
... | ... | ... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | 22314 | NaN | 7-Feb-02 | 4653 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 | 4652 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 | 4651 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 | 4650 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 | 4649 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 | 6004 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 | 4648 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 | 4647 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 | 4646 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 | 4645 |
566 rows × 7 columns
To quickly get an idea of whether a dataframe has missing values, we can always use the info()
method introduced in Pandas Basics 1 to explore the data.
# Use info() to see whether there are missing values
banks_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 566 entries, 0 to 565
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Bank Name 566 non-null object
1 City 566 non-null object
2 State 566 non-null object
3 Cert 566 non-null int64
4 Acquiring Institution 535 non-null object
5 Closing Date 566 non-null object
6 Fund 566 non-null int64
dtypes: int64(2), object(5)
memory usage: 31.1+ KB
In Pandas, we can also use the .isna()
method to check whether a pandas series or a dataframe has missing values. What this method does is it creates a boolean mask over the data.
# Use isna() to check whether a dataframe has missing values
banks_df.isna()
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | |
---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False |
5 | False | False | False | False | False | False | False |
6 | False | False | False | False | False | False | False |
7 | False | False | False | False | False | False | False |
8 | False | False | False | False | False | False | False |
9 | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... |
556 | False | False | False | False | True | False | False |
557 | False | False | False | False | False | False | False |
558 | False | False | False | False | False | False | False |
559 | False | False | False | False | False | False | False |
560 | False | False | False | False | False | False | False |
561 | False | False | False | False | False | False | False |
562 | False | False | False | False | False | False | False |
563 | False | False | False | False | False | False | False |
564 | False | False | False | False | False | False | False |
565 | False | False | False | False | False | False | False |
566 rows × 7 columns
As we can see, the cells with a non-null value are masked with the boolean value of False
. The cells with a null value are masked with the boolean value of True
.
Since isna()
can check whether a pandas series has missing values, we can use .isna()
to see whether a specific column in a dataframe has missing values.
# Use isna() to check whether a column has missing values
banks_df['Acquiring Institution'].isna()
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
...
556 True
557 False
558 False
559 False
560 False
561 False
562 False
563 False
564 False
565 False
Name: Acquiring Institution, Length: 566, dtype: bool
Drop rows and columns with missing values#
If you want to exclude the rows and columns with missing values from your data analysis, you can use the .dropna()
method to do that.
By default, the .dropna()
method drops the rows with at least one missing value.
# Use .dropna() to remove all rows with at least one missing value
banks_df.dropna() # no argument passed in
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | |
---|---|---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | 59017 | JPMorgan Chase Bank, N.A. | 1-May-23 | 10543 |
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 | 10540 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 | 10539 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 | 10538 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 | 10537 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 | 10536 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 | 10535 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 | 10534 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 | 10533 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 | 10532 |
... | ... | ... | ... | ... | ... | ... | ... |
555 | Net 1st National Bank | Boca Raton | FL | 26652 | Bank Leumi USA | 1-Mar-02 | 4654 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 | 4652 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 | 4651 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 | 4650 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 | 4649 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 | 6004 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 | 4648 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 | 4647 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 | 4646 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 | 4645 |
535 rows × 7 columns
Based on what you have learned about the .drop()
method, you now know that you can also set the axis parameter to 0 or ‘rows’ to drop the rows with missing values.
# Use .dropna() to remove all rows with at least one missing value
banks_df.dropna(axis=0) # Set the axis to 0
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | |
---|---|---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | 59017 | JPMorgan Chase Bank, N.A. | 1-May-23 | 10543 |
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 | 10540 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 | 10539 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 | 10538 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 | 10537 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 | 10536 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 | 10535 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 | 10534 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 | 10533 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 | 10532 |
... | ... | ... | ... | ... | ... | ... | ... |
555 | Net 1st National Bank | Boca Raton | FL | 26652 | Bank Leumi USA | 1-Mar-02 | 4654 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 | 4652 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 | 4651 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 | 4650 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 | 4649 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 | 6004 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 | 4648 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 | 4647 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 | 4646 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 | 4645 |
535 rows × 7 columns
# Use .dropna() to remove all rows with at least one missing value
banks_df.dropna(axis='rows') # Set the axis to 'rows'
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | |
---|---|---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | 59017 | JPMorgan Chase Bank, N.A. | 1-May-23 | 10543 |
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 | 10540 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 | 10539 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 | 10538 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 | 10537 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 | 10536 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 | 10535 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 | 10534 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 | 10533 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 | 10532 |
... | ... | ... | ... | ... | ... | ... | ... |
555 | Net 1st National Bank | Boca Raton | FL | 26652 | Bank Leumi USA | 1-Mar-02 | 4654 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 | 4652 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 | 4651 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 | 4650 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 | 4649 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 | 6004 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 | 4648 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 | 4647 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 | 4646 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 | 4645 |
535 rows × 7 columns
If you set the axis parameter to 1 or ‘columns’, you will drop the columns with missing values.
# Use .dropna() to remove all columns with at least one missing value
banks_df.dropna(axis=1) # set the axis to 1
Bank Name | City | State | Cert | Closing Date | Fund | |
---|---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | 59017 | 1-May-23 | 10543 |
1 | Signature Bank | New York | NY | 57053 | 12-Mar-23 | 10540 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | 10-Mar-23 | 10539 |
3 | Almena State Bank | Almena | KS | 15426 | 23-Oct-20 | 10538 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | 16-Oct-20 | 10537 |
5 | The First State Bank | Barboursville | WV | 14361 | 3-Apr-20 | 10536 |
6 | Ericson State Bank | Ericson | NE | 18265 | 14-Feb-20 | 10535 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | 1-Nov-19 | 10534 |
8 | Resolute Bank | Maumee | OH | 58317 | 25-Oct-19 | 10533 |
9 | Louisa Community Bank | Louisa | KY | 58112 | 25-Oct-19 | 10532 |
... | ... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | 22314 | 7-Feb-02 | 4653 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | 1-Feb-02 | 4652 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | 18-Jan-02 | 4651 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | 11-Jan-02 | 4650 |
560 | Sinclair National Bank | Gravette | AR | 34248 | 7-Sep-01 | 4649 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | 27-Jul-01 | 6004 |
562 | Malta National Bank | Malta | OH | 6629 | 3-May-01 | 4648 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | 2-Feb-01 | 4647 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | 14-Dec-00 | 4646 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | 13-Oct-00 | 4645 |
566 rows × 6 columns
# Use .dropna() to remove all columns with at least one missing value
banks_df.dropna(axis='columns') # set the axis to 'columns'
Bank Name | City | State | Cert | Closing Date | Fund | |
---|---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | 59017 | 1-May-23 | 10543 |
1 | Signature Bank | New York | NY | 57053 | 12-Mar-23 | 10540 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | 10-Mar-23 | 10539 |
3 | Almena State Bank | Almena | KS | 15426 | 23-Oct-20 | 10538 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | 16-Oct-20 | 10537 |
5 | The First State Bank | Barboursville | WV | 14361 | 3-Apr-20 | 10536 |
6 | Ericson State Bank | Ericson | NE | 18265 | 14-Feb-20 | 10535 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | 1-Nov-19 | 10534 |
8 | Resolute Bank | Maumee | OH | 58317 | 25-Oct-19 | 10533 |
9 | Louisa Community Bank | Louisa | KY | 58112 | 25-Oct-19 | 10532 |
... | ... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | 22314 | 7-Feb-02 | 4653 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | 1-Feb-02 | 4652 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | 18-Jan-02 | 4651 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | 11-Jan-02 | 4650 |
560 | Sinclair National Bank | Gravette | AR | 34248 | 7-Sep-01 | 4649 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | 27-Jul-01 | 6004 |
562 | Malta National Bank | Malta | OH | 6629 | 3-May-01 | 4648 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | 2-Feb-01 | 4647 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | 14-Dec-00 | 4646 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | 13-Oct-00 | 4645 |
566 rows × 6 columns
Sometimes we would want to drop a row only if that row has a missing value in a specific column. We can use the subset parameter to specify the column(s) to look for missing values.
# Specify the columns to look for missing values
banks_df.dropna(subset=['Acquiring Institution', 'City'])
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | |
---|---|---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | 59017 | JPMorgan Chase Bank, N.A. | 1-May-23 | 10543 |
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 | 10540 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 | 10539 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 | 10538 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 | 10537 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 | 10536 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 | 10535 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 | 10534 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 | 10533 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 | 10532 |
... | ... | ... | ... | ... | ... | ... | ... |
555 | Net 1st National Bank | Boca Raton | FL | 26652 | Bank Leumi USA | 1-Mar-02 | 4654 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 | 4652 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 | 4651 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 | 4650 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 | 4649 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 | 6004 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 | 4648 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 | 4647 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 | 4646 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 | 4645 |
535 rows × 7 columns
Note that the .dropna()
method only returns a copy. This means that any change you make using the .dropna()
method will not affect the original dataframe. To make the change permanent, you can assign the result to the variable where you store the original dataframe to update it.
Oftentimes, you would want to maintain the rows and columns that have missing values. However, you would want to fill the cells with NaN values with some values which are of the same data type as the other cells in the same column. In this way, when you apply a certain function to a column in a dataframe, you will not run into type error. A common practice to deal with this kind of problem is to use the .fillna()
method.
# Fill the missing values
banks_df['Acquiring Institution'].fillna('No Acquirer')
0 JPMorgan Chase Bank, N.A.
1 Flagstar Bank, N.A.
2 First Citizens Bank & Trust Company
3 Equity Bank
4 United Fidelity Bank, fsb
5 MVB Bank, Inc.
6 Farmers and Merchants Bank
7 Industrial Bank
8 Buckeye State Bank
9 Kentucky Farmers Bank Corporation
...
556 No Acquirer
557 The State Bank & Trust Company
558 The Security State Bank of Pecos
559 Israel Discount Bank of New York
560 Delta Trust & Bank
561 Superior Federal, FSB
562 North Valley Bank
563 Southern New Hampshire Bank & Trust
564 Banterra Bank of Marion
565 Bank of the Orient
Name: Acquiring Institution, Length: 566, dtype: object
Coding Challenge! < / >
Let’s grab the Shakespeare dataframe we have created and do some further filtering.
When you explore the Shakespeare dataframe, what did you find about the column pagestart
? What did you find about the column pageEnd
? Suppose you will need the page range information in your analysis. Therefore, you will want to drop any rows with missing values in these two columns. How do you do it?
# Drop any rows with missing values in 'pageStart' or 'pageEnd'
Filter data using conditionals#
Conditional selection using df.loc[]
is a very common method to filter a dataframe.
You write a filtering condition to filter a target column. The condition then checks, for each cell in the target column, whether it fulfills the condition or not. The results will be returned as a Series of True/False values. The .loc
indexer then uses this Series to select the rows that have True values.
Suppose you are interested in the banks that failed since 2000 in the state of Georgia. From the original dataframe, you would like to get all the rows of the failed banks in Georgia. How do you do it?
# Write a filtering condition
banks_df['State'] == 'GA' # Create a boolean mask over the column 'State'
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
...
556 False
557 False
558 False
559 False
560 False
561 False
562 False
563 False
564 False
565 False
Name: State, Length: 566, dtype: bool
# Assign the filtering condition to a variable
filt = (banks_df['State'] == 'GA') # Use parenthesis for better reading
# Put the Series returned by the filtering condition within the hard brackets of banks_df.loc[]
banks_df.loc[filt]
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | |
---|---|---|---|---|---|---|---|
20 | The Woodbury Banking Company | Woodbury | GA | 11297 | United Bank | 19-Aug-16 | 10521 |
25 | The Bank of Georgia | Peachtree City | GA | 35259 | Fidelity Bank | 2-Oct-15 | 10516 |
29 | Capitol City Bank & Trust Company | Atlanta | GA | 33938 | First-Citizens Bank & Trust Company | 13-Feb-15 | 10512 |
37 | Eastside Commercial Bank | Conyers | GA | 58125 | Community & Southern Bank | 18-Jul-14 | 10504 |
62 | Sunrise Bank | Valdosta | GA | 58185 | Synovus Bank | 10-May-13 | 10481 |
64 | Douglas County Bank | Douglasville | GA | 21649 | Hamilton State Bank | 26-Apr-13 | 10476 |
70 | Frontier Bank | LaGrange | GA | 16431 | HeritageBank of the South | 8-Mar-13 | 10471 |
75 | Hometown Community Bank | Braselton | GA | 57928 | CertusBank, National Association | 16-Nov-12 | 10466 |
86 | Jasper Banking Company | Jasper | GA | 16240 | Stearns Bank N.A. | 27-Jul-12 | 10455 |
89 | First Cherokee State Bank | Woodstock | GA | 32711 | Community & Southern Bank | 20-Jul-12 | 10450 |
... | ... | ... | ... | ... | ... | ... | ... |
496 | FirstCity Bank | Stockbridge | GA | 18243 | NaN | 20-Mar-09 | 10047 |
497 | Freedom Bank of Georgia | Commerce | GA | 57558 | Northeast Georgia Bank | 6-Mar-09 | 10044 |
507 | FirstBank Financial Services | McDonough | GA | 57017 | Regions Bank | 6-Feb-09 | 10036 |
515 | Haven Trust Bank | Duluth | GA | 35379 | Branch Banking & Trust Company, (BB&T) | 12-Dec-08 | 10027 |
516 | First Georgia Community Bank | Jackson | GA | 34301 | United Bank | 5-Dec-08 | 10025 |
519 | Community Bank | Loganville | GA | 16490 | Bank of Essex | 21-Nov-08 | 10022 |
523 | Alpha Bank & Trust | Alpharetta | GA | 58241 | Stearns Bank, N.A. | 24-Oct-08 | 10018 |
529 | Integrity Bank | Alpharetta | GA | 35469 | Regions Bank | 29-Aug-08 | 10012 |
540 | NetBank | Alpharetta | GA | 32575 | ING DIRECT | 28-Sep-07 | 10001 |
551 | AmTrade International Bank | Atlanta | GA | 33784 | NaN | 30-Sep-02 | 4657 |
93 rows × 7 columns
Out of the rows that fulfill the filtering condition, we can further specify which columns to be returned.
# Specify a single column to be returned
banks_df.loc[filt, 'Bank Name']
20 The Woodbury Banking Company
25 The Bank of Georgia
29 Capitol City Bank & Trust Company
37 Eastside Commercial Bank
62 Sunrise Bank
64 Douglas County Bank
70 Frontier Bank
75 Hometown Community Bank
86 Jasper Banking Company
89 First Cherokee State Bank
...
496 FirstCity Bank
497 Freedom Bank of Georgia
507 FirstBank Financial Services
515 Haven Trust Bank
516 First Georgia Community Bank
519 Community Bank
523 Alpha Bank & Trust
529 Integrity Bank
540 NetBank
551 AmTrade International Bank
Name: Bank Name, Length: 93, dtype: object
Of course, we can select muliple columns to be returned out of the filtered rows.
# Specify multiple columns to be returned
banks_df.loc[filt, ['Bank Name', 'Fund']]
Bank Name | Fund | |
---|---|---|
20 | The Woodbury Banking Company | 10521 |
25 | The Bank of Georgia | 10516 |
29 | Capitol City Bank & Trust Company | 10512 |
37 | Eastside Commercial Bank | 10504 |
62 | Sunrise Bank | 10481 |
64 | Douglas County Bank | 10476 |
70 | Frontier Bank | 10471 |
75 | Hometown Community Bank | 10466 |
86 | Jasper Banking Company | 10455 |
89 | First Cherokee State Bank | 10450 |
... | ... | ... |
496 | FirstCity Bank | 10047 |
497 | Freedom Bank of Georgia | 10044 |
507 | FirstBank Financial Services | 10036 |
515 | Haven Trust Bank | 10027 |
516 | First Georgia Community Bank | 10025 |
519 | Community Bank | 10022 |
523 | Alpha Bank & Trust | 10018 |
529 | Integrity Bank | 10012 |
540 | NetBank | 10001 |
551 | AmTrade International Bank | 4657 |
93 rows × 2 columns
Now suppose you want to get all the failed banks whose name contains the word ‘community’.
# Get all the banks with the word 'community' in their name
filt = (banks_df['Bank Name'].str.contains('Community'))
banks_df.loc[filt, ['Bank Name']]
Bank Name | |
---|---|
9 | Louisa Community Bank |
18 | Harvest Community Bank |
30 | Highland Community Bank |
50 | Texas Community Bank, National Association |
53 | The Community's Bank |
55 | Community South Bank |
57 | First Community Bank of Southwest Florida (als... |
63 | Pisgah Community Bank |
66 | Chipola Community Bank |
73 | Westside Community Bank |
... | ... |
442 | Community First Bank |
443 | Community National Bank of Sarasota County |
447 | Peoples Community Bank |
472 | Neighborhood Community Bank |
473 | Community Bank of West Georgia |
476 | Southern Community Bank |
483 | Citizens Community Bank |
499 | Heritage Community Bank |
516 | First Georgia Community Bank |
519 | Community Bank |
63 rows × 1 columns
Conjunction of multiple filtering conditions: &
#
Oftentimes, you would want to filter a dataframe based on more complex conditions. For example, suppose you would like to get the banks in GA that were closed between 2008 and 2010. How do you use df.loc[ ]
to achieve it?
The location of the failed banks is stored in the State
column. The closing year of the banks is stored in the Closing Date
column.
# Create the first filtering condition restricting the state
filt1 = (banks_df['State'] == 'GA')
How to get the closing year of the banks?
# Get the data type of the 'Closing Date' column
banks_df['Closing Date'].info()
<class 'pandas.core.series.Series'>
RangeIndex: 566 entries, 0 to 565
Series name: Closing Date
Non-Null Count Dtype
-------------- -----
566 non-null object
dtypes: object(1)
memory usage: 4.6+ KB
# Create a new column storing the closing year of the banks
banks_df['Closing Year'] = banks_df['Closing Date'].str[-2:].astype(int) + 2000
banks_df
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | Closing Year | |
---|---|---|---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | 59017 | JPMorgan Chase Bank, N.A. | 1-May-23 | 10543 | 2023 |
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 | 10540 | 2023 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 | 10539 | 2023 |
3 | Almena State Bank | Almena | KS | 15426 | Equity Bank | 23-Oct-20 | 10538 | 2020 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 | 10537 | 2020 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 | 10536 | 2020 |
6 | Ericson State Bank | Ericson | NE | 18265 | Farmers and Merchants Bank | 14-Feb-20 | 10535 | 2020 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 | 10534 | 2019 |
8 | Resolute Bank | Maumee | OH | 58317 | Buckeye State Bank | 25-Oct-19 | 10533 | 2019 |
9 | Louisa Community Bank | Louisa | KY | 58112 | Kentucky Farmers Bank Corporation | 25-Oct-19 | 10532 | 2019 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
556 | NextBank, NA | Phoenix | AZ | 22314 | NaN | 7-Feb-02 | 4653 | 2002 |
557 | Oakwood Deposit Bank Co. | Oakwood | OH | 8966 | The State Bank & Trust Company | 1-Feb-02 | 4652 | 2002 |
558 | Bank of Sierra Blanca | Sierra Blanca | TX | 22002 | The Security State Bank of Pecos | 18-Jan-02 | 4651 | 2002 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 | 4650 | 2002 |
560 | Sinclair National Bank | Gravette | AR | 34248 | Delta Trust & Bank | 7-Sep-01 | 4649 | 2001 |
561 | Superior Bank, FSB | Hinsdale | IL | 32646 | Superior Federal, FSB | 27-Jul-01 | 6004 | 2001 |
562 | Malta National Bank | Malta | OH | 6629 | North Valley Bank | 3-May-01 | 4648 | 2001 |
563 | First Alliance Bank & Trust Co. | Manchester | NH | 34264 | Southern New Hampshire Bank & Trust | 2-Feb-01 | 4647 | 2001 |
564 | National State Bank of Metropolis | Metropolis | IL | 3815 | Banterra Bank of Marion | 14-Dec-00 | 4646 | 2000 |
565 | Bank of Honolulu | Honolulu | HI | 21029 | Bank of the Orient | 13-Oct-00 | 4645 | 2000 |
566 rows × 8 columns
# Create the second filtering condition restricting the closing year
filt2 = (banks_df['Closing Year'] > 2007) & (banks_df['Closing Year'] < 2011)
With the two filtering conditions, we are ready to extract the banks in GA that failed between 2008 and 2010.
# Use filt1 and filt2 to get the target rows
banks_df.loc[filt1 & filt2]
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | Closing Year | |
---|---|---|---|---|---|---|---|---|
219 | United Americas Bank, N.A. | Atlanta | GA | 35065 | State Bank and Trust Company | 17-Dec-10 | 10323 | 2010 |
220 | Appalachian Community Bank, FSB | McCaysville | GA | 58495 | Peoples Bank of East Tennessee | 17-Dec-10 | 10319 | 2010 |
221 | Chestatee State Bank | Dawsonville | GA | 34578 | Bank of the Ozarks | 17-Dec-10 | 10320 | 2010 |
229 | Darby Bank & Trust Co. | Vidalia | GA | 14580 | Ameris Bank | 12-Nov-10 | 10312 | 2010 |
230 | Tifton Banking Company | Tifton | GA | 57831 | Ameris Bank | 12-Nov-10 | 10313 | 2010 |
238 | The First National Bank of Barnesville | Barnesville | GA | 2119 | United Bank | 22-Oct-10 | 10304 | 2010 |
239 | The Gordon Bank | Gordon | GA | 33904 | Morris Bank | 22-Oct-10 | 10305 | 2010 |
251 | The Peoples Bank | Winder | GA | 182 | Community & Southern Bank | 17-Sep-10 | 10292 | 2010 |
252 | First Commerce Community Bank | Douglasville | GA | 57448 | Community & Southern Bank | 17-Sep-10 | 10289 | 2010 |
253 | Bank of Ellijay | Ellijay | GA | 58197 | Community & Southern Bank | 17-Sep-10 | 10287 | 2010 |
270 | Northwest Bank & Trust | Acworth | GA | 57658 | State Bank and Trust Company | 30-Jul-10 | 10274 | 2010 |
276 | Crescent Bank and Trust Company | Jasper | GA | 27559 | Renasant Bank | 23-Jul-10 | 10265 | 2010 |
289 | First National Bank | Savannah | GA | 34152 | The Savannah Bank, N.A. | 25-Jun-10 | 10251 | 2010 |
305 | Satilla Community Bank | Saint Marys | GA | 35114 | Ameris Bank | 14-May-10 | 10238 | 2010 |
334 | Unity National Bank | Cartersville | GA | 34678 | Bank of the Ozarks | 26-Mar-10 | 10208 | 2010 |
336 | McIntosh Commercial Bank | Carrollton | GA | 57399 | CharterBank | 26-Mar-10 | 10207 | 2010 |
339 | Bank of Hiawassee | Hiawassee | GA | 10054 | Citizens South Bank | 19-Mar-10 | 10202 | 2010 |
340 | Appalachian Community Bank | Ellijay | GA | 33989 | Community & Southern Bank | 19-Mar-10 | 10199 | 2010 |
342 | Century Security Bank | Duluth | GA | 58104 | Bank of Upson | 19-Mar-10 | 10198 | 2010 |
361 | Community Bank and Trust | Cornelia | GA | 5702 | SCBT National Association | 29-Jan-10 | 10180 | 2010 |
364 | First National Bank of Georgia | Carrollton | GA | 16480 | Community & Southern Bank | 29-Jan-10 | 10179 | 2010 |
380 | RockBridge Commercial Bank | Atlanta | GA | 58315 | NaN | 18-Dec-09 | 10164 | 2009 |
387 | The Tattnall Bank | Reidsville | GA | 12080 | Heritage Bank of the South | 4-Dec-09 | 10153 | 2009 |
388 | First Security National Bank | Norcross | GA | 26290 | State Bank and Trust Company | 4-Dec-09 | 10157 | 2009 |
389 | The Buckhead Community Bank | Atlanta | GA | 34663 | State Bank and Trust Company | 4-Dec-09 | 10152 | 2009 |
398 | United Security Bank | Sparta | GA | 22286 | Ameris Bank | 6-Nov-09 | 10145 | 2009 |
413 | American United Bank | Lawrenceville | GA | 57794 | Ameris Bank | 23-Oct-09 | 10127 | 2009 |
419 | Georgian Bank | Atlanta | GA | 57151 | First Citizens Bank and Trust Company, Inc. | 25-Sep-09 | 10122 | 2009 |
435 | First Coweta Bank | Newnan | GA | 57702 | United Bank | 21-Aug-09 | 10108 | 2009 |
436 | ebank | Atlanta | GA | 34682 | Stearns Bank, N.A. | 21-Aug-09 | 10107 | 2009 |
450 | Security Bank of Jones County | Gray | GA | 8486 | State Bank and Trust Company | 24-Jul-09 | 10088 | 2009 |
451 | Security Bank of Houston County | Perry | GA | 27048 | State Bank and Trust Company | 24-Jul-09 | 10087 | 2009 |
452 | Security Bank of Bibb County | Macon | GA | 27367 | State Bank and Trust Company | 24-Jul-09 | 10085 | 2009 |
453 | Security Bank of North Metro | Woodstock | GA | 57105 | State Bank and Trust Company | 24-Jul-09 | 10090 | 2009 |
454 | Security Bank of North Fulton | Alpharetta | GA | 57430 | State Bank and Trust Company | 24-Jul-09 | 10089 | 2009 |
455 | Security Bank of Gwinnett County | Suwanee | GA | 57346 | State Bank and Trust Company | 24-Jul-09 | 10086 | 2009 |
460 | First Piedmont Bank | Winder | GA | 34594 | First American Bank and Trust Company | 17-Jul-09 | 10084 | 2009 |
472 | Neighborhood Community Bank | Newnan | GA | 35285 | CharterBank | 26-Jun-09 | 10069 | 2009 |
473 | Community Bank of West Georgia | Villa Rica | GA | 57436 | NaN | 26-Jun-09 | 10068 | 2009 |
476 | Southern Community Bank | Fayetteville | GA | 35251 | United Community Bank | 19-Jun-09 | 10067 | 2009 |
484 | Silverton Bank, NA | Atlanta | GA | 26535 | NaN | 1-May-09 | 10059 | 2009 |
488 | American Southern Bank | Kennesaw | GA | 57943 | Bank of North Georgia | 24-Apr-09 | 10053 | 2009 |
493 | Omni National Bank | Atlanta | GA | 22238 | NaN | 27-Mar-09 | 10048 | 2009 |
496 | FirstCity Bank | Stockbridge | GA | 18243 | NaN | 20-Mar-09 | 10047 | 2009 |
497 | Freedom Bank of Georgia | Commerce | GA | 57558 | Northeast Georgia Bank | 6-Mar-09 | 10044 | 2009 |
507 | FirstBank Financial Services | McDonough | GA | 57017 | Regions Bank | 6-Feb-09 | 10036 | 2009 |
515 | Haven Trust Bank | Duluth | GA | 35379 | Branch Banking & Trust Company, (BB&T) | 12-Dec-08 | 10027 | 2008 |
516 | First Georgia Community Bank | Jackson | GA | 34301 | United Bank | 5-Dec-08 | 10025 | 2008 |
519 | Community Bank | Loganville | GA | 16490 | Bank of Essex | 21-Nov-08 | 10022 | 2008 |
523 | Alpha Bank & Trust | Alpharetta | GA | 58241 | Stearns Bank, N.A. | 24-Oct-08 | 10018 | 2008 |
529 | Integrity Bank | Alpharetta | GA | 35469 | Regions Bank | 29-Aug-08 | 10012 | 2008 |
Note that when we extract rows that fulfill multiple conditions, we use &
in Pandas, not and
. If you replace &
with and
, you will get an error. This is different than what we have learned about boolean operators in Python basics 2. In Python, we use and
, or
and not
. In Pandas, we use &
, |
and ~
intead.
Pandas Operator |
Boolean |
Requires |
---|---|---|
& |
and |
All required to |
| |
or |
If any are |
~ |
not |
The opposite |
Although we use different symbols for these boolean operators, the truth table for them stays the same. For a quick review of the truth table, see Python basics 2.
Disjunction of multiple filtering conditions: |
#
Suppose you would like to take a look at all the failed banks in the state of Georgia or the state of New York. How do you use df.loc[]
to get the target rows?
# Create the two filtering conditions restricting the state to GA and NY
filt1 = (banks_df['State'] == 'GA')
filt2 = (banks_df['State'] == 'NY')
# Use filt1 and filt2 to get the target rows
banks_df.loc[filt1|filt2]
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | Closing Year | |
---|---|---|---|---|---|---|---|---|
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 | 10540 | 2023 |
20 | The Woodbury Banking Company | Woodbury | GA | 11297 | United Bank | 19-Aug-16 | 10521 | 2016 |
25 | The Bank of Georgia | Peachtree City | GA | 35259 | Fidelity Bank | 2-Oct-15 | 10516 | 2015 |
29 | Capitol City Bank & Trust Company | Atlanta | GA | 33938 | First-Citizens Bank & Trust Company | 13-Feb-15 | 10512 | 2015 |
37 | Eastside Commercial Bank | Conyers | GA | 58125 | Community & Southern Bank | 18-Jul-14 | 10504 | 2014 |
62 | Sunrise Bank | Valdosta | GA | 58185 | Synovus Bank | 10-May-13 | 10481 | 2013 |
64 | Douglas County Bank | Douglasville | GA | 21649 | Hamilton State Bank | 26-Apr-13 | 10476 | 2013 |
70 | Frontier Bank | LaGrange | GA | 16431 | HeritageBank of the South | 8-Mar-13 | 10471 | 2013 |
75 | Hometown Community Bank | Braselton | GA | 57928 | CertusBank, National Association | 16-Nov-12 | 10466 | 2012 |
86 | Jasper Banking Company | Jasper | GA | 16240 | Stearns Bank N.A. | 27-Jul-12 | 10455 | 2012 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
497 | Freedom Bank of Georgia | Commerce | GA | 57558 | Northeast Georgia Bank | 6-Mar-09 | 10044 | 2009 |
507 | FirstBank Financial Services | McDonough | GA | 57017 | Regions Bank | 6-Feb-09 | 10036 | 2009 |
515 | Haven Trust Bank | Duluth | GA | 35379 | Branch Banking & Trust Company, (BB&T) | 12-Dec-08 | 10027 | 2008 |
516 | First Georgia Community Bank | Jackson | GA | 34301 | United Bank | 5-Dec-08 | 10025 | 2008 |
519 | Community Bank | Loganville | GA | 16490 | Bank of Essex | 21-Nov-08 | 10022 | 2008 |
523 | Alpha Bank & Trust | Alpharetta | GA | 58241 | Stearns Bank, N.A. | 24-Oct-08 | 10018 | 2008 |
529 | Integrity Bank | Alpharetta | GA | 35469 | Regions Bank | 29-Aug-08 | 10012 | 2008 |
540 | NetBank | Alpharetta | GA | 32575 | ING DIRECT | 28-Sep-07 | 10001 | 2007 |
543 | Reliance Bank | White Plains | NY | 26778 | Union State Bank | 19-Mar-04 | 4664 | 2004 |
551 | AmTrade International Bank | Atlanta | GA | 33784 | NaN | 30-Sep-02 | 4657 | 2002 |
99 rows × 8 columns
If you would like to get the data of the failed banks in the following six states — Georgia, New York, New Jersey, Florida, California and West Virginia, you will not want to write six filtering conditions and use the vertical bar |
to connect all of them. That would be too repetitive. In this case, we can use the .isin()
method to create a filtering condition.
# Create a list of the states
states = ['GA', 'NY', 'NJ', 'FL', 'CA', 'WV']
# Create a filtering condition
filt = (banks_df['State'].isin(states))
# Use filt to find all failed banks in the six states
banks_df.loc[filt]
Bank Name | City | State | Cert | Acquiring Institution | Closing Date | Fund | Closing Year | |
---|---|---|---|---|---|---|---|---|
0 | First Republic Bank | San Francisco | CA | 59017 | JPMorgan Chase Bank, N.A. | 1-May-23 | 10543 | 2023 |
1 | Signature Bank | New York | NY | 57053 | Flagstar Bank, N.A. | 12-Mar-23 | 10540 | 2023 |
2 | Silicon Valley Bank | Santa Clara | CA | 24735 | First Citizens Bank & Trust Company | 10-Mar-23 | 10539 | 2023 |
4 | First City Bank of Florida | Fort Walton Beach | FL | 16748 | United Fidelity Bank, fsb | 16-Oct-20 | 10537 | 2020 |
5 | The First State Bank | Barboursville | WV | 14361 | MVB Bank, Inc. | 3-Apr-20 | 10536 | 2020 |
7 | City National Bank of New Jersey | Newark | NJ | 21111 | Industrial Bank | 1-Nov-19 | 10534 | 2019 |
18 | Harvest Community Bank | Pennsville | NJ | 34951 | First-Citizens Bank & Trust Company | 13-Jan-17 | 10523 | 2017 |
20 | The Woodbury Banking Company | Woodbury | GA | 11297 | United Bank | 19-Aug-16 | 10521 | 2016 |
25 | The Bank of Georgia | Peachtree City | GA | 35259 | Fidelity Bank | 2-Oct-15 | 10516 | 2015 |
29 | Capitol City Bank & Trust Company | Atlanta | GA | 33938 | First-Citizens Bank & Trust Company | 13-Feb-15 | 10512 | 2015 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
532 | First Heritage Bank, NA | Newport Beach | CA | 57961 | Mutual of Omaha Bank | 25-Jul-08 | 10009 | 2008 |
534 | IndyMac Bank | Pasadena | CA | 29730 | OneWest Bank, FSB | 11-Jul-08 | 10007 | 2008 |
540 | NetBank | Alpharetta | GA | 32575 | ING DIRECT | 28-Sep-07 | 10001 | 2007 |
543 | Reliance Bank | White Plains | NY | 26778 | Union State Bank | 19-Mar-04 | 4664 | 2004 |
544 | Guaranty National Bank of Tallahassee | Tallahassee | FL | 26838 | Hancock Bank of Florida | 12-Mar-04 | 4663 | 2004 |
545 | Dollar Savings Bank | Newark | NJ | 31330 | NaN | 14-Feb-04 | 6006 | 2004 |
548 | Southern Pacific Bank | Torrance | CA | 27094 | Beal Bank | 7-Feb-03 | 4660 | 2003 |
551 | AmTrade International Bank | Atlanta | GA | 33784 | NaN | 30-Sep-02 | 4657 | 2002 |
555 | Net 1st National Bank | Boca Raton | FL | 26652 | Bank Leumi USA | 1-Mar-02 | 4654 | 2002 |
559 | Hamilton Bank, NA | Miami | FL | 24382 | Israel Discount Bank of New York | 11-Jan-02 | 4650 | 2002 |
228 rows × 8 columns
Negation of a certain condition:~
#
Now, suppose you would like to get all the failed banks that were not closed in 2008. How do you do it?
# Create the filtering condition restricting the closing year to non-2008
filt = (~(banks_df['Closing Year'] == 2008))
# Use the filtering condition to get the target rows with specified columns
banks_df.loc[filt, ['Bank Name', 'City']]
Bank Name | City | |
---|---|---|
0 | First Republic Bank | San Francisco |
1 | Signature Bank | New York |
2 | Silicon Valley Bank | Santa Clara |
3 | Almena State Bank | Almena |
4 | First City Bank of Florida | Fort Walton Beach |
5 | The First State Bank | Barboursville |
6 | Ericson State Bank | Ericson |
7 | City National Bank of New Jersey | Newark |
8 | Resolute Bank | Maumee |
9 | Louisa Community Bank | Louisa |
... | ... | ... |
556 | NextBank, NA | Phoenix |
557 | Oakwood Deposit Bank Co. | Oakwood |
558 | Bank of Sierra Blanca | Sierra Blanca |
559 | Hamilton Bank, NA | Miami |
560 | Sinclair National Bank | Gravette |
561 | Superior Bank, FSB | Hinsdale |
562 | Malta National Bank | Malta |
563 | First Alliance Bank & Trust Co. | Manchester |
564 | National State Bank of Metropolis | Metropolis |
565 | Bank of Honolulu | Honolulu |
541 rows × 2 columns
Coding Challenge! < / >
Let’s do some filtering!
From the Shakespeare dataframe, get the title and the creator of the documents published between 2000 and 2010.
# get the title and creator of docs published between 2000 and 2010
From the Shakespeare dataframe, get the creator of the documents shorter than 10 pages or longer than 50 pages.
# get the creator of the docs shorter than 10 pages or longer than 50 pages
From the Shakespeare dataframe, get the title of the documents whose publisher is not Folger Shakespeare Library.
# get the title of the docs whose publisher is NOT Folger Shakespeare Library
Index a dataframe#
In this section, we’ll continue to work with the dataframe we created in Pandas 1 storing data on the most recent 10 World Cup games.
# Create a dataframe with world cup data
wcup = pd.DataFrame({"Year": [2022,
2018,
2014,
2010,
2006,
2002,
1998,
1994,
1990,
1986],
"Champion": ["Argentina",
"France",
"Germany",
"Spain",
"Italy",
"Brazil",
"France",
"Brazil",
"Germany",
"Argentina"],
"Host": ["Qatar",
"Russia",
"Brazil",
"South Africa",
"Germany",
"Korea/Japan",
"France",
"USA",
"Italy",
"Mexico"],
"Score": ["7-5",
"4-2",
"1-0",
"1-0",
"6-4",
"2-0",
"3-0",
"3-2",
"1-0",
"3-2"]
})
wcup['Goals Scored'] = wcup['Score'].str[0]
wcup['Goals Conceded'] = wcup['Score'].str[-1]
wcup['Difference'] = wcup['Goals Scored'].astype(int) - wcup['Goals Conceded'].astype(int)
wcup
Year | Champion | Host | Score | Goals Scored | Goals Conceded | Difference | |
---|---|---|---|---|---|---|---|
0 | 2022 | Argentina | Qatar | 7-5 | 7 | 5 | 2 |
1 | 2018 | France | Russia | 4-2 | 4 | 2 | 2 |
2 | 2014 | Germany | Brazil | 1-0 | 1 | 0 | 1 |
3 | 2010 | Spain | South Africa | 1-0 | 1 | 0 | 1 |
4 | 2006 | Italy | Germany | 6-4 | 6 | 4 | 2 |
5 | 2002 | Brazil | Korea/Japan | 2-0 | 2 | 0 | 2 |
6 | 1998 | France | France | 3-0 | 3 | 0 | 3 |
7 | 1994 | Brazil | USA | 3-2 | 3 | 2 | 1 |
8 | 1990 | Germany | Italy | 1-0 | 1 | 0 | 1 |
9 | 1986 | Argentina | Mexico | 3-2 | 3 | 2 | 1 |
Set, reset and use indexes#
We have seen that by default, the rows in a dataframe are numbered by integer indexes starting from 0. The indexes look like a column to the far left without a name.
We can set the index column to one of the columns in the dataframe. This is desirable because a range of integers is not descriptive but a column with a name is descriptive. When we want to locate specific data, descriptive labels are much more useful.
# set the index column to 'Year'
wcup.set_index('Year')
Champion | Host | Score | Goals Scored | Goals Conceded | Difference | |
---|---|---|---|---|---|---|
Year | ||||||
2022 | Argentina | Qatar | 7-5 | 7 | 5 | 2 |
2018 | France | Russia | 4-2 | 4 | 2 | 2 |
2014 | Germany | Brazil | 1-0 | 1 | 0 | 1 |
2010 | Spain | South Africa | 1-0 | 1 | 0 | 1 |
2006 | Italy | Germany | 6-4 | 6 | 4 | 2 |
2002 | Brazil | Korea/Japan | 2-0 | 2 | 0 | 2 |
1998 | France | France | 3-0 | 3 | 0 | 3 |
1994 | Brazil | USA | 3-2 | 3 | 2 | 1 |
1990 | Germany | Italy | 1-0 | 1 | 0 | 1 |
1986 | Argentina | Mexico | 3-2 | 3 | 2 | 1 |
Take a look at the original dataframe, is it changed?
# Take a look at the original dataframe
wcup
Year | Champion | Host | Score | Goals Scored | Goals Conceded | Difference | |
---|---|---|---|---|---|---|---|
0 | 2022 | Argentina | Qatar | 7-5 | 7 | 5 | 2 |
1 | 2018 | France | Russia | 4-2 | 4 | 2 | 2 |
2 | 2014 | Germany | Brazil | 1-0 | 1 | 0 | 1 |
3 | 2010 | Spain | South Africa | 1-0 | 1 | 0 | 1 |
4 | 2006 | Italy | Germany | 6-4 | 6 | 4 | 2 |
5 | 2002 | Brazil | Korea/Japan | 2-0 | 2 | 0 | 2 |
6 | 1998 | France | France | 3-0 | 3 | 0 | 3 |
7 | 1994 | Brazil | USA | 3-2 | 3 | 2 | 1 |
8 | 1990 | Germany | Italy | 1-0 | 1 | 0 | 1 |
9 | 1986 | Argentina | Mexico | 3-2 | 3 | 2 | 1 |
The original dataframe is NOT changed after we use the .set_index()
method to change the index column. The .set_index()
method returns a copy, this is why the original dataframe is not affected.
If you want to make the change permanent, you can assign the returned object to the variable where you store the original dataframe to update it.
# Change the index column and commit the change
wcup = wcup.set_index('Year')
wcup
Champion | Host | Score | Goals Scored | Goals Conceded | Difference | |
---|---|---|---|---|---|---|
Year | ||||||
2022 | Argentina | Qatar | 7-5 | 7 | 5 | 2 |
2018 | France | Russia | 4-2 | 4 | 2 | 2 |
2014 | Germany | Brazil | 1-0 | 1 | 0 | 1 |
2010 | Spain | South Africa | 1-0 | 1 | 0 | 1 |
2006 | Italy | Germany | 6-4 | 6 | 4 | 2 |
2002 | Brazil | Korea/Japan | 2-0 | 2 | 0 | 2 |
1998 | France | France | 3-0 | 3 | 0 | 3 |
1994 | Brazil | USA | 3-2 | 3 | 2 | 1 |
1990 | Germany | Italy | 1-0 | 1 | 0 | 1 |
1986 | Argentina | Mexico | 3-2 | 3 | 2 | 1 |
You can change the index column back to the default integer index column using the reset_index()
method.
# reset the indexes to the integer indexes
wcup.reset_index()
Year | Champion | Host | Score | Goals Scored | Goals Conceded | Difference | |
---|---|---|---|---|---|---|---|
0 | 2022 | Argentina | Qatar | 7-5 | 7 | 5 | 2 |
1 | 2018 | France | Russia | 4-2 | 4 | 2 | 2 |
2 | 2014 | Germany | Brazil | 1-0 | 1 | 0 | 1 |
3 | 2010 | Spain | South Africa | 1-0 | 1 | 0 | 1 |
4 | 2006 | Italy | Germany | 6-4 | 6 | 4 | 2 |
5 | 2002 | Brazil | Korea/Japan | 2-0 | 2 | 0 | 2 |
6 | 1998 | France | France | 3-0 | 3 | 0 | 3 |
7 | 1994 | Brazil | USA | 3-2 | 3 | 2 | 1 |
8 | 1990 | Germany | Italy | 1-0 | 1 | 0 | 1 |
9 | 1986 | Argentina | Mexico | 3-2 | 3 | 2 | 1 |
Sort a dataframe#
A common use of the index column is to sort a dataframe. Here, we have ‘Year’, a numerical column as our index column. When we sort the indexes, by default, the dataframe will be sorted by the index column in an ascending order.
# Sort the dataframe by the index column
wcup.sort_index()
Champion | Host | Score | Goals Scored | Goals Conceded | Difference | |
---|---|---|---|---|---|---|
Year | ||||||
1986 | Argentina | Mexico | 3-2 | 3 | 2 | 1 |
1990 | Germany | Italy | 1-0 | 1 | 0 | 1 |
1994 | Brazil | USA | 3-2 | 3 | 2 | 1 |
1998 | France | France | 3-0 | 3 | 0 | 3 |
2002 | Brazil | Korea/Japan | 2-0 | 2 | 0 | 2 |
2006 | Italy | Germany | 6-4 | 6 | 4 | 2 |
2010 | Spain | South Africa | 1-0 | 1 | 0 | 1 |
2014 | Germany | Brazil | 1-0 | 1 | 0 | 1 |
2018 | France | Russia | 4-2 | 4 | 2 | 2 |
2022 | Argentina | Qatar | 7-5 | 7 | 5 | 2 |
You could set the parameter ascending=False
to sort the indexes in a descending order.
# Specify the ascending order
wcup.sort_index(ascending=False)
Champion | Host | Score | Goals Scored | Goals Conceded | Difference | |
---|---|---|---|---|---|---|
Year | ||||||
2022 | Argentina | Qatar | 7-5 | 7 | 5 | 2 |
2018 | France | Russia | 4-2 | 4 | 2 | 2 |
2014 | Germany | Brazil | 1-0 | 1 | 0 | 1 |
2010 | Spain | South Africa | 1-0 | 1 | 0 | 1 |
2006 | Italy | Germany | 6-4 | 6 | 4 | 2 |
2002 | Brazil | Korea/Japan | 2-0 | 2 | 0 | 2 |
1998 | France | France | 3-0 | 3 | 0 | 3 |
1994 | Brazil | USA | 3-2 | 3 | 2 | 1 |
1990 | Germany | Italy | 1-0 | 1 | 0 | 1 |
1986 | Argentina | Mexico | 3-2 | 3 | 2 | 1 |
Sort by one column#
We can sort the entire dataframe by a column other than the index column. The .sort_values()
method helps us do it.
# Sort the dataframe by the column 'Goals Scored'
wcup.sort_values(by=['Goals Scored'])
Champion | Host | Score | Goals Scored | Goals Conceded | Difference | |
---|---|---|---|---|---|---|
Year | ||||||
2014 | Germany | Brazil | 1-0 | 1 | 0 | 1 |
2010 | Spain | South Africa | 1-0 | 1 | 0 | 1 |
1990 | Germany | Italy | 1-0 | 1 | 0 | 1 |
2002 | Brazil | Korea/Japan | 2-0 | 2 | 0 | 2 |
1998 | France | France | 3-0 | 3 | 0 | 3 |
1994 | Brazil | USA | 3-2 | 3 | 2 | 1 |
1986 | Argentina | Mexico | 3-2 | 3 | 2 | 1 |
2018 | France | Russia | 4-2 | 4 | 2 | 2 |
2006 | Italy | Germany | 6-4 | 6 | 4 | 2 |
2022 | Argentina | Qatar | 7-5 | 7 | 5 | 2 |
Sort by multiple columns#
It is a convention to sort the soccer results first by difference (i.e. how many more goals the champion scored than the runner-up) and then by goals conceded (i.e. how many goals the champion lost). Pandas can easily do that.
# Sort the dataframe by Difference column in descending order
# then by Goals Conceded column in ascending order
wcup.sort_values(by=['Difference', 'Goals Conceded'], ascending=[False, True])
Champion | Host | Score | Goals Scored | Goals Conceded | Difference | |
---|---|---|---|---|---|---|
Year | ||||||
1998 | France | France | 3-0 | 3 | 0 | 3 |
2002 | Brazil | Korea/Japan | 2-0 | 2 | 0 | 2 |
2018 | France | Russia | 4-2 | 4 | 2 | 2 |
2006 | Italy | Germany | 6-4 | 6 | 4 | 2 |
2022 | Argentina | Qatar | 7-5 | 7 | 5 | 2 |
2014 | Germany | Brazil | 1-0 | 1 | 0 | 1 |
2010 | Spain | South Africa | 1-0 | 1 | 0 | 1 |
1990 | Germany | Italy | 1-0 | 1 | 0 | 1 |
1994 | Brazil | USA | 3-2 | 3 | 2 | 1 |
1986 | Argentina | Mexico | 3-2 | 3 | 2 | 1 |
# Reset the index for later use
wcup = wcup.reset_index()
Hierarchical indexing#
Sorting by multiple columns helps us group the data in a certain way. For example, in the world cup dataframe, if we would like to group the data first by the champions and then by the years, we can sort the dataframe by these two columns and set the two columns as the a multi-level index of the dataframe. The Champion
column will be the level 0 index and the Year
column will be the level 1 index. The hierarchical indexing allows us to work with higher dimension data.
# First,sort the two columns we will use for multi-level indexing
# Then, set the index to the composite of Champion and Year
wcup = wcup.sort_values(by=['Champion', 'Year'], ascending=[1,0]).set_index(['Champion', 'Year'])
wcup
Host | Score | Goals Scored | Goals Conceded | Difference | ||
---|---|---|---|---|---|---|
Champion | Year | |||||
Argentina | 2022 | Qatar | 7-5 | 7 | 5 | 2 |
1986 | Mexico | 3-2 | 3 | 2 | 1 | |
Brazil | 2002 | Korea/Japan | 2-0 | 2 | 0 | 2 |
1994 | USA | 3-2 | 3 | 2 | 1 | |
France | 2018 | Russia | 4-2 | 4 | 2 | 2 |
1998 | France | 3-0 | 3 | 0 | 3 | |
Germany | 2014 | Brazil | 1-0 | 1 | 0 | 1 |
1990 | Italy | 1-0 | 1 | 0 | 1 | |
Italy | 2006 | Germany | 6-4 | 6 | 4 | 2 |
Spain | 2010 | South Africa | 1-0 | 1 | 0 | 1 |
If a dataframe has a multi-level index, to access a certain row, we will need to provide a multi-level index in order to access it.
# Get the data on the 1986 world cup game won by Argentina
wcup.loc[('Argentina', 1986)]
Host Mexico
Score 3-2
Goals Scored 3
Goals Conceded 2
Difference 1
Name: (Argentina, 1986), dtype: object
Coding Challenge! < / >
In this coding challenge, we’ll use the Shakespeare dataframe. Use what you have learned about sorting and indexing a dataframe to answer the following questions.
# Take a look at the Shakespeare df
shake_df
Which doc(s) are the longest in this dataset?
# Get the longest doc(s) in the df
Among the docs published in 1983, how many were published by Folger Shakespeare Library?
# Among the docs published in 1983, how many were published by Folger Shakespeare Library
Lesson Complete#
Congratulations! You have completed Pandas Basics 2.
Start Next Lesson: Pandas 3 ->#
Exercise Solutions#
Here are a few solutions for exercises in this lesson.
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
# retrieve the metadata
import constellate
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"
metadata = constellate.get_metadata(dataset_id)
# Create a dataframe
shake_df = pd.read_csv(metadata)
# Explore the dataframe
shake_df.info()
# Drop the columns of doi and the placeofPublication, make the change permanent
shake_df = shake_df.drop(columns=['doi', 'placeOfPublication'])
# Drop the rows with missing values in 'pageStart' or 'pageEnd'
shake_df = shake_df.dropna(subset=['pageStart', 'pageEnd'])
shake_df
# get the title and the creator of the documents published between 2000 and 2010
filt = (shake_df['publicationYear']>1999) & (shake_df['publicationYear']<2011)
shake_df.loc[filt, ['title', 'creator']]
# get the creator of the documents shorter than 10 pages or longer than 50 pages
filt = (shake_df['pageCount']<10)|(shake_df['pageCount']>50)
shake_df.loc[filt, 'creator']
# get the title of the documents whose publisher is not Folger Shakespeare Library
filt = (shake_df['publisher']=='Folger Shakespeare Library')
shake_df.loc[~filt, 'title']
# get the longest doc(s) in the dataset
shake_df.loc[:, ['title', 'pageCount']].sort_values(by='pageCount', ascending=False)
# get the number of docs published by Folger Shakespeare Library in 1983
shake_df.set_index(['publicationYear', 'publisher']).sort_index().loc[(1983, 'Folger Shakespeare Library')].shape[0]