Pandas Basics 1

Pandas Basics 1#

Description: This notebook describes how to:

Create a Pandas Series or a DataFrame
Create a dataframe from files of different formats
Explore the data in a dataframe
Access data from a dataframe
Set values in a dataframe
Create a new column based on an existing one

This is the first notebook in a series on learning to use Pandas.

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Beginner

Knowledge Required:

Python Basics (Start Python Basics I)

Knowledge Recommended:

Completion Time: 90 minutes

Data Format: .csv, .xsxl

Libraries Used: Pandas

Research Pipeline: None

Introduction#

Pandas is a Python library that allows you to easily work with tabular data. Most people are familiar with commercial spreadsheet software, such as Microsoft Excel or Google Sheets. While spreadsheet software and Pandas can accomplish similar tasks, each has significant advantages depending on the use-case.

Advantages of Spreadsheet Software

Point and click
Easier to learn
Great for small datasets (<10,000 rows)
Better for browsing data

Advantages of Pandas

More powerful data manipulation with Python
Can work with large datasets (millions of rows)
Faster for complicated manipulations
Better for cleaning and/or pre-processing data
Can automate workflows in a larger data pipeline

In short, spreadsheet software is better for browsing small datasets and making moderate adjustments. Pandas is better for automating data cleaning processes that require large or complex data manipulation.

# import pandas, `as pd` allows us to shorten typing `pandas` to `pd` when we call pandas
import pandas as pd

Pandas Series and Pandas DataFrame#

In Pandas, data are stored in two fundamental objects:

Pandas Series - a single column or row of data
Pandas DataFrame - a table of data containing multiple columns and rows

Pandas Series#

We can think of a Series as a single column or row of data. Here we have a column called Champions with the country names of the winners of the most recent ten FIFA world cup games.

Champions
Argentina
France
Germany
Spain
Italy
Brazil
France
Brazil
Germany
Argentina

Let’s create a Series based on this column. To create our Series, we pass a list into the .Series() method:

# Create a data series object in Pandas
champions = pd.Series(["Argentina",
                       "France", 
                       "Germany", 
                       "Spain", 
                       "Italy", 
                       "Brazil", 
                       "France", 
                       "Brazil", 
                       "Germany",
                       "Argentina"]
                    )

# Take a look at the Series
champions

  Argentina
     France
    Germany
      Spain
      Italy
     Brazil
     France
     Brazil
    Germany
  Argentina
dtype: object

As you can see, except the data column, we also have an index column. By default, the indexes are numbers starting from 0. We could define the indexes ourselves. To do that, we will pass a dictionary to the .Series() method. The keys of the dictionary will be used as indexes.

# Use self-defined indexes
pd.Series({2022: "Argentina", 
           2018: "France", 
           2014: "Germany", 
           2010: "Spain", 
           2006: "Italy", 
           2002: "Brazil", 
           1998: "France", 
           1994: "Brazil", 
           1990: "Germany",
           1986: "Argentina"}
         )

  Argentina
     France
    Germany
      Spain
      Italy
     Brazil
     France
     Brazil
    Germany
  Argentina
dtype: object

You can give a name to your Pandas Series using the name parameter.

# give a name to the series
pd.Series({2022: "Argentina", 
           2018: "France", 
           2014: "Germany", 
           2010: "Spain", 
           2006: "Italy", 
           2002: "Brazil", 
           1998: "France", 
           1994: "Brazil", 
           1990: "Germany",
           1986: "Argentina"}, 
          name = 'World Cup Champions'
         )

  Argentina
     France
    Germany
      Spain
      Italy
     Brazil
     France
     Brazil
    Germany
  Argentina
Name: World Cup Champions, dtype: object

Pandas DataFrame#

While a Pandas Series is 1-dimensional with a single column/row of data, a Pandas DataFrame is 2-dimensional and can have multiple columns and rows.

Year	Champion	Host
2022	Argentina	Qatar
2018	France	Russia
2014	Germany	Brazil
2010	Spain	South Africa
2006	Italy	Germany
2002	Brazil	Korea/Japan
1998	France	France
1994	Brazil	USA
1990	Germany	Italy
1986	Argentina	Mexico

Let’s create a Pandas DataFrame based on this table. To create our dataframe, we pass a dictionary into the DataFrame() method. Each key:value pair will form a column in the dataframe, with the key as the column name and the value as the data in that column.

# Create a Pandas dataframe
wcup = pd.DataFrame({"Year": [2022, 
                              2018, 
                              2014, 
                              2010, 
                              2006, 
                              2002, 
                              1998, 
                              1994, 
                              1990,
                              1986], 
                     "Champion": ["Argentina", 
                                  "France", 
                                  "Germany", 
                                  "Spain", 
                                  "Italy", 
                                  "Brazil", 
                                  "France", 
                                  "Brazil", 
                                  "Germany", 
                                  "Argentina"], 
                     "Host": ["Qatar", 
                              "Russia", 
                              "Brazil", 
                              "South Africa", 
                              "Germany", 
                              "Korea/Japan", 
                              "France", 
                              "USA", 
                              "Italy", 
                              "Mexico"]
                    })

wcup

	Year	Champion	Host
0	2022	Argentina	Qatar
1	2018	France	Russia
2	2014	Germany	Brazil
3	2010	Spain	South Africa
4	2006	Italy	Germany
5	2002	Brazil	Korea/Japan
6	1998	France	France
7	1994	Brazil	USA
8	1990	Germany	Italy
9	1986	Argentina	Mexico

In a Pandas dataframe, each row/column is technically a Pandas Series. We can see this by selecting the first row with the iloc method and check its type.

# Get the type of a row in a dataframe
type(wcup.iloc[0]) 

pandas.core.series.Series

Let’s select a column and check its type.

# Get the type of a column in a dataframe
type(wcup['Champion'])

pandas.core.series.Series

We will describe row/column selection in greater detail below.

Coding Challenge! < / >

You are a middle school teacher. You teach the Butterfly Class and the Hippo Class. Last week, the Butterfly class had an English test and a math test. You would like to make a dataframe to record the English grades and math grades of the students in the Butterfly Class.

Make a dataframe with three columns: name, English and Math.

# make a dataframe

Explore the data#

After we build a dataframe, it is helpful to get a general idea of the data. The first step is to explore the dataframe’s attributes. Attributes are properties of the dataframe (not functions), so they do not have parentheses () after them.

Attribute	Reveals
.shape	The number of rows and columns
.columns	The name of each column

To get how many rows and columns a dataframe has, we use the .shape attribute. df.shape returns a tuple with (number of rows, number of columns).

# df.shape returns a tuple (# of rows, # of columns)
wcup.shape

(10, 3)

# Use `.columns` attribute to find the column names
wcup.columns

Index(['Year', 'Champion', 'Host'], dtype='object')

There are some methods we can use to explore the data as well.

Method	Reveals
.info( )	Column count and data type
.head( )	First five rows
.tail( )	Last five rows

# Use `.info()` to get column count and data type
wcup.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Year      10 non-null     int64 
 1   Champion  10 non-null     object
 2   Host      10 non-null     object
dtypes: int64(1), object(2)
memory usage: 372.0+ bytes

We can get a preview of the dataframe. The .head() and .tail() methods help us do that.

# Display the first five rows of the data
wcup.head()

	Year	Champion	Host
0	2022	Argentina	Qatar
1	2018	France	Russia
2	2014	Germany	Brazil
3	2010	Spain	South Africa
4	2006	Italy	Germany

# Display the last five rows of the data
wcup.tail()

	Year	Champion	Host
5	2002	Brazil	Korea/Japan
6	1998	France	France
7	1994	Brazil	USA
8	1990	Germany	Italy
9	1986	Argentina	Mexico

# Specify the number of rows at the beginning of the table to display
wcup.head(8)

	Year	Champion	Host
0	2022	Argentina	Qatar
1	2018	France	Russia
2	2014	Germany	Brazil
3	2010	Spain	South Africa
4	2006	Italy	Germany
5	2002	Brazil	Korea/Japan
6	1998	France	France
7	1994	Brazil	USA

Read and write tabular data in Pandas#

Pandas provides different methods to read and write tabular data. The methods used to read in data from files are .read_*(). The methods used to write data into files are .to_*().

For example, we can create a dataframe from a csv file using the .read_csv() method.

### Download the csv file
from pathlib import Path
import urllib.request

# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)

# Download the file
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/Pandas1_failed_banks_since_2000.csv'
urllib.request.urlretrieve(url, '../data/failed_banks_since_2000.csv')

# Download success message
print('Sample file ready.')

Sample file ready.

# Use the read_csv() method to create a dataframe
failed_banks = pd.read_csv('../data/failed_banks_since_2000.csv')
failed_banks

	Bank Name	City	State	Cert	Acquiring Institution	Closing Date	Fund
0	First Republic Bank	San Francisco	CA	59017	JPMorgan Chase Bank, N.A.	1-May-23	10543
1	Signature Bank	New York	NY	57053	Flagstar Bank, N.A.	12-Mar-23	10540
2	Silicon Valley Bank	Santa Clara	CA	24735	First Citizens Bank & Trust Company	10-Mar-23	10539
3	Almena State Bank	Almena	KS	15426	Equity Bank	23-Oct-20	10538
4	First City Bank of Florida	Fort Walton Beach	FL	16748	United Fidelity Bank, fsb	16-Oct-20	10537
...	...	...	...	...	...	...	...
561	Superior Bank, FSB	Hinsdale	IL	32646	Superior Federal, FSB	27-Jul-01	6004
562	Malta National Bank	Malta	OH	6629	North Valley Bank	3-May-01	4648
563	First Alliance Bank & Trust Co.	Manchester	NH	34264	Southern New Hampshire Bank & Trust	2-Feb-01	4647
564	National State Bank of Metropolis	Metropolis	IL	3815	Banterra Bank of Marion	14-Dec-00	4646
565	Bank of Honolulu	Honolulu	HI	21029	Bank of the Orient	13-Oct-00	4645

566 rows × 7 columns

You can also write the tabular data from a dataframe into a file.

# write the dataframe we created for the world cup champions into a file
wcup.to_csv('../data/wcup_champions.csv')

Pandas can read data from files of many different formats and write data into files of many different formats.

Suppose you would like to read the data of COVID-19 cases in Massachusetts into a dataframe. Can you use a Pandas method to do that?

%pip install openpyxl

# Download the excel file
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/Pandas1_covid_MA_06292023.xlsx'
file = '../data/covid_MA.xlsx'
urllib.request.urlretrieve(url, file)
print('Sample file ready.')

# Read in the data
covid = pd.read_excel(file)
covid

Sample file ready.

	Date	Positive Total	Positive New	Probable Total	Probable New	Estimated active cases
0	2020-06-01	97291	326	NaN	NaN	NaN
1	2020-06-02	97539	248	NaN	NaN	NaN
2	2020-06-03	97964	425	NaN	NaN	NaN
3	2020-06-04	98376	412	NaN	NaN	NaN
4	2020-06-05	98796	420	NaN	NaN	NaN
...	...	...	...	...	...	...
691	2023-06-01	2041048	803	204213.0	325.0	NaN
692	2023-06-08	2041835	787	204574.0	361.0	NaN
693	2023-06-15	2042644	809	204841.0	267.0	NaN
694	2023-06-22	2043337	693	205138.0	297.0	NaN
695	2023-06-29	2043979	642	205513.0	375.0	NaN

696 rows × 6 columns

Now, can you write the dataframe you created into a json file?

# write the dataframe into a json file

Access data#

In this section, we will take a look at the different ways of accessing the data in a dataframe.

For example, once you get the column names, you could access a column of your interest. You can use the bracket notation df[ColumnName] to get a specific column.

# Use bracket notation to access the column 'Champion'
wcup['Champion']

  Argentina
     France
    Germany
      Spain
      Italy
     Brazil
     France
     Brazil
    Germany
  Argentina
Name: Champion, dtype: object

We can also access multiple columns from a dataframe by putting the column names in a list. Note that in this case, you have two layers of hard brackets.

# Access multiple columns
wcup[['Year','Champion']]

	Year	Champion
0	2022	Argentina
1	2018	France
2	2014	Germany
3	2010	Spain
4	2006	Italy
5	2002	Brazil
6	1998	France
7	1994	Brazil
8	1990	Germany
9	1986	Argentina

The indexers `.iloc` and `.loc`#

In Pandas, there are two indexers .iloc and .loc that are often used to access data in a dataframe.

.iloc#

.iloc allows us to access a row or a column using its integer location.

Recall that in a dataframe, by default, to the left of each row are index numbers. The index numbers are similar to the index numbers for a Python list; they help us reference a particular row for data retrieval. Also, like a Python list, the index begins with 0.

We can retrieve data using the .iloc attribute. The syntax of .iloc indexer is df.iloc[row selection, column selection].

# Access a single row
wcup.iloc[5] # Access the row with the index number 5

Year               2002
Champion         Brazil
Host        Korea/Japan
Name: 5, dtype: object

When we select multiple consecutive rows from a dataframe, we give a starting index and an ending index. Notice that the selected rows will not include the final index row.

# Access multiple consecutive rows
wcup.iloc[2:5] # Access the rows with the index number 2, 3, and 4

	Year	Champion	Host
2	2014	Germany	Brazil
3	2010	Spain	South Africa
4	2006	Italy	Germany

# Access multiple non-consecutive rows
wcup.iloc[[0,2,5]] # Access the rows with the index number 0, 2, and 5

	Year	Champion	Host
0	2022	Argentina	Qatar
2	2014	Germany	Brazil
5	2002	Brazil	Korea/Japan

# access every other row in wcup

We have seen how we can access rows from a dataframe using the .iloc indexer. In the following, we will use the .iloc indexer to access columns. Recall that the syntax of .iloc indexer is df.iloc[row selection, column selection]. Again, the index numbers for the columns are similar to the index numbers for a Python list; they help us reference a particular column for data retrieval. Also, like a Python list, the index begins with 0.

# Access a single column
wcup.iloc[:,1] # Access the column with the index number 1

  Argentina
     France
    Germany
      Spain
      Italy
     Brazil
     France
     Brazil
    Germany
  Argentina
Name: Champion, dtype: object

Note that we cannot use the column name, i.e., the header, to access a column because .iloc accesses data using their integer location. If we try to access a column using its column name, we get an error!

# .iloc cannot access a column by its name
wcup.iloc[:,'Champion']

We can use integer slice to access multiple columns from a dataframe.

# Access multiple consecutive columns 
wcup.iloc[:,1:3] # Access the second and third column of the dataframe wcup 

	Champion	Host
0	Argentina	Qatar
1	France	Russia
2	Germany	Brazil
3	Spain	South Africa
4	Italy	Germany
5	Brazil	Korea/Japan
6	France	France
7	Brazil	USA
8	Germany	Italy
9	Argentina	Mexico

# Access multiple non-consecutive columns
wcup.iloc[:,[0,2]] # Access the first and third column of the dataframe wcup 

	Year	Host
0	2022	Qatar
1	2018	Russia
2	2014	Brazil
3	2010	South Africa
4	2006	Germany
5	2002	Korea/Japan
6	1998	France
7	1994	USA
8	1990	Italy
9	1986	Mexico

Now that you know how to select rows and columns from a dataframe using .iloc. You should be able to figure out how to get a slice of a dataframe using .iloc. For example, if you would like to know the champion of the world cup games between 1994 and 2010. How do you slice the dataframe wcup to get the part you are interested in?

# Slice the dataframe using .iloc[ ]

.loc#

While .iloc is integer-based, .loc is label-based. It means that you have to access rows and columns based on their row and column labels.

The syntax of .loc is df.loc[row selection, column selection].

At the moment, the labels for the rows are just their index numbers. When we use .loc to access a row, it will look very similar to what we did with .iloc.

# Access a row using .loc
wcup.loc[0]

Year             2022
Champion    Argentina
Host            Qatar
Name: 0, dtype: object

But we could make our index column customized. For example, we could use the column Year as the index column.

# Set the column 'Year' as the index column
wcup = wcup.set_index('Year')
wcup

	Champion	Host
Year
2022	Argentina	Qatar
2018	France	Russia
2014	Germany	Brazil
2010	Spain	South Africa
2006	Italy	Germany
2002	Brazil	Korea/Japan
1998	France	France
1994	Brazil	USA
1990	Germany	Italy
1986	Argentina	Mexico

After we make the change, we will use the new labels to access the rows.

# Access a row using .loc
wcup.loc[2006]

Champion      Italy
Host        Germany
Name: 2006, dtype: object

# Access multiple consecutive rows
wcup.loc[2018:2010] 

	Champion	Host
Year
2018	France	Russia
2014	Germany	Brazil
2010	Spain	South Africa

Note that with the label search, the ending index row is included.

# Access multiple non-consecutive rows
wcup.loc[[1994, 2002, 2010]]

	Champion	Host
Year
1994	Brazil	USA
2002	Brazil	Korea/Japan
2010	Spain	South Africa

# Access a column
wcup.loc[:, 'Host']

Year
         Qatar
        Russia
        Brazil
  South Africa
       Germany
   Korea/Japan
        France
           USA
         Italy
        Mexico
Name: Host, dtype: object

Now that you know how to select rows and columns from a dataframe using .loc[ ]. You should be able to figure out how to get a slice of a dataframe using .loc[ ]. For example, if you would like to know the champion of the world cup games between 1994 and 2010. How do you slice the dataframe wcup to get the part you are interested in?

# Slice the dataframe using .loc[ ]

As a quick reminder, remember that .iloc[] slicing is not inclusive of the final value. On the other hand, .loc[] slicing does include the final value.

The indexers `.iat` and `.at`#

We have learned how to use the two indexers .iloc and .loc to access rows and columns from a dataframe. In real life, sometimes we only want to access the value in a single cell. In this case, the fastest way is to use the .iat and at indexers. We can now tell from the name that iat provides integer-based lookups while at provides label-based lookups.

Suppose we would like to get the champion country of the 2002 world cup. How do you do that?

# Get the champion country of the 2002 world cup using .at[]
wcup.at[2002, 'Champion']

'Brazil'

# Get the champion country of the 2002 world cup using .iat[]
wcup.iat[5, 0]

'Brazil'

Coding Challenge! < / >

You are a middle school teacher. You have a .csv file that stores the English grades and Math grades of the students in your Butterfly class. Can you use a .read_*() method to read in the data from the file and create a dataframe? After that, can you use .iloc[] or .loc[] to get the Math grades of the first three students from the dataframe? After you get the slice of the dataframe with the Math grades of the first three students, can you write the data into a file of .xlsx format?

# Download the csv file
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/Pandas1_grades.csv'
grades_file = '../data/Grades.csv'
urllib.request.urlretrieve(url, grades_file)
print('Sample file ready.')

# Read in the data and create a dataframe


# Select the Math grades for the first three students


# Write the slice into a file of .xlsx format

Sample file ready.

Set values#

Now that we know how to access data from a dataframe, we could easily use what we have learned to set the values in a dataframe. To do that, we will use assignment statements you have learned in Python Basics.

# Get the data
grades = pd.read_csv('../data/Grades.csv')
grades

	Name	English	Math
0	John Smith	78	80
1	Alex Hazel	80	75
2	Beatrice Dean	72	95
3	Jane White	75	70
4	Eve Lynn	73	82

Use the indexers to set values#

Recall that we have learned how to use the indexers to access a slice of a dataframe. We can use the same indexers to change the values in a dataframe.

For example, we can get the English grades in the first two rows and change the values to 80.

# Get the English grades in the first two rows and change to 80
grades.iloc[:2, 1] = 80

Can you use loc to change the math grades in the last two rows to 90?

# Change the math grades in the last two rows to 90

And of course, we can get a single value from a dataframe and change it using iat or at.

# Get Jane White's math grade and change it to 85
grades.at[3, 'Math'] = 85

It’s your turn. Can you get Eve Lynn’s English grade and change it to 82? This time, however, use iat.

# Get Eve Lynn's math grade and change it to 85

Create a new column based on an existing one#

Let’s get the world cup dataframe again. We add a new column Score to store the scores of the games. Can you set the values in this column to the goals that were scored by the champion in the games?

# Add a new column of score
score = ["7-5", ### put the data in a list
         "4-2", 
         "1-0", 
         "1-0", 
         "6-4", 
         "2-0", 
         "3-0", 
         "3-2", 
         "1-0", 
         "3-2"]
wcup['Score'] = score # make a new column of score
wcup

	Champion	Host	Score
Year
2022	Argentina	Qatar	7-5
2018	France	Russia	4-2
2014	Germany	Brazil	1-0
2010	Spain	South Africa	1-0
2006	Italy	Germany	6-4
2002	Brazil	Korea/Japan	2-0
1998	France	France	3-0
1994	Brazil	USA	3-2
1990	Germany	Italy	1-0
1986	Argentina	Mexico	3-2

In soccer games, it is common to calculate the goals scored and goals conceded by the champion in the final.

# create a new column 'Goals Scored'
wcup['Goals Scored'] = wcup['Score'].str[0]
wcup

	Champion	Host	Score	Goals Scored
Year
2022	Argentina	Qatar	7-5	7
2018	France	Russia	4-2	4
2014	Germany	Brazil	1-0	1
2010	Spain	South Africa	1-0	1
2006	Italy	Germany	6-4	6
2002	Brazil	Korea/Japan	2-0	2
1998	France	France	3-0	3
1994	Brazil	USA	3-2	3
1990	Germany	Italy	1-0	1
1986	Argentina	Mexico	3-2	3

# create a new column 'Goals Conceded'
wcup['Goals Conceded'] = wcup['Score'].str[-1]
wcup

	Champion	Host	Score	Goals Scored	Goals Conceded
Year
2022	Argentina	Qatar	7-5	7	5
2018	France	Russia	4-2	4	2
2014	Germany	Brazil	1-0	1	0
2010	Spain	South Africa	1-0	1	0
2006	Italy	Germany	6-4	6	4
2002	Brazil	Korea/Japan	2-0	2	0
1998	France	France	3-0	3	0
1994	Brazil	USA	3-2	3	2
1990	Germany	Italy	1-0	1	0
1986	Argentina	Mexico	3-2	3	2

With the info on goals scored and conceded by the champion, we can create a column containing the difference between the two.

# Create a new column 'Difference'
wcup['Difference'] = wcup['Goals Scored'] - wcup['Goals Conceded']

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /opt/homebrew/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:218, in _na_arithmetic_op(left, right, op, is_cmp)
    217 try:
--> 218     result = func(left, right)
    219 except TypeError:

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/computation/expressions.py:242, in evaluate(op, a, b, use_numexpr)
    240     if use_numexpr:
    241         # error: "None" not callable
--> 242         return _evaluate(op, op_str, a, b)  # type: ignore[misc]
    243 return _evaluate_standard(op, op_str, a, b)

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/computation/expressions.py:73, in _evaluate_standard(op, op_str, a, b)
     72     _store_test_result(False)
---> 73 return op(a, b)

TypeError: unsupported operand type(s) for -: 'str' and 'str'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[48], line 2
      1 # Create a new column 'Difference'
----> 2 wcup['Difference'] = wcup['Goals Scored'] - wcup['Goals Conceded']

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/arraylike.py:194, in OpsMixin.__sub__(self, other)
    192 @unpack_zerodim_and_defer("__sub__")
    193 def __sub__(self, other):
--> 194     return self._arith_method(other, operator.sub)

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/series.py:6146, in Series._arith_method(self, other, op)
   6144 def _arith_method(self, other, op):
   6145     self, other = self._align_for_op(other)
-> 6146     return base.IndexOpsMixin._arith_method(self, other, op)

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/base.py:1391, in IndexOpsMixin._arith_method(self, other, op)
   1388     rvalues = np.arange(rvalues.start, rvalues.stop, rvalues.step)
   1390 with np.errstate(all="ignore"):
-> 1391     result = ops.arithmetic_op(lvalues, rvalues, op)
   1393 return self._construct_result(result, name=res_name)

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:283, in arithmetic_op(left, right, op)
    279     _bool_arith_check(op, left, right)  # type: ignore[arg-type]
    281     # error: Argument 1 to "_na_arithmetic_op" has incompatible type
    282     # "Union[ExtensionArray, ndarray[Any, Any]]"; expected "ndarray[Any, Any]"
--> 283     res_values = _na_arithmetic_op(left, right, op)  # type: ignore[arg-type]
    285 return res_values

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:227, in _na_arithmetic_op(left, right, op, is_cmp)
    219 except TypeError:
    220     if not is_cmp and (
    221         left.dtype == object or getattr(right, "dtype", None) == object
    222     ):
   (...)    225         # Don't do this for comparisons, as that will handle complex numbers
    226         #  incorrectly, see GH#32047
--> 227         result = _masked_arith_op(left, right, op)
    228     else:
    229         raise

File /opt/homebrew/lib/python3.11/site-packages/pandas/core/ops/array_ops.py:163, in _masked_arith_op(x, y, op)
    161     # See GH#5284, GH#5035, GH#19448 for historical reference
    162     if mask.any():
--> 163         result[mask] = op(xrav[mask], yrav[mask])
    165 else:
    166     if not is_scalar(y):

TypeError: unsupported operand type(s) for -: 'str' and 'str'

We get an error! Why? The error message gives us a hint. The minus operator is not defined for the data type str!

Luckily, Pandas has a convenient method that allows us to convert data types.

# Create a new column 'Difference'
wcup['Difference'] = wcup['Goals Scored'].astype(int) -  wcup['Goals Conceded'].astype(int)
wcup

	Champion	Host	Score	Goals Scored	Goals Conceded	Difference
Year
2022	Argentina	Qatar	7-5	7	5	2
2018	France	Russia	4-2	4	2	2
2014	Germany	Brazil	1-0	1	0	1
2010	Spain	South Africa	1-0	1	0	1
2006	Italy	Germany	6-4	6	4	2
2002	Brazil	Korea/Japan	2-0	2	0	2
1998	France	France	3-0	3	0	3
1994	Brazil	USA	3-2	3	2	1
1990	Germany	Italy	1-0	1	0	1
1986	Argentina	Mexico	3-2	3	2	1

Coding Challenge! < / >

Can you create two new columns in the grades dataframe, one with the students’ first names, one with their last names?

# take a look at the grades df
grades

	Name	English	Math
0	John Smith	80	80
1	Alex Hazel	80	75
2	Beatrice Dean	72	95
3	Jane White	75	85
4	Eve Lynn	73	82

Lesson Complete#

Congratulations! You have completed Pandas Basics 1.

Start Next Lesson: Pandas 2 ->#

Exercise Solutions#

Here are a few solutions for exercises in this lesson.

# Make a dataframe to record English and Math grades of the Butterfly class
butterfly = pd.DataFrame({"Name": ['John Smith', 
                              'Alex Hazel', 
                              'Beatrice Dean', 
                              'Jane White', 
                              'Eve Lynn'],                        
                         "English": [78,
                                    80,
                                    72,
                                    75,
                                    73],
                         "Math": [80,
                                 75,
                                 95,
                                 70,
                                 82]
                         })
butterfly

### Get the math grades of the first three students and write the data into an excel file

# Download the csv file
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/Pandas1_grades.csv'
grades_file = '../data/Grades.csv'
urllib.request.urlretrieve(url, grades_file)

# Read in the data and create a dataframe
butterfly = pd.read_csv(grades_file)

# Select the Math grades for the first three students
butterfly_slice = butterfly.loc[:3, ['Math']]

# Write the slice into a file of .xlsx format
butterfly_slice.to_excel('../data/butterfly_slice.xlsx')

# Can you create two new columns in the grades dataframe 
# one with the students' first names, one with their last names
grades['First Name'] = grades['Name'].str.split().str[0]
grades['Last Name'] = grades['Name'].str.split().str[1]
grades

Pandas Basics 1

Contents

Pandas Basics 1#

Introduction#

Pandas Series and Pandas DataFrame#

Pandas Series#

Pandas DataFrame#

Coding Challenge! < / >

Explore the data#

Read and write tabular data in Pandas#

Access data#

The indexers .iloc and .loc#

.iloc#

.loc#

The indexers .iat and .at#

Coding Challenge! < / >

Set values#

Use the indexers to set values#

Create a new column based on an existing one#

Coding Challenge! < / >

Lesson Complete#

Start Next Lesson: Pandas 2 ->#

Exercise Solutions#

The indexers `.iloc` and `.loc`#

The indexers `.iat` and `.at`#