This notebook is created by Zhuo Chen based on the notebooks created by Nathan Kelber under Creative Commons CC BY License
For questions/comments/improvements, email zhuo.chen@ithaka.org or nathan.kelber@ithaka.org
Python Intermediate 5#
Description: This notebook describes:
What is a generator
How to write a generator comprehension
The advantages of using a generator
Use Case: For Learners (Detailed explanation, not ideal for researchers)
Difficulty: Intermediate
Completion Time: 60 minutes
Knowledge Required:
Python Basics Series (Start Python Basics 1)
Knowledge Recommended: None
Data Format: None
Libraries Used: None
Research Pipeline: None
What is a generator?#
A quick review of iterables in Python#
We learned in Python Intermediate 1 that any Python object whose members can be iterated over in a for loop is an iterable. Strings, lists, sets, and dictionaries are all iterables.
# Use a for loop to iterate over a list
ls = [1, 2, 3]
for num in ls:
    print(num)
1
2
3
# Use a for loop to iterate over a string
s = 'abc'
for char in s:
    print(char)
a
b
c
Iterator#
Python has a built-in function, iter(), which takes an iterable and returns an iterator. The iterator can then be used to step through the members of that iterable.
# Use the built-in iter function to create an iterator out of the list stored in ls
my_ls = iter(ls)
type(my_ls)
list_iterator
To access the values in the original list through this iterator, we use the next() function, which returns one value at a time.
# Use next() to get the first element from the list
next(my_ls)
1
# Use next() to get the second element from the list
next(my_ls)
2
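A for loop relies on these same two functions behind the scenes. The following is a minimal sketch of roughly what Python does when it loops over a list. The names letters, letters_iter, and collected are made up for this illustration. When an iterator has no values left, next() raises a StopIteration exception, which the loop uses as its signal to stop.

```python
# A rough, hand-written equivalent of:
#     for letter in letters:
#         collected.append(letter)
letters = ['a', 'b', 'c']
letters_iter = iter(letters)   # ask the iterable for an iterator
collected = []

while True:
    try:
        letter = next(letters_iter)   # fetch the next value
    except StopIteration:             # raised when no values are left
        break
    collected.append(letter)

print(collected)
```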
Generator#
A generator is a function that creates an iterator. Since an iterator yields one item at a time, we can define the simplest kind of generator using the following code.
# Define a very simple generator
def simple_gen():
    yield 1
    yield 2
    yield 3
You can use a for loop to iterate through the items in an iterator created by a generator. In this sense, an iterator is also an iterable in Python.
# Use a for loop to iterate through the items
# and print them out
for i in simple_gen():
    print(i)
1
2
3
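One way to check that an iterator is itself an iterable is to pass it to iter(). The sketch below redefines simple_gen so that it runs on its own; the variable names are illustrative. It also shows a consequence worth remembering: unlike a list, an iterator can only be looped over once.

```python
def simple_gen():
    yield 1
    yield 2
    yield 3

numbers_iter = simple_gen()

# Calling iter() on an iterator returns the very same object
same_object = iter(numbers_iter) is numbers_iter
print(same_object)    # True

# The first pass consumes every item, so the second pass finds nothing
first_pass = list(numbers_iter)
second_pass = list(numbers_iter)
print(first_pass)     # [1, 2, 3]
print(second_pass)    # []
```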
You can use the next() function to see that this simple generator actually yields one item at a time.
# Assign the iterator to a variable
gen = simple_gen()
# yield the first item
next(gen)
1
# yield the second item
next(gen)
2
# yield the third item
next(gen)
3
On the surface, generators look like ordinary functions, but they are actually very different. Let’s use a simple example to understand the difference.
# Create a Python function that takes a list of numbers
# and returns a new list in which each number is doubled
def two_times(ls):
    """Take a list of numbers and return a new list
    in which each number is doubled."""
    new_ls = []
    for n in ls:
        new_ls.append(2 * n)
    return new_ls
two_times([1, 2, 3])
[2, 4, 6]
If we feed a list of numbers to this function, we get a new list back. Most importantly, the entire new list is stored in memory.
We can also create a Python generator that gives us the same sequence of values. Note that a generator uses the yield statement.
# Create a Python generator
def gen(ls):
    """Take a list of numbers and yield, one at a time,
    each number doubled."""
    for n in ls:
        yield 2 * n
my_gen = gen([1, 2, 3])
Since a generator creates an iterator, the values will be yielded one at a time.
# Use next() to yield one element from the iterable at a time
next(my_gen)
2
# Use next() to yield one element from the iterable at a time
next(my_gen)
4
# Use next() to yield one element from the iterable at a time
next(my_gen)
6
The generator is exhausted once all of its items have been yielded. If we call next() again, Python raises a StopIteration exception.
# Use next() to yield one element from the iterable at a time
next(my_gen)
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
Cell In[17], line 2
1 # Use next() to yield one element from the iterable at a time
----> 2 next(my_gen)
StopIteration:
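If you would rather not trigger StopIteration, the built-in next() function accepts an optional second argument that is returned once the iterator is exhausted. The sketch below uses a small generator named doubled, which is made up for this illustration.

```python
def doubled(numbers):
    """Yield each number in `numbers` multiplied by two."""
    for n in numbers:
        yield 2 * n

doubled_gen = doubled([1, 2])

first = next(doubled_gen, None)    # 2
second = next(doubled_gen, None)   # 4
third = next(doubled_gen, None)    # None, because the generator is exhausted
print(first, second, third)
```

A default value is convenient when only one or two items are needed. To process every item, a for loop is usually simpler, since it handles StopIteration for you.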
Lists and generators#
A list stores all of its members. We can access any of its members via indexing. A generator, however, does not store any items. What it stores is the instructions for how to generate each of its members as well as the iteration state. For example, if a generator has generated its first member, it knows that it should generate its second member the next time.
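The difference can be seen in a short sketch. The generator function doubled and the variable names are illustrative only. Indexing works on the list but fails on the generator, and the generator picks up exactly where it left off after a call to next().

```python
def doubled(numbers):
    """Yield each number in `numbers` multiplied by two."""
    for n in numbers:
        yield 2 * n

numbers_list = [2, 4, 6]
numbers_gen = doubled([1, 2, 3])

# A list stores its members, so indexing works
print(numbers_list[1])          # 4

# A generator stores no members, so indexing raises a TypeError
indexing_failed = False
try:
    numbers_gen[1]
except TypeError:
    indexing_failed = True
print(indexing_failed)          # True

# The generator remembers its iteration state
next(numbers_gen)               # yields 2
remaining = list(numbers_gen)   # continues with the items after 2
print(remaining)                # [4, 6]
```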
The built-in generators#
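Python has some built-in functions that behave like generators: they return iterators that produce their items one at a time instead of building a whole list. You may not be aware of it, but you have already used one of them in the Python Basics series. It is the enumerate() function.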
# An example from Python basics 3
# which uses enumerate()
staff = ['Tara Richards',
         'John Smith',
         'Justin Douglas',
         'Lauren Marquez',
         'John Smith']
# Use the enumerate function
for index, name in enumerate(staff):
    if name == 'John Smith':
        print(index)
1
4
# Confirm that enumerate() yields one item at a time, like a generator
staff_gen = enumerate(staff)
# yield the first item
next(staff_gen)
(0, 'Tara Richards')
# yield the second item
next(staff_gen)
(1, 'John Smith')
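To see why enumerate() behaves this way, it can help to write a simplified version of it as a generator function. The function my_enumerate below is a hypothetical teaching sketch, not how Python actually implements enumerate().

```python
def my_enumerate(iterable, start=0):
    """Yield (index, item) pairs, similar to the built-in enumerate()."""
    index = start
    for item in iterable:
        yield index, item
        index += 1

staff = ['Tara Richards', 'John Smith', 'Justin Douglas']

# Collect the pairs into a list so that we can look at all of them
pairs = list(my_enumerate(staff))
print(pairs)
```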
Coding Challenge! < / >
Create a string and then create a generator that can take the string as input.
Generator comprehension#
Python provides a shorter way to define a generator: the generator comprehension.
Generator comprehensions have basically the same syntax as list comprehensions, except that they use parentheses () instead of hard brackets [].
Let’s first quickly review how to write a list comprehension.
# Create a list comprehension using hard brackets []
numbers = [5,6,7,8,9]
new_list = [num for num in numbers if num > 5]
print(new_list)
[6, 7, 8, 9]
Then, let’s create a generator which will generate the same sequence of values as the new list above, but only one at a time.
# Create a generator using parentheses
new_gen = (num for num in numbers if num > 5)
# Yield the values one at a time
next(new_gen)
6
next(new_gen)
7
next(new_gen)
8
next(new_gen)
9
Again, once all the items have been yielded, calling next() again makes Python raise a StopIteration exception.
# Yield the next generator output
next(new_gen)
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
Cell In[28], line 2
1 # Yield the next generator output
----> 2 next(new_gen)
StopIteration:
Recall that a list comprehension can create a list from any kind of iterable in Python. This is true for generator comprehensions as well. In the previous example, we created a generator based on a list. In the next code cell, let’s create a generator based on a dictionary using a generator comprehension.
# Create a generator based on a dictionary using
# generator comprehension
contacts = {
    'Amanda Bennett': 'Engineer, electrical',
    'Bryan Miller': 'Radiation protection practitioner',
    'Christopher Garrison': 'Planning and development surveyor',
    'Debra Allen': 'Intelligence analyst',
    'Donna Decker': 'Architect',
    'Heather Bullock': 'Media planner',
    'Jason Brown': 'Energy manager',
    'Jason Soto': 'Lighting technician, broadcasting/film/video',
    'Marissa Munoz': 'Further education lecturer',
    'Matthew Mccall': 'Chief Technology Officer',
    'Michael Norman': 'Translator',
    'Nicole Leblanc': 'Financial controller',
    'Noah Delgado': 'Engineer, land',
    'Rachel Charles': 'Physicist, medical',
    'Stephanie Petty': 'Architect'}
contact_gen = (name for name, occupation in contacts.items() if 'Engineer' in occupation)
# Yield the first item
next(contact_gen)
'Amanda Bennett'
# Yield the second item
next(contact_gen)
'Noah Delgado'
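Generator comprehensions are often handed straight to a function that consumes an iterable, such as sum() or any(). When the generator comprehension is the only argument, the extra pair of parentheses can be left out. The variable names in this short sketch are illustrative.

```python
numbers = [5, 6, 7, 8, 9]

# Add up the numbers greater than 5 without building a new list
total = sum(num for num in numbers if num > 5)
print(total)       # 30

# any() stops asking for items as soon as it finds a match
has_large = any(num > 8 for num in numbers)
print(has_large)   # True
```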
The advantages of generators#
A generator does not hold the entire result in memory; it yields one item at a time. Because it only ever has to produce one item at a time, using a generator can lead to significant savings in memory usage.
# Demonstrate the memory size difference of
# a list comprehension vs generator comprehension
# Import getsizeof which measures memory usage in bytes
from sys import getsizeof
list_comprehension = [i for i in range(10000)]
generator_comprehension = (i for i in range(10000))
# Print the size of the list comprehension
print('List comprehension memory usage: ', getsizeof(list_comprehension))
# Print the size of the generator comprehension
print('Generator comprehension memory usage: ', getsizeof(generator_comprehension))
List comprehension memory usage: 85176
Generator comprehension memory usage: 200
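A related point is that the size of a generator does not depend on how many items it will produce, while a list grows with its contents. The sketch below uses a helper named make_gen, which is made up for this illustration. Note that getsizeof() only measures the object itself, not the objects it refers to, and the exact byte counts vary between Python versions.

```python
from sys import getsizeof

def make_gen(n):
    """Return a generator comprehension that will produce n numbers."""
    return (i for i in range(n))

small_list = [i for i in range(10)]
big_list = [i for i in range(100000)]
small_gen = make_gen(10)
big_gen = make_gen(100000)

# The list gets bigger as it holds more items
print('Lists:     ', getsizeof(small_list), getsizeof(big_list))
# The generator stays the same size no matter how many items it will yield
print('Generators:', getsizeof(small_gen), getsizeof(big_gen))
```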
Since a generator occupies less memory, using a generator instead of a fully built iterable like a list can lead to a performance boost. This advantage is especially helpful when you have a really big dataset with hundreds of thousands or even millions of items to loop through.
# import the time module to calculate the processing time
import time
# Calculate the processing time in milliseconds when we create a list with 1m items
def ml(n):
    ls = []
    for i in range(n):
        ls.append(i)
    return ls
start = time.process_time()*1000
ml(1000000)
end = time.process_time()*1000
print(end - start)
31.51899999999989
# Calculate the processing time in milliseconds when we create a generator with 1m items
def ml_gen(n):
    for i in range(n):
        yield i
start = time.process_time()*1000
ml_gen(1000000)
end = time.process_time()*1000
print(end - start)
0.021999999999934516
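Keep in mind that calling a generator function only creates the generator; none of the items are produced until something asks for them. The sketch below separates the two steps so that the cost of each is visible. It uses time.perf_counter() as the timer, which is a choice made for this example, and the timings will differ from machine to machine.

```python
import time

def ml_gen(n):
    for i in range(n):
        yield i

start = time.perf_counter()
numbers_gen = ml_gen(1000000)    # creates the generator, produces no items yet
created = time.perf_counter()
total = sum(numbers_gen)         # the items are produced here, one at a time
consumed = time.perf_counter()

print('Create the generator (ms): ', (created - start) * 1000)
print('Consume the generator (ms):', (consumed - created) * 1000)
```

The work of producing the items is not avoided, only postponed. The saving comes from never having to keep all of the items in memory at once, and from being able to stop early when only some of them are needed.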
Using a generator makes sense in scenarios where loading an entire list, dictionary, or set could fill all available memory. This could be because each item is large, the list is large, or both.
If you want to take one item at a time, do a lot of calculations based on that item, and then move on to the next item, then use a generator.
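As a small illustration of this pattern, the sketch below processes text one line at a time. The io.StringIO object is a stand-in for a large file so that the example runs on its own, and the function name line_lengths is made up for this illustration. A file object returned by open() can be looped over in the same way.

```python
import io

# Stand-in for a large text file
fake_file = io.StringIO("first line\nsecond line is longer\nthird\n")

def line_lengths(lines):
    """Yield the length of each line, one line at a time."""
    for line in lines:
        yield len(line.rstrip('\n'))

# max() pulls one length at a time, so the lines are never all held in a list
longest = max(line_lengths(fake_file))
print(longest)
```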
Coding Challenge! < / >
Create a generator object using a generator comprehension
Print out every value in the generator
Use try and except in your code to prevent the program from crashing after the generator is exhausted
For a quick refresher on try and except, you can refer to Python Basics 2.
# Create a generator using a generator comprehension
An example of a generator from Constellate#
In Constellate, when you build a dataset and use the Constellate client to download the dataset, you will be working with a generator.
# import modules and libraries
import constellate
from pathlib import Path
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"
# Check to see if a dataset file exists
# If not, download a dataset using the Constellate Client
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_file = Path.cwd() / '..' / 'data' / 'Shakespeare' # Make sure this filepath matches your dataset filename
if not dataset_file.exists():
    try:
        dataset_file = constellate.download(dataset_id, 'jsonl', 'Shakespeare')
    except Exception:
        dataset_file = constellate.get_dataset(dataset_id)
# Read in the data
dataset = constellate.dataset_reader(dataset_file)
# Check the type of 'dataset'
type(dataset)
# Get the first document using next()
next(dataset)
We have in total 6745 documents in the dataset. Quite a lot!
# Calculate the processing time of the generator in milliseconds
start = time.process_time() * 1000
dataset = constellate.dataset_reader(dataset_file)
end = time.process_time() * 1000
print(end - start)
# Calculate the processing time of the list with the same items in milliseconds
start = time.process_time() * 1000
dataset = list(constellate.dataset_reader(dataset_file))
end = time.process_time() * 1000
print(end - start)
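When only a few documents are needed, a generator can be sampled without converting it into a list. The sketch below uses itertools.islice from the standard library. The function fake_dataset_reader is a hypothetical stand-in that yields small dictionaries; it is not part of the Constellate client and only imitates the one-document-at-a-time behavior so that the example runs without downloading anything.

```python
from itertools import islice

def fake_dataset_reader(n_documents):
    """Hypothetical stand-in for a dataset reader: yields one dict per document."""
    for i in range(n_documents):
        yield {'id': i, 'title': f'Document {i}'}

documents = fake_dataset_reader(6745)

# Take only the first three documents; the rest are never produced
first_three = list(islice(documents, 3))
first_ids = [doc['id'] for doc in first_three]
print(first_ids)          # [0, 1, 2]

# The generator continues from where islice stopped
next_document = next(documents)
print(next_document['id'])   # 3
```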
Lesson Complete#
Congratulations! You have completed Python Intermediate 5.
Exercise Solutions#
Here are a few solutions for exercises in this lesson.
# Pick an iterable of your choice and write a generator which takes the iterable as its input
w = "generator"
def gen(w):
    for letter in w:
        yield letter.upper()
w_gen = gen(w)
w
'generator'
# Create a generator using a generator comprehension
gen = (number for number in range(30))
# Print the rest of the values using a loop
while True:
    try:
        print(next(gen))
    except StopIteration:
        print('Generator exhausted')
        break
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
Generator exhausted