Created by Nathan Kelber under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.


Regular Expressions#

Description: This lesson introduces the re module for analyzing strings with regular expressions. Students will be able to:

  • Create regular expressions

  • Create a Regex object with re.compile()

  • Use the findall() method to return a list of matches and the finditer() method to return match objects

  • Return strings of the actual matched text

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Beginner

Completion Time: 90 minutes

Knowledge Required:

Knowledge Recommended: None

Data Format: None

Libraries Used: re

Research Pipeline: None


Introduction#

Regular expressions can be used to locate particular characters or sequences of characters in a string. For example, a regular expression could be written to identify phone numbers, email addresses, or particular names. Far beyond simply matching a known string, regular expressions can be written to find complex patterns in a text. They are often useful when the documents being searched are very long.

Regular expressions can be used in Python, but also in many other applications such as other programming languages, word processing software (Microsoft Word, Google Docs), or email. Crafting the right regular expression can be very difficult, but can often save hours of labor for many menial tasks. When crafting a regular expression, it can be very helpful to use a tool like RegExr that demonstrates how expressions are being matched on a few sample texts as you type them. The tailored expression could then be implemented in a fuller solution with Python.

Practicing with regexr#

We could practice our regular expressions with Python, but it will be easier, faster, and more interpretable to use a dedicated regular expression tool like regexr.com. We can test our expression on sample data before applying it with Python to a full dataset.

Metacharacters#

When executing a search pattern, regular expressions make use of special metacharacters. If you would like to search for one of these characters in your text, you will need to add a backslash (\) before the character:

. ^ $ * + ? { } [ ] \ | ( )

Copy and paste all of these characters into regexr, and then try matching on some of them.
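As a minimal sketch of why the backslash matters, the Python snippet below (using the re module introduced later in this lesson) compares an unescaped + with an escaped one. The sample string is made up for illustration, and re.escape() is a standard-library helper that this lesson does not otherwise cover.

```python
import re

sample = '1+1=2 and 11'  # hypothetical sample text

# Unescaped: + is a quantifier meaning "one or more of the preceding 1"
unescaped = re.findall(r'1+1', sample)

# Escaped: \+ matches a literal plus sign
escaped = re.findall(r'1\+1', sample)

# re.escape() adds the backslashes when the whole string should be literal
auto_escaped = re.findall(re.escape('1+1'), sample)

print(unescaped)     # ['11']
print(escaped)       # ['1+1']
print(auto_escaped)  # ['1+1']
```

The unescaped version silently matches something else, which is why testing on sample text first is helpful.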


Try it! < / >

Add a dollar amount, such as $5.00, to the sample text in regexr. Can you write a regular expression to match it?


Character Classes#

Using the metacharacters in our search pattern will allow us to search particular classes of characters.

Expression    Matches
.             Any character except a new line \n
\d            A digit (0-9)
\D            Not a digit
\w            Word character (a-z, A-Z, 0-9, _)
\W            Not a word character
\s            Whitespace (space, tab, new line)
\S            Not whitespace
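The character classes can also be tried out in Python. The following is a small sketch on a made-up sample string (not the lesson's contact list), using re.findall(), which is covered later in this lesson.

```python
import re

sample = 'Room 42, Floor 3'  # hypothetical sample text

digits = re.findall(r'\d', sample)       # ['4', '2', '3']
non_word = re.findall(r'\W', sample)     # [' ', ',', ' ', ' ']
words = re.findall(r'\w+', sample)       # ['Room', '42', 'Floor', '3']

# The dot skips new lines, but \W (and \s) match them
two_lines = 'a\nb'
dot = re.findall(r'.', two_lines)        # ['a', 'b']
not_word = re.findall(r'\W', two_lines)  # ['\n']

print(digits, non_word, words, dot, not_word)
```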

Try each of these with regexr using the following sample text:


Mr. alex arvison
work+arvison0@aol.com
323-423-4353

Mrs Dara Batha
d.batha1@bright.edu
102.343.3784

Ms T Lamcken
tlamcken-2@usda.gov
444|343|4387

Ms. M. Picardo
mpicardo_7@simplemachines.org
439|963|6284

Coding Challenge! < / >

Write a regular expression that will match all the phone numbers.


Anchors#

An anchor helps search particular text areas, such as string beginnings, string endings, or word boundaries.

Expression    Matches
\b            Word boundary
\B            Not a word boundary
^             Beginning of string
$             End of string
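To see how anchors restrict where a match may start, here is a short Python sketch. The sample string and the starts() helper are hypothetical and exist only for this illustration.

```python
import re

sample = 'cat concat category'  # hypothetical sample text

def starts(pattern, text):
    """Return the starting index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

print(starts(r'\bcat', sample))    # [0, 11]  cat at the start of a word
print(starts(r'\Bcat', sample))    # [7]      cat inside a word
print(starts(r'\bcat\b', sample))  # [0]      cat as a whole word
print(starts(r'^cat', sample))     # [0]      cat at the start of the string
print(starts(r'ory$', sample))     # [16]     ory at the end of the string
```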


Try it! < / >

Use the word boundary \b and not word boundary \B to search for the string ar. What is the difference in the results?


Character sets#

We can define a set of potential characters to match by putting them in brackets [].

Expression    Matches
[ ]           Characters in brackets
[^ ]          Characters not in brackets

We can specify exact characters to match:

  • [.,-] Match a period, comma, or dash

  • [rs] Match the lowercase letter r or s

  • [^t] Match any character that is not lowercase t

or we can specify a range to match, such as:

  • [A-Z] Match any capital letter, from A to Z

  • [A-F] Match any capital letter, from A to F

  • [a-z] Match any lowercase letter, a to z

  • [A-Fa-f] Match any letter, regardless of case, from A to F

  • [0-3] Match any digit, from 0 to 3
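Character sets behave the same way in Python. The sketch below uses made-up strings to show a set built from several ranges, a negated set, and how a range follows character-code order.

```python
import re

sample = 'Colors: #FF8800 and #00ccff'  # hypothetical sample text

# A set with three ranges: hexadecimal digits of either case
hex_codes = re.findall(r'#[A-Fa-f0-9]+', sample)
print(hex_codes)    # ['#FF8800', '#00ccff']

# A negated set: any character that is not a vowel or a space
consonants = re.findall(r'[^aeiou ]', 'set of')
print(consonants)   # ['s', 't', 'f']

# Ranges follow character-code order, so [A-f] also covers
# Z and the underscore, which sit between F and a
wide_range = re.findall(r'[A-f]', 'Z_a')
print(wide_range)   # ['Z', '_', 'a']
```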


Coding Challenge! < / >

Write a regular expression that will match the first two phone numbers with dashes or periods but not the final phone numbers with vertical pipes |.

Mr. alex arvison
work+arvison0@aol.com
323-423-4353

Mrs Dara Batha
d.batha1@bright.edu
102.343.3784

Ms T Lamcken
tlamcken-2@usda.gov
444|343|4387

Ms. M. Picardo
mpicardo_7@simplemachines.org
439|963|6284

Quantifiers#

Quantifiers specify how many times the preceding character, set, or group may repeat.

Expression    Matches
*             0 or more
+             1 or more
?             0 or 1
{4}           Exact number
{3,6}         Minimum to maximum range

For example, we could match phone numbers by using the following expression, where each unescaped . matches any single separator character:

\d\d\d.\d\d\d.\d\d\d\d

or we could write this with a quantifier as:

\d{3}.\d{3}.\d{4}
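Each quantifier in the table can be checked with a one-line test in Python. This is a sketch on made-up strings; the variable names are only for illustration.

```python
import re

# ? makes the preceding character optional
optional = re.findall(r'colou?r', 'color colour')

# + needs at least one b, while * also accepts none
one_or_more = re.findall(r'ab+', 'a ab abb')
zero_or_more = re.findall(r'ab*', 'a ab abb')

# {2,3} accepts two or three digits; {4} accepts exactly four
two_to_three = re.findall(r'\b\d{2,3}\b', '7 42 365 2024')
exactly_four = re.findall(r'\b\d{4}\b', '7 42 365 2024')

print(optional)      # ['color', 'colour']
print(one_or_more)   # ['ab', 'abb']
print(zero_or_more)  # ['a', 'ab', 'abb']
print(two_to_three)  # ['42', '365']
print(exactly_four)  # ['2024']
```

The word boundaries \b keep {2,3} from matching the first digits of a longer number.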


Coding Challenge! < / >

Write a regular expression that will match the full names for each person.

Mr. alex arvison
work+arvison0@aol.com
323-423-4353

Mrs Dara Batha
d.batha1@bright.edu
102.343.3784

Ms T Lamcken
tlamcken-2@usda.gov
444|343|4387

Ms. M. Picardo
mpicardo_7@simplemachines.org
439|963|6284

Groups#

We can also use groups to specify sections of a regular expression. This can be handy if we want to return parts of the regular expression in chunks or if we want to specify a set of possibilities for a particular character or set of characters.

Expression    Matches
(A|B|C)       Capital A or capital B or capital C

In our example above, we could use a group to match a variety of honorifics.

(Mrs|Ms|Mr|Mx|Dr)\.? would match a variety of honorifics with or without a trailing period.
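A brief Python sketch of the honorific group, with the period escaped so that it matches only a literal period. The sample sentence is made up, and the finditer() and group() methods are explained later in this lesson.

```python
import re

sample = 'Dr. Lee met Mx Rivera and Mrs. Chen'  # hypothetical sample text

honorific = re.compile(r'(Mrs|Ms|Mr|Mx|Dr)\.?')

# group(0) is the whole match; group(1) is only the text inside the parentheses
full = [m.group(0) for m in honorific.finditer(sample)]
inner = [m.group(1) for m in honorific.finditer(sample)]
print(full)    # ['Dr.', 'Mx', 'Mrs.']
print(inner)   # ['Dr', 'Mx', 'Mrs']

# Alternatives are tried left to right, so Mrs should come before Mr
short_first = re.match(r'(Mr|Mrs)', 'Mrs').group(0)
long_first = re.match(r'(Mrs|Mr)', 'Mrs').group(0)
print(short_first, long_first)   # Mr Mrs
```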


Coding Challenge! < / >

Write a regular expression that will match the full email addresses.

Mr. alex arvison
work+arvison0@aol.com
323-423-4353

Mrs Dara Batha
d.batha1@bright.edu
102.343.3784

Ms T Lamcken
tlamcken-2@usda.gov
444|343|4387

Ms. M. Picardo
mpicardo_7@simplemachines.org
439|963|6284

Using the re module#

The re module offers a great deal of flexibility in working with regular expressions. The workflow for using re generally follows this format:

  1. Import the re module and put the text being searched into a string

  2. Create a Regex object with re.compile()

  3. Pass the string into the compiled Regex object using a method such as:

    • .findall()

    • .finditer()

  4. Return the matches

Let’s examine these steps in a little more detail.

Import the re module and put the search text into a string#

Import the re module with import re

Create a variable containing the string object to be searched. This could be loaded from a file, such as a text, CSV, or JSON file. (For information on loading data from a file in Python, see Python Intermediate 2.)

Create the Regex object with re.compile()#

Compiling the Regex Object establishes the pattern to search for. This is where we add in the regular expression string that we crafted using regexr. Regular expression strings often contain backslash (\) characters, which Python would otherwise treat as the start of an escape sequence. For this reason, it is usually a good practice to use a raw string for passing your regular expression into re.compile(). A raw string starts with an r prefix and treats backslashes literally, so an escape sequence such as the new line character \n is not converted.

# A demonstration of a regular string with an escape character
string = 'Regular string: \n A new line is created. \n'
print(string)

# A demonstration of a raw string where the escape character is ignored
raw_string = r'Raw string: \n The new line escape character is ignored.'
print(raw_string)
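The difference matters for regular expressions because \b means "backspace" in a regular Python string but "word boundary" in a regular expression. A small sketch with a made-up sentence:

```python
import re

# In a regular string, '\b' is a single backspace character;
# in a raw string, r'\b' stays as a backslash followed by b
lengths = (len('\b'), len(r'\b'))
print(lengths)   # (1, 2)

# Only the raw string reaches the regex engine as a word boundary
plain = re.findall('\bcat\b', 'a cat sat')
raw = re.findall(r'\bcat\b', 'a cat sat')
print(plain)     # []
print(raw)       # ['cat']

# Without the r prefix, each backslash would need to be doubled
same = '\\bcat\\b' == r'\bcat\b'
print(same)      # True
```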

Technically, it is not always necessary to use re.compile() to create a Regex Object, since the re module also offers functions such as re.findall() that accept the pattern directly. Compiling once and reusing the object can be more efficient when the same expression is used many times. On small documents, the difference is insignificant, but it is a good practice for larger searches.
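As a sketch of the two approaches, the snippet below runs the same pattern through the module-level re.findall() function and through a compiled object. The sample string is a shortened, made-up version of the lesson data, and no timing is measured here.

```python
import re

sample = '323-423-4353 and 102.343.3784'  # hypothetical sample text

# Module-level function: the pattern is passed on every call
direct = re.findall(r'\d{3}[-.]\d{3}[-.]\d{4}', sample)

# Compiled object: the pattern is stored once and reused
phone = re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}')
compiled = phone.findall(sample)

print(direct == compiled)   # True
print(compiled)             # ['323-423-4353', '102.343.3784']
```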

Pass the string to be searched into the Regex Object#

The Regex Object in the last step established the pattern for the search. In this step, we pass the string to be searched into one of the Regex Object's methods.

A compiled Regex Object offers a variety of methods including:

  • .findall() Return all non-overlapping pattern matches as a list of strings or tuples. Will return match groups if the pattern contains groups.

  • .finditer() Return an iterator that yields match objects over all non-overlapping matches.

Additional methods are documented in the official Python re documentation.

Return the Matches#

The final step is to return the matches for the search. The process and outputs are slightly different for the .findall() and .finditer() methods.

A basic example with .findall()#

# Import the re module
import re

# The text to search
text = '''
Mr. alex arvison
work+arvison0@aol.com
323-423-4353

Mrs Dara Batha
d.batha1@bright.edu
102.343.3784

Ms T Lamcken
tlamcken-2@usda.gov
444|343|4387

Ms. M. Picardo
mpicardo_7@simplemachines.org
439|963|6284
'''
# Compile a Regex Object
# Search for phone numbers with any separator character
pattern = re.compile(r'\d{3}.\d{3}.\d{4}')
# Use the `.findall()` method to gather all the matches into a list
matches = pattern.findall(text)
# Print the list of matches
print(matches)

If the expression passed into re.compile() contains no groups, then the output will be a list of matching strings. If the expression contains one group, the output will be a list of strings holding only that group's text. If it contains more than one group, the output will be a list of tuples, with one item per group.

# Grouping by Honorific, First Name, Last Name
pattern = re.compile(r'(M[rs]+\.?)\s(\w+\.?)\s(\w+)')

matches = pattern.findall(text)
print(matches)
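The effect of groups on the output of .findall() is easiest to see on a tiny example. The following sketch uses a made-up two-item string:

```python
import re

sample = 'a1 b2'  # hypothetical sample text

no_groups = re.findall(r'[a-z]\d', sample)        # ['a1', 'b2']
one_group = re.findall(r'([a-z])\d', sample)      # ['a', 'b']
two_groups = re.findall(r'([a-z])(\d)', sample)   # [('a', '1'), ('b', '2')]

# Tuples can be unpacked while looping over the results
for letter, digit in two_groups:
    print(letter, digit)
```

When the full match is needed alongside the groups, .finditer() is usually the more convenient method.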

A basic example with .finditer()#

# Compile a Regex Object
# Search for an honorific, a first name or initial, and a last name
pattern = re.compile(r'(M[rs]+\.?)\s(\w+\.?)\s(\w+)')

# Use the `.finditer()` method to gather all the matches
# into an iterator of match objects.
matches = pattern.finditer(text)

# Iterate over the matches and print them out
for match in matches:
    print(match)

When using the .finditer() method, each match object contains two important pieces of information:

span
The starting and ending index number for the match within the searched string.

match
The actual characters from the string which fulfilled the Regex Object match.

# Verifying the index number slice for the first match
first_match = next(pattern.finditer(text))
start, end = first_match.span()
print(text[start:end])

When using finditer(), the groups within a match can be referenced using the .group() method.

  • .group(0) returns the full match

  • .group(1) returns the first group

  • .group(2) returns the second group

# Compile a Regex Object
# Search for an honorific, a first name or initial, and a last name
pattern = re.compile(r'(M[rs]+\.?)\s(\w+\.?)\s(\w+)')

# Use the `.finditer()` method to gather all the matches
# into an iterator of match objects.
matches = pattern.finditer(text)

# Iterate over the matches and print them out
for match in matches:
    print(match.group(0))
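The span and the groups can be read from the same match object. Here is a small sketch on a made-up string, showing that slicing the text with the span gives the same result as .group(0):

```python
import re

sample = 'From 2021-03 to 2022-11'  # hypothetical sample text
year_month = re.compile(r'(\d{4})-(\d{2})')

rows = []
for match in year_month.finditer(sample):
    start, end = match.span()
    # Full match, first group, second group, span, and the sliced text
    rows.append((match.group(0), match.group(1), match.group(2),
                 match.span(), sample[start:end]))

for row in rows:
    print(row)
# ('2021-03', '2021', '03', (5, 12), '2021-03')
# ('2022-11', '2022', '11', (16, 23), '2022-11')
```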

re.compile flags for verbose patterns and ignoring case#

The re.compile() method can accept flags. Two of the most useful are:

  • re.VERBOSE to allow commenting within regular expressions

  • re.I to ignore case when matching

Verbose mode flag#

Passing the re.VERBOSE flag as a second argument into re.compile() will allow you to include comments inside your regular expression, similar to a comment in Python. Any text after a # will be ignored for the purposes of matching, as will spaces and line breaks in the pattern. This can be very useful for documenting and explaining complex regular expressions.

Our name matching:

pattern = re.compile(r'M[rs]+\.?\s\w+\.?\s\w+')

Using a multi-line string with triple quotes allows for comments that break the expression into chunks while also offering room for explanation.

pattern = re.compile(r'''(
    (M[rs]+\.?)    # Honorific
    \s             # Space
    (\w+\.?)       # First name or initial
    \s             # Space
    (\w+)          # Last name
    )''', re.VERBOSE)

# Compile a Regex Object using the Verbose flag
pattern = re.compile(r'''(
    (M[rs]+\.?)    # Honorific
    \s             # Space
    (\w+\.?)       # First name or initial
    \s             # Space
    (\w+)          # Last name
    )''', re.VERBOSE)

# Use the `.finditer()` method to gather all the matches
# into an iterable object. This is not a list.
matches = pattern.finditer(text)

# Iterate over the matches and print them out
for match in matches:
    print(match)
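One consequence of verbose mode is that a literal space or # in the pattern has to be escaped. The sketch below illustrates this with made-up strings; the variable names are only for illustration.

```python
import re

# With re.VERBOSE, a plain space in the pattern is ignored
loose = re.compile(r'New York', re.VERBOSE)
print(loose.findall('New York NewYork'))    # ['NewYork']

# Escape the space (or use \s) to match it
strict = re.compile(r'New\ York', re.VERBOSE)
print(strict.findall('New York NewYork'))   # ['New York']

# A literal # must also be escaped, or it starts a comment
tag = re.compile(r'\#\w+  # a hashtag', re.VERBOSE)
print(tag.findall('#regex and #python'))    # ['#regex', '#python']
```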

Ignore case flag#

Passing the re.I flag as a second argument into re.compile() will ignore the case of matches.

# Import the re module
import re

# The string to search
text = 'Constellate CONSTELLATE constellate COnsTeLlate'

# Compile a Regex Object
# Search for the word constellate
pattern = re.compile(r'constellate', re.I)

# Use the `.finditer()` method to gather all the matches
matches = pattern.finditer(text)

# Iterate through the matches and print each one
for match in matches:
    print(match)
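Flags can also be combined with the | operator. As a minimal sketch, the following pattern is both verbose and case-insensitive; splitting the word into two commented chunks is done only to show that both flags are active.

```python
import re

sample = 'Constellate CONSTELLATE constellate'  # shortened sample text

both_flags = re.compile(r'''
    constel    # first part of the word
    late       # second part of the word
    ''', re.VERBOSE | re.I)

found = both_flags.findall(sample)
print(found)   # ['Constellate', 'CONSTELLATE', 'constellate']
```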

Research example#


Coding Challenge! < / >

Write a regular expression that will match all the stage directions in a TEI XML file. The stage directions are marked with opening <stage> and closing </stage> tags.


import urllib.request
from pathlib import Path

# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)

# Download a sample text file
# This TEI XML playtext comes from Oxford University and
# The Bodleian First Folio (https://firstfolio.bodleian.ox.ac.uk/downloads.html#pdfs)
urllib.request.urlretrieve('https://firstfolio.bodleian.ox.ac.uk/download/xml/F-rom.xml', '../data/romeoandjuliet.xml')
# Import the re module
import re

# A text to search
with open('../data/romeoandjuliet.xml', 'r') as f:
    text = f.read()

# Compile a Regex Object
# Search for the stage directions
pattern = re.compile() # insert a regex pattern here to match the stage tags

# Use the `.finditer()` method to gather all the matches
matches = pattern.finditer(text)

# Iterate through the matches and print each one
for match in matches:
    print(match)

Lesson Complete#

Congratulations on completing the Constellate course in regular expressions!

Coding Solutions#

Here are a few solutions for exercises in this lesson. Many other solutions are possible!

Mr. alex arvison
work+arvison0@aol.com
323-423-4353

Mrs Dara Batha
d.batha1@bright.edu
102.343.3784

Ms T Lamcken
tlamcken-2@usda.gov
444|343|4387

Ms. M. Picardo
mpicardo_7@simplemachines.org
439|963|6284

Match all phone numbers#

\d\d\d.\d\d\d.\d\d\d\d

Match first two phone numbers#

\d\d\d[-.]\d\d\d[-.]\d\d\d\d

Match the full name for every person#

M[rs]+\.?\s\w+\.?\s\w+

Match the full email address for every person#

[A-Za-z0-9_+.-]+@[A-Za-z0-9_+.-]+\.(com|edu|gov|org)

[A-Za-z0-9_+.-]+@[A-Za-z0-9-]+\.[A-Za-z0-9.-]+
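When the first email expression is used in Python, the group around the domain ending changes what .findall() returns. The sketch below runs it on the first two entries of the sample text, with the hyphen placed last in each character set so that it is read as a literal hyphen rather than as a range.

```python
import re

sample = '''
Mr. alex arvison
work+arvison0@aol.com
323-423-4353

Mrs Dara Batha
d.batha1@bright.edu
102.343.3784
'''

email = re.compile(r'[A-Za-z0-9_+.-]+@[A-Za-z0-9_+.-]+\.(com|edu|gov|org)')

# findall() returns only the text captured by the group
endings = email.findall(sample)
print(endings)     # ['com', 'edu']

# group(0) from finditer() returns the whole address
addresses = [m.group(0) for m in email.finditer(sample)]
print(addresses)   # ['work+arvison0@aol.com', 'd.batha1@bright.edu']
```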

Stage tags#

# Import the re module
import re

# A text to search
with open('../data/romeoandjuliet.xml', 'r') as f:
    text = f.read()

# Compile a Regex Object
# Search for stage directions and capture the text between the tags
pattern = re.compile(r'<stage.*?>(.*?)</stage>', re.I)

# Use the `.finditer()` method to gather all the matches
matches = pattern.finditer(text)

# Iterate through the matches and print each one
for match in matches:
    print(match.group(1))
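The solution relies on the non-greedy quantifier .*? to stop at the first closing tag. The sketch below shows the difference on a made-up snippet rather than the downloaded file. Whether the real file contains stage directions that span several lines is an assumption; if it does, the re.DOTALL flag lets the dot match new lines as well.

```python
import re

sample = '<stage>Enter Romeo.</stage> text <stage>Exit.</stage>'  # made-up snippet

# Greedy .* runs to the last closing tag; non-greedy .*? stops at the first
greedy = re.findall(r'<stage>(.*)</stage>', sample)
lazy = re.findall(r'<stage>(.*?)</stage>', sample)
print(greedy)   # ['Enter Romeo.</stage> text <stage>Exit.']
print(lazy)     # ['Enter Romeo.', 'Exit.']

# The dot does not cross new lines unless re.DOTALL is set
multiline = '<stage>Enter\nJuliet.</stage>'
without_flag = re.findall(r'<stage>(.*?)</stage>', multiline)
with_flag = re.findall(r'<stage>(.*?)</stage>', multiline, re.DOTALL)
print(without_flag)   # []
print(with_flag)      # ['Enter\nJuliet.']
```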