<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
____


# Tokenizers

**Description:**
This notebook focuses on the basic concepts surrounding tokenization. It includes material on the following concepts:

* Word segmentation
* n-grams
* Stemming
* Lemmatization
* Tokenizers

**Knowledge Required:** 
* Python Basics ([Start Python Basics 1](../../PythonForDataAnalysis/GettingStarted/basic/python-basics-1.ipynb))

**Knowledge Recommended:** 
* [Python Intermediate 2](../../PythonForDataAnalysis/GettingStarted/intermediate/python-intermediate-2.ipynb)

___

## What is a word?

The concept of a word makes intuitive sense in everyday language, but it starts to break down when we try to formalize it for analysis with computer programs. Linguists have spent decades creating formal rules for breaking down texts into smaller parts for analysis, dealing in great detail with the normally unspoken rules of grammar. In this lesson, we consider what a word is and how we could write a program for collecting the words within a text.

Let's take a look at an example sentence:

> Now that summer's here, we're going to visit the beach at Lake Michigan and eat ice cream.

How many words are in this sentence? We could start by simply looking at words that are separated by spaces. 

> Now, that, summer's, here, we're, going, to, visit, the, beach, at, Lake, Michigan, and, eat, ice, cream.

That would give us 17 words. But we could ask a few questions about this count. For example, is 'Lake Michigan' one word or two words? Certainly, 'lake' and 'Michigan' have their own individual meanings, but 'Lake Michigan' has a different meaning from either of those words individually. Similarly, what about 'ice cream'?

What about contractions? Is 'we're' a single word or two words: 'we' and 'are'? If our goal is to count how many times a given word occurs in the sentence, does 'we' occur in the sentence? Does the word 'summer' occur in our sentence?
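To make these questions concrete, here is a minimal sketch that splits the example sentence on whitespace and checks which strings actually appear as tokens. The variable names are illustrative only.

```python
# Split the example sentence on whitespace and inspect the tokens
sentence = ("Now that summer's here, we're going to visit the beach "
            "at Lake Michigan and eat ice cream.")
sentence_tokens = sentence.split()

print(len(sentence_tokens))          # number of whitespace-separated tokens
print('we' in sentence_tokens)       # False: only "we're" is a token
print('summer' in sentence_tokens)   # False: only "summer's" is a token
print('cream' in sentence_tokens)    # False: the token is 'cream.' with its period
```

Whether 'we' and 'summer' *should* count as occurring is a research decision; a whitespace split simply answers no.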

Verb conjugations pose yet another problem. Should the word 'going' be counted separately from 'go'? What about 'went'? From a computational linguistics perspective, we could 'stem' words, simply lopping off the 'ing' from 'going' to get 'go'. But that would pose some serious programming challenges for words like 'running', where the base form is 'run' instead of 'runn'. And we might run into issues with words like 'sing' and 'singing': the 'ing' should not be removed at all from the former, and only once from the latter. How could we distinguish between words that are conjugated, like 'sings', and words that are plural, like 'wings'? Sometimes an -s ending is plural (fens) and other times it is not (lens).
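The pitfalls of suffix stripping are easy to reproduce. The function below is a deliberately naive sketch written for this illustration (`naive_stem` is a hypothetical name, not part of any library); it removes a trailing 'ing' and does nothing else.

```python
# A deliberately naive stemmer: strip a trailing 'ing' and nothing else
def naive_stem(word):
    if word.endswith('ing'):
        return word[:-3]
    return word

for word in ['going', 'running', 'singing', 'sing']:
    print(word, '->', naive_stem(word))
```

The rule handles 'going' and 'singing' but turns 'running' into 'runn' and 'sing' into 's', which is why practical stemmers rely on many more rules and exceptions.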

## Tokenization

Tokenization, or segmenting a text into word chunks, is the first part of a Natural Language Processing pipeline. Tokens can be sentences, words, or sub-word chunks. The tokenization process involves many practical decisions, which has led to many different methods and a variety of available tokenizers. A tokenizer takes a text as input and automatically generates tokens as output.

In the case of tokenizing words, this is traditionally done by splitting on whitespace and punctuation. (There are more advanced tokenization methods for language models such as BERT and GPT. These include Byte-Pair Encoding, WordPiece, and SentencePiece.) We will look at a few examples of traditional tokenizers with a goal of gathering tokens into one-, two-, and three-word constructions. The general name for these is n-grams.

An n-gram is a sequence of n items from a given sample of text or speech. Most often, this refers to a sequence of words, but it can also be used to analyze text at the level of syllables, letters, or phonemes. N-grams are often described by their length. For example, word n-grams might include:

* stock (a 1-gram, or unigram)
* vegetable stock (a 2-gram, or bigram)
* homemade vegetable stock (a 3-gram, or trigram)

A text analysis approach that looks only at unigrams would not be able to differentiate between the "stock" in "stock market" and "vegetable stock." By including bigrams and trigrams in our analysis, we are able to look at concepts that extend across multiple words. One of the most popular examples of text analysis with n-grams is the [Google N-Gram Viewer](https://books.google.com/ngrams).
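As a small illustration of the difference, the sketch below builds bigrams and trigrams from a made-up phrase by zipping a token list with shifted copies of itself. The phrase and variable names are assumptions chosen for this example.

```python
# Build word n-grams by zipping a token list with shifted copies of itself
example_tokens = 'homemade vegetable stock beats the stock market'.split()

example_bigrams = [' '.join(pair)
                   for pair in zip(example_tokens, example_tokens[1:])]
example_trigrams = [' '.join(triple)
                    for triple in zip(example_tokens,
                                      example_tokens[1:],
                                      example_tokens[2:])]

print(example_tokens.count('stock'))  # the unigram 'stock' appears twice
print(example_bigrams)                # 'vegetable stock' and 'stock market' are distinct
print(example_trigrams[0])
```

Counting unigrams alone lumps both senses of "stock" together, while the bigram list keeps them apart.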

## Constellate Datasets

The Constellate dataset builder has a historical term frequency viewer that is similar to the Google N-Gram Viewer. For example, we could create a dataset of medical journals and see how common particular terms are over time. 

![The Constellate Term Frequency Viewer showing diseases represented in medical journals in the 20th century](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/frequency-viewer.png)

The Constellate term frequency viewer will graph frequencies for bigrams and trigrams as well.

![The Constellate Term Frequency Viewer showing the frequency of different kinds of fevers whose names are bigrams](https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/frequency-viewer-2.png)

Building a dataset triggers a process that gathers up all the unigrams, bigrams, and trigrams for the documents you've selected. We are able to supply these n-gram lists with their accompanying metadata for any source, even if the materials are under copyright. This is the essence of a "non-consumptive" dataset. The researcher can access the n-grams but not the underlying full-text. In cases where there are no copyright restrictions, we also supply the full-text of the material.

The materials are available for download and analysis in several dataset types. The most complete type is a JSON-Lines file which contains all of the data we can legally provide. Many of the notebooks we offer rely on this data format and make it easy to accomplish common text analysis tasks such as counting word frequencies, creating word clouds, significant terms weighting, and topic modeling. 

We can create our own Constellate-compatible datasets from any texts by extracting the unigrams, bigrams, trigrams, and full text. We would then simply need to put them into the appropriate form matching the Constellate data schema. Then we could run the analyses mentioned above on our own texts. This notebook focuses on the tokenization processes to gather the unigrams, bigrams, and trigrams.
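As a rough sketch of what "putting them into the appropriate form" could look like, the example below serializes one document as a single JSON Lines record. The field names (`id`, `fullText`, `unigramCount`) and their value types are assumptions made for illustration; they should be checked against the Constellate data schema before use.

```python
import json
from collections import Counter

# Hypothetical document record. The field names and value types below are
# assumptions for illustration; confirm them against the Constellate schema.
sample_words = 'the stock market and the vegetable stock'.split()
record = {
    'id': 'my-document-1',                       # hypothetical identifier
    'fullText': ' '.join(sample_words),
    'unigramCount': dict(Counter(sample_words)),
}

# JSON Lines stores one JSON object per line, so each record becomes one line
json_line = json.dumps(record)
restored = json.loads(json_line)
print(restored['unigramCount']['stock'])
```

In practice each `json_line` would be written to a file followed by a newline, one document per line.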

## Creating your own basic tokenizer

It is possible to create your own basic tokenizer by using Python string methods. The following example uses the `.split()` method to gather unigrams.

In [1]:
# Download Shakespeare's Othello from Project Gutenberg
import urllib.request
from pathlib import Path

# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)

text_address = 'https://www.gutenberg.org/cache/epub/1531/pg1531.txt'
text_name = '../data/' + text_address.rsplit('/', 1)[-1]
urllib.request.urlretrieve(text_address, text_name)

('../data/pg1531.txt', <http.client.HTTPMessage at 0x1040b1d10>)

In [2]:
# Opening a file in read mode
with open(text_name, 'r') as f:
    text = f.read()
    print(text)

The Project Gutenberg eBook of Othello, the Moor of Venice
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Othello, the Moor of Venice

Author: William Shakespeare

Release date: November 1, 1998 [eBook #1531]
                Most recently updated: December 16, 2023

Language: English

Credits: the PG Shakespeare Team, a team of about twenty Project Gutenberg volunteers


*** START OF THE PROJECT GUTENBERG EBOOK OTHELLO, THE MOOR OF VENICE ***




cover




OTHELLO, THE MOOR OF VENICE

by William Shakespeare




Contents

 ACT I
 Scene I. Venice. A street
 Scene II. Venice. Another str

In [3]:
# See the raw string version of our text
text

'\ufeffThe Project Gutenberg eBook of Othello, the Moor of Venice\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\n\nTitle: Othello, the Moor of Venice\n\nAuthor: William Shakespeare\n\nRelease date: November 1, 1998 [eBook #1531]\n                Most recently updated: December 16, 2023\n\nLanguage: English\n\nCredits: the PG Shakespeare Team, a team of about twenty Project Gutenberg volunteers\n\n\n*** START OF THE PROJECT GUTENBERG EBOOK OTHELLO, THE MOOR OF VENICE ***\n\n\n\n\ncover\n\n\n\n\nOTHELLO, THE MOOR OF VENICE\n\nby William Shakespeare\n\n\n\n\nContents\n\n ACT I\n Scene I
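The raw string begins with `\ufeff`, a byte order mark (BOM) that will otherwise cling to the first token when the text is split. One possible way to avoid this, sketched here on a small temporary file rather than the downloaded play, is to read the file with Python's `utf-8-sig` codec, which drops a leading BOM. The file name is hypothetical.

```python
import tempfile
from pathlib import Path

# Write a small sample file that starts with a UTF-8 byte order mark
sample_path = Path(tempfile.mkdtemp()) / 'bom_sample.txt'  # hypothetical file
sample_path.write_bytes(b'\xef\xbb\xbfThe Project Gutenberg eBook')

# Plain 'utf-8' keeps the BOM as the character '\ufeff'
with open(sample_path, 'r', encoding='utf-8') as f:
    with_bom = f.read()

# 'utf-8-sig' strips a leading BOM while decoding
with open(sample_path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read()

print(repr(with_bom[:4]))
print(repr(without_bom[:4]))
```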

In [4]:
# Splitting the text string into a list of strings
tokenized_list = text.split()
list(tokenized_list)

['\ufeffThe',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Othello,',
 'the',
 'Moor',
 'of',
 'Venice',
 'This',
 'ebook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'in',
 'the',
 'United',
 'States',
 'and',
 'most',
 'other',
 'parts',
 'of',
 'the',
 'world',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever.',
 'You',
 'may',
 'copy',
 'it,',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'ebook',
 'or',
 'online',
 'at',
 'www.gutenberg.org.',
 'If',
 'you',
 'are',
 'not',
 'located',
 'in',
 'the',
 'United',
 'States,',
 'you',
 'will',
 'have',
 'to',
 'check',
 'the',
 'laws',
 'of',
 'the',
 'country',
 'where',
 'you',
 'are',
 'located',
 'before',
 'using',
 'this',
 'eBook.',
 'Title:',
 'Othello,',
 'the',
 'Moor',
 'of',
 'Venice',
 'Author:',
 'William',
 'Shakespeare',
 'Release',
 'date:',
 'N

In [5]:
# Cleaning up the tokens
unigrams = []

for token in tokenized_list:
    token = token.lower() # lowercase tokens
    token = token.replace('.', '') # remove periods
    token = token.replace('!', '') # remove exclamation points
    token = token.replace('?', '') # remove question marks
    unigrams.append(token)

In [6]:
# Preview the unigrams
list(unigrams)

['\ufeffthe',
 'project',
 'gutenberg',
 'ebook',
 'of',
 'othello,',
 'the',
 'moor',
 'of',
 'venice',
 'this',
 'ebook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'in',
 'the',
 'united',
 'states',
 'and',
 'most',
 'other',
 'parts',
 'of',
 'the',
 'world',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever',
 'you',
 'may',
 'copy',
 'it,',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'project',
 'gutenberg',
 'license',
 'included',
 'with',
 'this',
 'ebook',
 'or',
 'online',
 'at',
 'wwwgutenbergorg',
 'if',
 'you',
 'are',
 'not',
 'located',
 'in',
 'the',
 'united',
 'states,',
 'you',
 'will',
 'have',
 'to',
 'check',
 'the',
 'laws',
 'of',
 'the',
 'country',
 'where',
 'you',
 'are',
 'located',
 'before',
 'using',
 'this',
 'ebook',
 'title:',
 'othello,',
 'the',
 'moor',
 'of',
 'venice',
 'author:',
 'william',
 'shakespeare',
 'release',
 'date:',
 'novemb

In [7]:
# Count up the tokens using a Counter() object
from collections import Counter
word_counts = Counter(unigrams)
print(word_counts)

Counter({'the': 953, 'and': 852, 'i': 817, 'to': 659, 'of': 579, 'a': 513, 'you': 478, 'my': 426, 'in': 400, 'that': 372, 'not': 328, 'iago': 321, 'is': 306, 'othello': 302, 'it': 291, 'with': 267, 'for': 253, 'this': 248, 'me': 233, 'be': 233, 'do': 228, 'your': 216, 'but': 212, 'he': 211, 'have': 207, 'desdemona': 200, 'cassio': 198, 'her': 181, 'his': 170, 'as': 169, 'if': 160, 'or': 158, 'she': 154, 'will': 153, 'what': 153, 'him': 142, 'so': 138, 'thou': 136, 'by': 134, 'are': 129, 'emilia': 119, 'on': 107, 'all': 103, 'shall': 97, 'from': 92, 'am': 88, 'roderigo': 87, '’tis': 86, 'good': 85, 'project': 84, 'no': 84, 'how': 82, 'at': 81, 'thy': 80, 'would': 80, 'o': 75, 'was': 72, 'some': 72, 'let': 72, 'may': 71, 'they': 71, 'such': 69, 'did': 67, 'must': 67, 'more': 66, 'o,': 65, 'you,': 65, 'hath': 65, 'now': 64, 'enter': 63, 'most': 61, 'yet': 61, 'love': 60, 'know': 59, 'any': 59, 'we': 59, 'lord': 59, 'had': 58, 'here': 57, 'go': 57, 'an': 56, 'say': 56, 'make': 55, 'upon': 
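The counts above still list 'o,' and 'you,' separately from 'o' and 'you' because the cleaning loop removes only periods, exclamation points, and question marks. One possible refinement, sketched here on a handful of sample tokens, is to trim any leading or trailing ASCII punctuation with `string.punctuation`. Note that this would not remove non-ASCII marks such as the curly apostrophe in '’tis'.

```python
import string
from collections import Counter

sample_tokens = ['You,', 'you', 'O,', 'o', 'Othello,', 'whatsoever.', 're-use']

cleaned_tokens = []
for token in sample_tokens:
    token = token.lower()
    token = token.strip(string.punctuation)  # trim punctuation at both ends only
    if token:  # skip tokens that were nothing but punctuation
        cleaned_tokens.append(token)

print(Counter(cleaned_tokens))
```

Because `.strip()` only touches the ends of a token, internal punctuation such as the hyphen in 're-use' is preserved.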

## NLTK

While writing your own tokenizer may allow you to create highly customized results, it is easier and often more effective to use existing tokenizers offered in packages such as the Natural Language Toolkit (NLTK) and spaCy. Ultimately, whatever tokenizer you use, it is helpful to understand Python string manipulations and regular expressions in case you wish to adapt a particular tokenizer to your texts. 


The NLTK library has multiple tokenizers available.

### [Word Punctuation](https://www.nltk.org/_modules/nltk/tokenize/regexp.html)
The word punctuation tokenizer splits on whitespace and splits out punctuation into separate tokens.

### [Penn Treebank](https://www.nltk.org/_modules/nltk/tokenize/treebank.html)
The Treebank tokenizer underlies NLTK's default `word_tokenize` function. It features a variety of regular expressions for handling contractions and punctuation such as quotes, parentheses, brackets, and dashes.

### [Tweet](https://www.nltk.org/_modules/nltk/tokenize/casual.html#TweetTokenizer)
The Twitter tokenizer is designed to work with Twitter and social media text. It uses regular expressions for addressing emoticons, phone numbers, URLs, Twitter usernames, and email addresses.

### [Multi-Word Expression](https://www.nltk.org/_modules/nltk/tokenize/mwe.html)
The MWETokenizer takes a "string which has already been divided into tokens and retokenizes it, merging multi-word expressions into single tokens, using a lexicon of MWEs." The lexicon of multi-word expressions is constructed by the user. It can be constructed ad hoc depending on the user's research interest or discovered through the use of techniques like part-of-speech tagging, collocation, and named entity recognition.
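To show the idea behind retokenizing with a user-built lexicon, here is a minimal pure-Python sketch. It is not NLTK's implementation; `merge_mwes` and `user_lexicon` are hypothetical names introduced for this example.

```python
def merge_mwes(tokens, lexicon, separator='_'):
    """Merge any token sequence found in the lexicon into a single token"""
    # Try longer expressions first so they win over shorter overlapping ones
    expressions = sorted((mwe for mwe in lexicon if mwe), key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for mwe in expressions:
            if tuple(tokens[i:i + len(mwe)]) == tuple(mwe):
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts at this position, so keep the token as-is
            merged.append(tokens[i])
            i += 1
    return merged

sample = ['visit', 'Lake', 'Michigan', 'and', 'eat', 'ice', 'cream']
user_lexicon = [('Lake', 'Michigan'), ('ice', 'cream')]
print(merge_mwes(sample, user_lexicon))
```

With this lexicon, 'Lake Michigan' and 'ice cream' from the earlier example sentence each become one token.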

In [8]:
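# Import a variety of tokenizers
import nltk
nltk.download('punkt', download_dir='../data/nltk_data')
nltk.download('averaged_perceptron_tagger', download_dir='../data/nltk_data')
# Make sure NLTK searches the custom download folder for its data
nltk.data.path.append('../data/nltk_data')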
from nltk.tokenize import (TreebankWordTokenizer,
                          word_tokenize,
                          wordpunct_tokenize,
                          TweetTokenizer,
                          MWETokenizer)

[nltk_data] Downloading package punkt to ../data/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     ../data/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [9]:
string = "Nathan Kelber is helping us tokenize with the Constellate platform. http://constellate.org #NLP"

In [10]:
# Python .split() tokenization
split_tokens = string.split()
print('Python .split()')
print(split_tokens, '\n')

# Punctuation-based tokenization
punct_tokens = wordpunct_tokenize(string)
print('Wordpunct tokenizer')
print(punct_tokens, '\n')

# Treebank Tokenizer
treebank_tokens = TreebankWordTokenizer().tokenize(string)
print('Treebank Tokenizer')
print(treebank_tokens, '\n')

# TweetTokenizer
tweet_tokens = TweetTokenizer().tokenize(string)
print('Tweet Tokenizer')
print(tweet_tokens, '\n')

# Multi-Word Expression Tokenizer
tokenizer = MWETokenizer([('Nathan', 'Kelber')])
MWE_tokens = tokenizer.tokenize(word_tokenize(string))
print('MWE Tokenizer')
print(MWE_tokens)

Python .split()
['Nathan', 'Kelber', 'is', 'helping', 'us', 'tokenize', 'with', 'the', 'Constellate', 'platform.', 'http://constellate.org', '#NLP'] 

Wordpunct tokenizer
['Nathan', 'Kelber', 'is', 'helping', 'us', 'tokenize', 'with', 'the', 'Constellate', 'platform', '.', 'http', '://', 'constellate', '.', 'org', '#', 'NLP'] 

Treebank Tokenizer
['Nathan', 'Kelber', 'is', 'helping', 'us', 'tokenize', 'with', 'the', 'Constellate', 'platform.', 'http', ':', '//constellate.org', '#', 'NLP'] 

Tweet Tokenizer
['Nathan', 'Kelber', 'is', 'helping', 'us', 'tokenize', 'with', 'the', 'Constellate', 'platform', '.', 'http://constellate.org', '#NLP'] 

MWE Tokenizer
['Nathan_Kelber', 'is', 'helping', 'us', 'tokenize', 'with', 'the', 'Constellate', 'platform', '.', 'http', ':', '//constellate.org', '#', 'NLP']
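The Wordpunct output above can be approximated with a single regular expression that matches either a run of word characters or a run of characters that are neither word characters nor whitespace. The sketch below uses only Python's `re` module and is meant to show the idea, not to replace the NLTK tokenizer.

```python
import re

example = ("Nathan Kelber is helping us tokenize with the Constellate platform. "
           "http://constellate.org #NLP")

# \w+ matches a run of letters, digits, or underscores;
# [^\w\s]+ matches a run of punctuation or symbols
regex_tokens = re.findall(r'\w+|[^\w\s]+', example)
print(regex_tokens)
```

This explains why the URL is broken into 'http', '://', 'constellate', '.', and 'org': each switch between word characters and punctuation starts a new token.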


The tokenizer will generate a list of unigrams, but we still need to generate our bigrams and trigrams. We can simply pass the tokens into NLTK's `bigrams` and `trigrams` functions, then store the results in lists.

In [11]:
# Creating our bigrams and trigrams
bigrams = list(nltk.bigrams(treebank_tokens))
trigrams = list(nltk.trigrams(treebank_tokens))

print('Bigrams: \n ', bigrams, '\n')
    
print('Trigrams: \n', trigrams)


Bigrams: 
  [('Nathan', 'Kelber'), ('Kelber', 'is'), ('is', 'helping'), ('helping', 'us'), ('us', 'tokenize'), ('tokenize', 'with'), ('with', 'the'), ('the', 'Constellate'), ('Constellate', 'platform.'), ('platform.', 'http'), ('http', ':'), (':', '//constellate.org'), ('//constellate.org', '#'), ('#', 'NLP')] 

Trigrams: 
 [('Nathan', 'Kelber', 'is'), ('Kelber', 'is', 'helping'), ('is', 'helping', 'us'), ('helping', 'us', 'tokenize'), ('us', 'tokenize', 'with'), ('tokenize', 'with', 'the'), ('with', 'the', 'Constellate'), ('the', 'Constellate', 'platform.'), ('Constellate', 'platform.', 'http'), ('platform.', 'http', ':'), ('http', ':', '//constellate.org'), (':', '//constellate.org', '#'), ('//constellate.org', '#', 'NLP')]


NLTK's `bigrams` and `trigrams` functions produce each n-gram as a tuple of tokens. If we want the n-grams to be strings, we need to access each index of the tuple and build a string out of the words.

In [12]:
# Function definitions for Converting NLTK tuples into strings

from collections import Counter

def convert_tuple_bigrams(tuples_to_convert):
    """Converts NLTK tuples into bigram strings"""
    string_grams = []
    for tuple_grams in tuples_to_convert:
        first_word = tuple_grams[0]
        second_word = tuple_grams[1]
        gram_string = f'{first_word} {second_word}'
        string_grams.append(gram_string)
    return string_grams

def convert_tuple_trigrams(tuples_to_convert):
    """Converts NLTK tuples into trigram strings"""
    string_grams = []
    for tuple_grams in tuples_to_convert:
        first_word = tuple_grams[0]
        second_word = tuple_grams[1]
        third_word = tuple_grams[2]
        gram_string = f'{first_word} {second_word} {third_word}'
        string_grams.append(gram_string)
    return string_grams

def convert_strings_to_counts(string_grams):
    """Converts a list of n-gram strings into a dictionary of counts"""
    counter_of_grams = Counter(string_grams)
    dict_of_grams = dict(counter_of_grams)
    return dict_of_grams

In [13]:
# Converting the tuples
string_bigrams = convert_tuple_bigrams(bigrams)
bigramCount = convert_strings_to_counts(string_bigrams)

print('Bigrams as a dictionary of counts')
print(bigramCount, '\n')

string_trigrams = convert_tuple_trigrams(trigrams)
trigramCount = convert_strings_to_counts(string_trigrams)

print('Trigrams as a dictionary of counts')
print(trigramCount)

Bigrams as a dictionary of counts
{'Nathan Kelber': 1, 'Kelber is': 1, 'is helping': 1, 'helping us': 1, 'us tokenize': 1, 'tokenize with': 1, 'with the': 1, 'the Constellate': 1, 'Constellate platform.': 1, 'platform. http': 1, 'http :': 1, ': //constellate.org': 1, '//constellate.org #': 1, '# NLP': 1} 

Trigrams as a dictionary of counts
{'Nathan Kelber is': 1, 'Kelber is helping': 1, 'is helping us': 1, 'helping us tokenize': 1, 'us tokenize with': 1, 'tokenize with the': 1, 'with the Constellate': 1, 'the Constellate platform.': 1, 'Constellate platform. http': 1, 'platform. http :': 1, 'http : //constellate.org': 1, ': //constellate.org #': 1, '//constellate.org # NLP': 1}
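The two tuple-conversion functions above differ only in how many words they join. As an alternative sketch, a single helper built on `str.join` can handle tuples of any length; `tuples_to_strings` is a hypothetical name, and the sample tuples are made up for this example.

```python
from collections import Counter

def tuples_to_strings(tuples_to_convert):
    """Join each n-gram tuple into a space-separated string"""
    return [' '.join(gram) for gram in tuples_to_convert]

sample_bigram_tuples = [('ice', 'cream'), ('Lake', 'Michigan'), ('ice', 'cream')]
sample_trigram_tuples = [('eat', 'ice', 'cream')]

print(dict(Counter(tuples_to_strings(sample_bigram_tuples))))
print(tuples_to_strings(sample_trigram_tuples))
```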


Depending on the analysis we are doing, we may want to group similar words together. For example, we may want to group the singular and plural forms of a noun, or the different tenses of a verb.

* ducks -> duck
* flown -> fly

To accomplish this, we could use a stemmer, such as the Snowball stemmer. A stemmer removes the last part of particular words to get a base form. It is a quick method which is useful for very large datasets and/or working with limited computing power.

In an ideal world, a lemmatizer will do a better job. It does not simply strip off letters but looks up verb tenses and takes into account the part of speech of each word.
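The difference can be sketched with a toy example. The lookup table below stands in for the large lexicon a real lemmatizer consults; its entries and the name `toy_lemmatize` are invented for illustration and do not reflect any library's data or API.

```python
# A toy lemmatizer: a hand-made lookup table keyed by (word, part of speech).
# Real lemmatizers consult large lexicons; this table is purely illustrative.
LEMMA_TABLE = {
    ('ducks', 'NOUN'): 'duck',
    ('flown', 'VERB'): 'fly',
    ('went', 'VERB'): 'go',
    ('meeting', 'VERB'): 'meet',
    ('meeting', 'NOUN'): 'meeting',
}

def toy_lemmatize(word, pos):
    """Return the lemma from the table, or the lowercased word if unknown"""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemmatize('flown', 'VERB'))
print(toy_lemmatize('meeting', 'VERB'))
print(toy_lemmatize('meeting', 'NOUN'))
```

No amount of letter stripping turns 'flown' into 'fly' or 'went' into 'go', and the two results for 'meeting' show why the part of speech matters.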

In [14]:
# Snowball stemmer
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
unstemmed_token = 'running'
#unstemmed_token = 'flown'

stemmed_token = stemmer.stem(unstemmed_token)

print(stemmed_token)

run


Part-of-speech tagging labels each token with its grammatical category, such as noun, verb, or adjective.

In [None]:
# Part of Speech Tagging
pos_list = nltk.pos_tag(nltk.word_tokenize(string))
print(pos_list)

## spaCy

spaCy takes a different approach from NLTK, creating a document model of a text. It is more sophisticated, but uses a different syntax for NLP tasks.


In [18]:
# Install spaCy and download its small English model
%pip install spacy
%pip install -U pip setuptools wheel
%pip install -U spacy
!python -m spacy download en_core_web_sm

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.


In [19]:
from spacy.lang.en import English

nlp = English()

string = "Nathan Kelber is helping us tokenize with the Constellate platform. http://constellate.org #NLP"

my_doc = nlp(string)

tokens = []
for token in my_doc:
    tokens.append(token.text)

print(tokens)

['Nathan', 'Kelber', 'is', 'helping', 'us', 'tokenize', 'with', 'the', 'Constellate', 'platform', '.', 'http://constellate.org', '#', 'NLP']


To change how spaCy tokenizes, you can [add rules](https://machinelearningknowledge.ai/complete-guide-to-spacy-tokenizer-with-examples/). spaCy also supports part-of-speech tagging and lemmatization.

In [20]:
import spacy
nlp = spacy.load('en_core_web_sm')
my_doc = nlp(string)

print('Parts of Speech')
for token in my_doc:
    print(token, token.pos_)

print('\nLemmatizations')
for token in my_doc:
    print(token, token.lemma_)

Parts of Speech
Nathan PROPN
Kelber PROPN
is AUX
helping VERB
us PRON
tokenize VERB
with ADP
the DET
Constellate PROPN
platform NOUN
. PUNCT
http://constellate.org X
# SYM
NLP PROPN

Lemmatizations
Nathan Nathan
Kelber Kelber
is be
helping help
us we
tokenize tokenize
with with
the the
Constellate Constellate
platform platform
. .
http://constellate.org http://constellate.org
# #
NLP NLP


We can gather our n-grams by defining a function that accepts our tokens and an argument `n` for the "n" in "n-gram." So, a bigram would be n = 2.

In [21]:
# A function for gathering n-grams from a list of tokens
def n_grams(tokens, n):
    """Return every run of n consecutive tokens as a list"""
    grams = []
    for i in range(len(tokens) - n + 1):
        grams.append(tokens[i:i + n])
    return grams
    # return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]  # Written as a list comprehension

In [22]:
bigrams = n_grams(tokens, 2)
trigrams = n_grams(tokens, 3)
print(bigrams)
print(trigrams)

[['Nathan', 'Kelber'], ['Kelber', 'is'], ['is', 'helping'], ['helping', 'us'], ['us', 'tokenize'], ['tokenize', 'with'], ['with', 'the'], ['the', 'Constellate'], ['Constellate', 'platform'], ['platform', '.'], ['.', 'http://constellate.org'], ['http://constellate.org', '#'], ['#', 'NLP']]
[['Nathan', 'Kelber', 'is'], ['Kelber', 'is', 'helping'], ['is', 'helping', 'us'], ['helping', 'us', 'tokenize'], ['us', 'tokenize', 'with'], ['tokenize', 'with', 'the'], ['with', 'the', 'Constellate'], ['the', 'Constellate', 'platform'], ['Constellate', 'platform', '.'], ['platform', '.', 'http://constellate.org'], ['.', 'http://constellate.org', '#'], ['http://constellate.org', '#', 'NLP']]


While the NLTK and spaCy tokenizers are the most prominent, tokenizers are also available in packages such as:

* [Gensim](https://radimrehurek.com/gensim/)
* [Keras](https://keras.io/)
* [Stanford NLP](https://nlp.stanford.edu/software/tokenizer.shtml)