This notebook was created by [William Mattingly](https://datascience.si.edu/people/dr-william-mattingly) for the 2022 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Arizona Libraries](https://new.library.arizona.edu/) and [Zoe LeBlanc](https://ischool.illinois.edu/people/zoe-leblanc) for the 2021 Text Analysis Pedagogy Institute, with support from the [National Endowment for the Humanities](https://neh.gov), [JSTOR Labs](https://labs.jstor.org/), and [University of Virginia Libraries](https://library.virginia.edu).

This notebook is adapted by Zhuo Chen under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).
____

# Multilingual NER 3

This is lesson 3 in the educational series on named entity recognition. 

**Description:** This notebook describes:
* how to understand word embeddings as a concept
* how to understand Machine Learning as a concept
* how to understand supervised learning
* how to do NER ML in spaCy 3

**Knowledge Required:** 

* Python basics ([start learning Python basics](../../PythonForDataAnalysis/GettingStarted/basic/python-basics-1.ipynb))
* [Python intermediate 4](../../PythonForDataAnalysis/GettingStarted/intermediate/python-intermediate-4.ipynb) (OOP, classes, instances, inheritance)

**Knowledge Recommended:**

* Basic file operations ([start learning file operations](../../PythonForDataAnalysis/GettingStarted/intermediate/python-intermediate-2.ipynb))
* Data cleaning with `Pandas` ([start learning Pandas](../../PythonForDataAnalysis/GettingStarted/pandas/pandas-1.ipynb))
___

# Install libraries

In [None]:
%pip install spacy # for NLP
%pip install pandas # for making tabular data
!python -m spacy download en_core_web_sm # for English NER
!python -m spacy download en_core_web_md # for showing the word vectors

# Introduction to word embeddings

How do we represent word meanings in NLP? One way we can represent word meanings is to use word vectors. **Word embeddings** are vector representations of words.

## Distributional hypothesis

Word embeddings is inspired by the **distributional hypothesis** proposed by Harris ([1954](https://doi.org/10.1080/00437956.1954.11659520)). This theory could be summarized as: words that have similar context will have similar meanings.

What does "context" mean in word embeddings? Basically, "context" means the neighboring words of a target word. 

Consider the following example. If we choose "village" as the target word and choose a fixed size context window of 2, the two words before "village" and the two words after "village" will constitute the context of the target word.

Treblinka is **a small** **<span style="color: blue;">village</span>** **in Poland.**



## Word2Vec

Google’s pre-trained word2vec model includes word vectors for a vocabulary of 3 million words and phrases that they trained on roughly 100 billion words from a Google News dataset. The vector length is 300 features, which means each of the 3 million words in the vocabulary is represented by a vector with 300 floating numbers. Word2Vec is one of the most popular techniques to learn word embeddings.

The training samples are the (target, context) pairs from the text data. For example, suppose your source text is the sentence "The quick brown fox jumps over the lazy dog". If you choose "quick" as your target word and have set a context window of size 2, you will get three training samples for it, i.e. (quick, the), (quick, brown) and (quick fox).   

**McCormick, C**. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://mccormickml.com/

The word2vec model is trained to accomplish the following task: given the input word $w_{1}$, for each word $w_{2}$ in our vocab, how likely $w_{2}$ is a context word of $w_{1}$.

The network is going to learn the statistics from the number of times each (target, context) shows up. So, for example, if you have a text about kings, queens and kingdoms, the network is probably going to get many more training samples of ("King", "Queen") than ("King", "kangaroo"). Therefore, if you give your trained model the word "King" as input, then it will output a much higher probability for "Queen" than it will for "kangaroo".

## Word vectors in SpaCy

We have used the small English model from spaCy in the previous two notebooks. Actually, there are medium size and large size English models from spaCy as well. Both are trained using the word2vec family of algorithms.

In [2]:
import spacy

# Load the medium size English model from spaCy
nlp = spacy.load('en_core_web_md')

# Get the word vector for the word "King"
nlp("King").vector

array([-6.0644e-01, -5.1205e-01,  6.4921e-03, -2.9194e-01, -5.6515e-01,
       -1.1523e-01,  7.7274e-02,  3.3561e-01,  1.1593e-01,  2.3516e+00,
        5.1773e-02, -5.4229e-01, -5.7972e-01,  1.3220e-01,  2.8430e-01,
       -7.9592e-02, -2.6762e-01,  1.8301e-01, -4.1264e-01,  2.0459e-01,
        1.4436e-01, -1.8714e-01, -3.1393e-01,  1.7821e-01, -1.0997e-01,
       -2.5584e-01, -1.1149e-01,  9.6212e-02, -1.6168e-01,  4.0055e-01,
       -2.6115e-01,  5.3777e-01, -5.2382e-01,  2.7637e-01,  7.2191e-01,
        6.0405e-02, -1.7922e-01,  1.8020e-01, -1.4381e-01, -1.4795e-01,
       -8.1394e-02,  5.8282e-02,  2.2964e-02, -2.6374e-01,  1.0704e-01,
       -4.5425e-01, -1.9964e-01,  3.7720e-01, -9.7784e-02, -3.1999e-01,
       -7.8509e-02,  6.1502e-01,  7.1643e-02, -3.0930e-02,  2.1508e-01,
        2.5280e-01, -3.1643e-01,  6.6698e-01,  1.9813e-02, -3.2311e-01,
        2.9266e-02, -4.1403e-02,  2.8346e-01, -7.9143e-01,  1.3327e-01,
        7.7231e-02, -1.8724e-01, -3.3146e-01, -2.0797e-01, -6.93

In [3]:
# Get the size of the vector
nlp("King").vector.size

300

In [4]:
# Get the similarity between the two words "King" and "Queen"
nlp("King").similarity(nlp("Queen"))

0.38253095746040344

In [5]:
# Get the similarity between the two words "King" and "kangaroo"
nlp("King").similarity(nlp("kangaroo"))

0.2849182188510895

# Introduction to Machine Learning 

How is word2vector model trained? The model is trained using a machine learning technique. 

Machine learning is a branch of artificial intelligence. Traditionally the human writes the rules in a computer system to perform a specific task. In machine learning, we use statistics to write the rules for us.

## The machine learning pipeline
<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/NER_ML_pipeline.png' width=700></center>

Let's use a simple example to understand the ML pipeline. Suppose you are interested in the relationship between the size and the price of a house in your neighborhood. Specifically, you would like to use the size of a house to predict its price. You go to Redfin/Zillow and find the information about the recently sold houses in your neighborhood. You note down their size and sale price. You draw a scatter plot like the following to examine the data. 

<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/NER_housebuying_scatter.png' width=300></center>

What you have in this scatter plot is your data. Now, you would like to derive a relationship between the house size and house price. Let's use linear regression in this case. Essentially, you fit a line to the data points. 

<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/NER_housebuying.png' width=300></center>

The function for this line is y = ax + b (where y is the price and x is the # of sqft). Of course, you would not just fit any line to your data points. You would want to fit a line so that the difference between the actual house prices and the predicted house prices is the smallest. Our task, then, reduces to the calculation of the value of a and b in the function y = ax + b so that the difference between the actual house prices and the predicted house prices is the smallest.

## ML in Word2Vec

<center><img src='https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/NER_ML_pipeline.png' width=700></center>

The ML method used in word2vec is a shallow neural network with one hidden layer of neurons and one output layer of neurons. Chris McCormick has a very detailed explanation of this model in his blog post http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/. Let's go take a look.

## Supervised Learning

**Supervised learning** is the process by which a system learns from a set of inputs that have known labels. To train a model, you first need training data – text examples, and the gold standard – labels you want the model to predict. This means that your training data need to be annotated.

### Training and evaluation

"When training a model, we don’t just want it to memorize our examples – we want it to come up with a theory that can be generalized across unseen data. After all, we don’t just want the model to learn that this one instance of “Amazon” right here is a company – we want it to learn that “Amazon”, in contexts like this, is most likely a company. That’s why the training data should always be representative of the data we want to process. A model trained on Wikipedia, where sentences in the first person are extremely rare, will likely perform badly on Twitter. Similarly, a model trained on romantic novels will likely perform badly on legal text.

This also means that in order to know how the model is performing, and whether it’s learning the right things, you don’t only need training data – you’ll also need evaluation data."

https://spacy.io/usage/training

**Honnibal, M., & Montani, I.** (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

The training data is used to hone a statistical model via predetermined algorithms. It does this by making guesses about what the proper labels are. It then checks its accuracy against the correct labels, i.e., the annotated labels, and makes adjustments accordingly. Once it is finished viewing and guessing across all the training data, the first **epoch**, or **iteration** over the data, is finished. At this stage, the model then tests its accuracy against the evaluation data. The training data is then randomized and given back to the system for x number of epochs.

# NER with EntityRuler vs. ML NER

In this section, we are going to make two models to do the same NER task, one doing NER with an EntityRuler and the other doing NER using word vectors.

First, let's download the two data files needed for this example. 

In [6]:
import urllib.request
from pathlib import Path

# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)

# Download the files
urls = [
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/NER_HarryPotter_FilmSpells.csv',
    'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/NER_HarryPotter_Spells.csv',
]

for url in urls:
    urllib.request.urlretrieve(url, '../data/' + url.rsplit('/', 1)[-1])   
print('Sample files ready.')

Sample files ready.


The first file stores the information about the spells in Harry Potter. 

In [7]:
import pandas as pd
spells_df = pd.read_csv('../data/NER_HarryPotter_Spells.csv', sep=";")
spells_df

Unnamed: 0,Name,Incantation,Type,Effect,Light
0,Summoning Charm,Accio,Charm,Summons an object,
1,Age Line,Unknown,Charm,Prevents people above or below a certain age f...,Blue
2,Water-Making Spell,Aguamenti,"Charm, Conjuration",Conjures water,Icy blue
3,Launch an object up into the air,Alarte Ascendare,Charm,Rockets target upward,Red
4,Albus Dumbledore's Forceful Spell,Unknown,Spell,Great Force,
...,...,...,...,...,...
296,Waddiwasi,Waddiwasi,Jinx,Propels wad at the target,
297,Washing up spell,Unknown,Charm,Cleans dishes,
298,Levitation Charm,Wingardium Leviosa,Charm,Makes objects fly,
299,White sparks,Unknown,Charm,Jet of white sparks,White


In the second file, we find the characters speaking and their speech. Notice that there is a column storing the spells found in the sentence if there is one. 

In [8]:
film_spells = pd.read_csv('../data/NER_HarryPotter_FilmSpells.csv')
film_spells

Unnamed: 0,Character,Sentence,movie_number,identified_spells
0,Dumbledore,"I should've known that you would be here, Prof...",film 1,
1,McGonagall,"Good evening, Professor Dumbledore.",film 1,
2,McGonagall,"Are the rumors true, Albus?",film 1,
3,Dumbledore,"I'm afraid so, professor.",film 1,
4,Dumbledore,The good and the bad.,film 1,
...,...,...,...,...
4923,HERMIONE,"How fast is it, Harry?",film 3,
4924,HARRY,Lumos.,film 3,Lumos
4925,HARRY,I solemnly swear that I am up to no good.,film 3,
4926,HARRY,Mischief managed.,film 3,


Suppose we would like to create a model that can identify spells in a sentence and give it the label 'SPELL'.

## Create an NLP model with an EntityRuler to identify the spells

In the following, we will first create a NLP model with an entity ruler that identifies spells. This section can be seen as a review of what we have learned about EntityRuler in Wednesday's lesson.
Before we create a new EntityRuler, we will do some preprocessing of the data to get the patterns that we will add to the EntityRuler.

### Preprocessing the data

In [9]:
# Fill the NaN cells with an empty string
spells_df['Incantation'] = spells_df['Incantation'].fillna("")

# Get all spells
spells = spells_df['Incantation'].unique().tolist() # Put all strs in the 'Incantation' column in a list
spells = [spell for spell in spells if spell != ''] # Get all non-empty strs from the list, i.e. all the spells

# Take a look at the spells
spells

['Accio',
 'Unknown',
 'Aguamenti',
 'Alarte Ascendare',
 'Alohomora',
 'Anapneo',
 'Anteoculatia',
 'Aparecium',
 'Appare Vestigium',
 'Aqua Eructo',
 'Arania Exumai',
 'Arresto Momentum',
 'Ascendio',
 'Avada Kedavra',
 'Avifors\xa0',
 'Avenseguim',
 'Avis',
 'Baubillious',
 'Bombarda',
 'Bombarda Maxima',
 'Brackium Emendo',
 'Calvorio',
 'Cantis',
 'Capacious extremis',
 'Carpe Retractum',
 'Cave inimicum',
 'Circumrota',
 'Cistem Aperio',
 'Colloportus',
 'Colloshoo',
 'Colovaria',
 'Confringo',
 'Confundo',
 'Crinus Muto',
 'Crucio',
 'Defodio',
 'Deletrius',
 'Densaugeo',
 'Deprimo',
 'Depulso',
 'Descendo',
 'Diffindo',
 'Diminuendo',
 'Dissendium',
 'Draconifors',
 'Ducklifors',
 'Duro',
 'Ebublio',
 'Engorgio',
 'Engorgio Skullus',
 'Entomorphis',
 'Epoximise',
 'Erecto',
 'Evanesce',
 'Evanesco',
 'Everte Statum',
 'Expecto Patronum',
 'Expelliarmus',
 'Expulso',
 'Ferula',
 'Fianto Duri',
 'Finestra',
 'Finite',
 'Flagrante',
 'Flagrate',
 'Flintifors',
 'Flipendo',
 'Flipe

### Creating the patterns to be added to the EntityRuler

Recall from Wednesday's lesson that the patterns we add to an EntityRuler look like the following.

`patterns = [{"label": "GPE", "pattern": "Aars"}]`

In [10]:
# Write the pattern to be added to the ruler
patterns = [{"label":"SPELL", "pattern":spell} for spell in spells]

Now that we have the patterns ready, we can add them to an EntityRuler and add the ruler as a new pipe. 

In [11]:
# Create an EntityRuler and add the patterns to the ruler
entruler_nlp = spacy.blank('en') # Create a blank English model
ruler = entruler_nlp.add_pipe("entity_ruler") 
ruler.add_patterns(patterns)

In [12]:
test_text = """Ron Weasley: Wingardium Leviosa! Hermione Granger: You're saying it wrong. 
It's Wing-gar-dium Levi-o-sa, make the 'gar' nice and long. 
Ron Weasley: You do it, then, if you're so clever"""
doc = entruler_nlp(test_text)
for ent in doc.ents:
    print('EntRulerModel', ent.text, ent.label_)

EntRulerModel Wingardium Leviosa SPELL


In this model, we have basically hard written all spell strings in the EntityRuler. 

## Train a NLP model using ML to identify the spells

The format of the training data will look like the following. It is a list of tuples. In each tuple, the first element is the text string containing spells and the second element is a dictionary. The key of the dictionary is 'entities'. The value is a list of lists. In each list, we find the starting index, ending index and the label of the spell(s) found in the text string. 

`[
('Oculus Reparo', {'entities': [[0, 13, 'SPELL']]}),
('Alohomora', {'entities': [[0, 9, 'SPELL']]})
]`

The text strings we use for the training are from the 'Sentence' column of the film_spells dataframe.

In [13]:
# Take a look at the film_spells df
film_spells

Unnamed: 0,Character,Sentence,movie_number,identified_spells
0,Dumbledore,"I should've known that you would be here, Prof...",film 1,
1,McGonagall,"Good evening, Professor Dumbledore.",film 1,
2,McGonagall,"Are the rumors true, Albus?",film 1,
3,Dumbledore,"I'm afraid so, professor.",film 1,
4,Dumbledore,The good and the bad.,film 1,
...,...,...,...,...
4923,HERMIONE,"How fast is it, Harry?",film 3,
4924,HARRY,Lumos.,film 3,Lumos
4925,HARRY,I solemnly swear that I am up to no good.,film 3,
4926,HARRY,Mischief managed.,film 3,


Since we have hard written all spell strings in the EntityRuler and give them the label 'SPELL', we could just use this model to generate labeled data as our training data and evaluation data.

In [14]:
import nltk # for sentence tokenization
nltk.download('punkt')
def generate_labeled_data(ls_sents): # the input will be a list of strings
    text = ' '.join(ls_sents)
    sents = nltk.sent_tokenize(text)
    labeled_data = []
    for sent in sents:
        doc = entruler_nlp(sent) # create a doc object
        if doc.ents != (): # if there is at least one entity identified
            labeled_data.append((sent, {"entities":[[ent.start_char, ent.end_char, ent.label_] for ent in doc.ents]}))
    return labeled_data       

# Assign the result from the function to a new variable
training_validation_data = generate_labeled_data(film_spells['Sentence'].tolist())

# Take a look at the labeled data
training_validation_data

[nltk_data] Downloading package punkt to /Users/mearacox/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[('For example: Oculus Reparo.', {'entities': [[13, 26, 'SPELL']]}),
 ('Alohomora Get in Alohomora?',
  {'entities': [[0, 9, 'SPELL'], [17, 26, 'SPELL']]}),
 ('Wingardium Leviosa.', {'entities': [[0, 18, 'SPELL']]}),
 ('Wingardium Leviosa.', {'entities': [[0, 18, 'SPELL']]}),
 ('Wingardium Leviosa!', {'entities': [[0, 18, 'SPELL']]}),
 ("Neville, I'm really, really sorry about this  Petrificus Totalus.",
  {'entities': [[46, 64, 'SPELL']]}),
 ('Alohomora.', {'entities': [[0, 9, 'SPELL']]}),
 ('Alohomora!', {'entities': [[0, 9, 'SPELL']]}),
 ('Oculus Reparo.', {'entities': [[0, 13, 'SPELL']]}),
 ('Peskipiksi Pesternomi!', {'entities': [[0, 21, 'SPELL']]}),
 ('Immobulus!', {'entities': [[0, 9, 'SPELL']]}),
 ('Vera Verto.', {'entities': [[0, 10, 'SPELL']]}),
 ('Vera Verto.', {'entities': [[0, 10, 'SPELL']]}),
 ('Vera Verto!', {'entities': [[0, 10, 'SPELL']]}),
 ('Finite Incantatem!', {'entities': [[0, 6, 'SPELL']]}),
 ('Brackium Emendo!', {'entities': [[0, 15, 'SPELL']]}),
 ('Expelliarmus

spaCy 3 requires that our data be stored in the proprietary `.spacy` format. To do that we need to use the `DocBin` class.

In [15]:
from spacy.tokens import DocBin 

db = DocBin() 

for text, annot in training_validation_data[:19*2]: # Get the first 38 tuples as the training data
    doc = entruler_nlp(text) # create a doc object
    doc.ents = [doc.char_span(ent[0], ent[1], label=ent[2]) for ent in annot['entities']]
    db.add(doc)
db.to_disk(f"./train_spells.spacy")

In [16]:
for text, annot in training_validation_data[19*2:]: # Get the rest tuples as the validation data
    doc = entruler_nlp(text) 
    doc.ents = [doc.char_span(ent[0], ent[1], label=ent[2]) for ent in annot['entities']]
    db.add(doc)
db.to_disk(f"./valid_spells.spacy")

Now we can finally start training our model! 

In [17]:
!python -m spacy init config --lang en --pipeline ner config.cfg --force

[38;5;3m⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.[0m
[38;5;4mℹ Generated config template specific for your use case[0m
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [18]:
!python -m spacy train config.cfg --output ./output/spells-model/ --paths.train ./train_spells.spacy --paths.dev ./valid_spells.spacy

[38;5;2m✔ Created output directory: output/spells-model[0m
[38;5;4mℹ Saving to output directory: output/spells-model[0m
[38;5;4mℹ Using CPU[0m
[1m
[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['tok2vec', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.001[0m
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     70.00   33.99   26.80   46.43    0.34
141     200          1.46    584.26   98.18  100.00   96.43    0.98
341     400          0.00      0.00   98.18  100.00   96.43    0.98
541     600          0.00      0.00   98.18  100.00   96.43    0.98
741     800          0.00      0.00   98.18  100.00   96.43    0.98
941    1000          0.00      0.00   98.18  100.00   96.43    0.98
1141    1200          0.00      0.00   98.18  100.00   96.43    0.98
1341    1400          0.00      0.00   98.18  100.00   96.43    0.98
1541    1600          0.00      0.00  

Now let's finally run our model!

In [19]:
# Load the best model
model_best = spacy.load('./output/spells-model/model-best')

In [20]:
# Let's try our model on this long text string
test_text = """53. Imperio - Makes target obey every command But only for really, really funny pranks. 52. Piertotum Locomotor - Animates statues On one hand, this is awesome. On the other, someone would use this to scare me.

51. Aparecium - Make invisible ink appear

Your notes will be so much cooler.

50. Defodio - Carves through stone and steel

Sometimes you need to get the eff out of there.

49. Descendo - Moves objects downward

You'll never have to get a chair to reach for stuff again.

48. Specialis Revelio - Reveals hidden magical properties in an object

I want to know what I'm eating and if it's magical.

47. Meteolojinx Recanto - Ends effects of weather spells

Otherwise, someone could make it sleet in your bedroom forever.

46. Cave Inimicum/Protego Totalum - Strengthens an area's defenses

Helpful, but why are people trying to break into your campsite?

45. Impedimenta - Freezes someone advancing toward you

"Stop running at me! But also, why are you running at me?"

44. Obscuro - Blindfolds target

Finally, we don't have to rely on "No peeking."

43. Reducto - Explodes object

The "raddest" of all spells.

42. Anapneo - Clears someone's airway

This could save a life, but hopefully you won't need it.

41. Locomotor Mortis - Leg-lock curse

Good for footraces and Southwest Airlines flights.

40. Geminio - Creates temporary, worthless duplicate of any object

You could finally live your dream of lying on a bed of marshmallows, and you'd only need one to start.

39. Aguamenti - Shoot water from wand

No need to replace that fire extinguisher you never bought.

38. Avada Kedavra - The Killing Curse

One word: bugs.

37. Repelo Muggletum - Repels Muggles

Sounds elitist, but seriously, Muggles ruin everything. Take it from me, a Muggle.

36. Stupefy - Stuns target

Since this is every other word of the "Deathly Hallows" script, I think it's pretty useful."""

# Create a doc object out of the text string using the trained model
doc = model_best(test_text)

# Find out the entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Animates SPELL
Aparecium - SPELL
Your SPELL
Defodio - SPELL
Sometimes you SPELL
of SPELL
Descendo SPELL
Specialis Revelio SPELL
Reveals SPELL
Meteolojinx Recanto SPELL
46 SPELL
Protego Totalum SPELL
Strengthens SPELL
Freezes SPELL
Stop SPELL
Obscuro - SPELL
Blindfolds SPELL
Reducto SPELL
Explodes SPELL
Anapneo - SPELL
Clears SPELL
41 SPELL
Locomotor Mortis SPELL
Southwest Airlines SPELL
40 SPELL
Aguamenti - SPELL
Shoot SPELL
Avada Kedavra SPELL
Killing Curse SPELL
Repelo Muggletum SPELL
Repels SPELL
Sounds SPELL
Muggle SPELL
Stupefy SPELL
Since SPELL
Deathly Hallows SPELL


Let's also try the model we created with an EntityRuler with all spell names hard written in it.

In [21]:
# Create a doc object out of the text string using the EntityRuler model
doc = entruler_nlp(test_text)

# Find out the entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Imperio SPELL
Piertotum Locomotor SPELL
Aparecium SPELL
Defodio SPELL
Descendo SPELL
Specialis Revelio SPELL
Meteolojinx Recanto SPELL
Protego SPELL
Impedimenta SPELL
Obscuro SPELL
Reducto SPELL
Anapneo SPELL
Locomotor Mortis SPELL
Geminio SPELL
Aguamenti SPELL
Avada Kedavra SPELL
Stupefy SPELL


It seems in this example our EntityRuler model performs better than our trained model. Why do we think that is?

Part of the reason we aren't getting better results is something that Ines Montani describes in this Stack Overflow answer https://stackoverflow.com/questions/50580262/how-to-use-spacy-to-create-a-new-entity-and-learn-only-from-keyword-list/50603247#50603247

"The advantage of training the named entity recognizer to detect SPECIES in your text is that the model won't only be able to recognise your examples, but also generalise and recognise other species in context. If you only want to find a fixed set of terms and not more, a simpler, rule-based approach might work better for you."

# References
McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com