Advanced Text Analysis


These notebooks cover core techniques in text analysis. One introduces sentiment analysis, using VADER and Hugging Face’s transformers library to classify text sentiment. Another explains tokenization, covering word segmentation, n-grams, stemming, and lemmatization. Two notebooks demonstrate how to process plain text files and their metadata into JSON Lines (JSONL) records containing unigrams, bigrams, trigrams, and the full text. The final notebook focuses on topic modeling, including dataset retrieval, token cleaning, and topic visualization using Gensim and pyLDAvis.

Sentiment Analysis
Tokenizers
Tokenize Text Files
Tokenize Text Files with NLTK
Topic Modeling
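
As a rough illustration of what the tokenization notebooks produce, the sketch below turns one text into a JSON Lines record with unigram, bigram, and trigram counts plus the full text. It uses only the Python standard library and a naive whitespace tokenizer; the function names and the record's field names (`unigrams`, `bigrams`, `trigrams`, `full_text`) are assumptions made for this example and are not taken from the notebooks themselves.

```python
import json
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def to_record(doc_id, text):
    """Build one JSON-serializable record for a document (hypothetical schema)."""
    # Naive tokenization: lowercase, then split on whitespace.
    # The notebooks may use a different tokenizer (for example NLTK).
    tokens = text.lower().split()
    return {
        "id": doc_id,
        "unigrams": dict(Counter(ngrams(tokens, 1))),
        "bigrams": dict(Counter(ngrams(tokens, 2))),
        "trigrams": dict(Counter(ngrams(tokens, 3))),
        "full_text": text,
    }


record = to_record("doc-1", "the cat saw the cat")

# In JSON Lines, each document is one JSON object on its own line.
line = json.dumps(record)
print(line)
```

Writing one such line per document to a file yields a dataset that can be read back one record at a time with `json.loads`.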

Named Entity Recognition (NER)

This three-part series on Named Entity Recognition (NER) introduces the challenges of working with multilingual texts and techniques for addressing them. It begins by explaining NER concepts, text encoding, and the difficulties of processing multilingual corpora. The following lessons cover rule-based NER using spaCy, including creating an EntityRuler and identifying languages, and the series concludes with an introduction to word embeddings, machine learning, and supervised NER in spaCy 3.

MultiLingual NER 1
MultiLingual NER 2
MultiLingual NER 3
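
To give a sense of what rule-based NER means before opening the lessons, here is a minimal sketch in plain Python. It is not spaCy's EntityRuler: it is a simple phrase-list matcher, standing in for the general idea of labeling text spans from hand-written patterns. The `PATTERNS` table and the `find_entities` function are hypothetical names invented for this example.

```python
import re

# Hypothetical hand-written rules: each label maps to the phrases it should match.
PATTERNS = {
    "ORG": ["United Nations"],
    "GPE": ["Paris", "New York"],
}


def find_entities(text, patterns):
    """Return (start, end, matched_text, label) tuples, ordered by position."""
    entities = []
    for label, phrases in patterns.items():
        for phrase in phrases:
            # Match the phrase literally, on word boundaries only.
            regex = r"\b" + re.escape(phrase) + r"\b"
            for match in re.finditer(regex, text):
                entities.append((match.start(), match.end(), match.group(), label))
    return sorted(entities)


ents = find_entities("The United Nations met in Paris.", PATTERNS)
for start, end, span, label in ents:
    print(start, end, span, label)
```

A matcher like this only finds the exact phrases it is given, which is one reason the series goes on to supervised, machine-learning-based NER.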