Natural language processing (NLP) using Python NLTK (Simple Examples)

The Natural Language Toolkit, or NLTK, is a Python library created for symbolic and natural language processing tasks.

It aims to make natural language processing accessible to everyone, and it works not only with English but with many other natural human languages.

 

 

Installing Python NLTK

To get started, you need to install NLTK on your computer. If you are working in a Jupyter notebook, run the following command (in a regular terminal, drop the leading ! and run pip install nltk):

!pip install nltk

After installation, you need to import NLTK and download the necessary packages.

import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('maxent_ne_chunker')
nltk.download('words')

Here’s the output that you should expect:

[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /home/user/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
...

The above commands download several NLTK packages using nltk.download().

You will need these to perform tasks such as part of speech tagging, stopword removal, and lemmatization.
With the Natural Language Toolkit installed, we are now ready to explore the next steps of preprocessing.

 

Text Preprocessing

Text preprocessing is the practice of cleaning and preparing text data for machine learning algorithms. The primary steps include tokenizing, removing stop words, stemming, lemmatizing, and more.

These steps help reduce the complexity of the data and extract meaningful information from it.
In the coming sections of this tutorial, we’ll walk you through each of these steps using NLTK.

 

Sentence and word tokenization

Tokenization is the process of breaking down text into words, phrases, symbols, or other meaningful elements called tokens. The input to the tokenizer is Unicode text (a string), and the output is a list of sentences or words.

In NLTK, we have two types of tokenizers – the word tokenizer and the sentence tokenizer.
Let’s see an example:

from nltk.tokenize import sent_tokenize, word_tokenize
text = "Natural language processing is fascinating. It involves many tasks such as text classification, sentiment analysis, and more."
sentences = sent_tokenize(text)
print(sentences)
words = word_tokenize(text)
print(words)

Output:

['Natural language processing is fascinating.', 'It involves many tasks such as text classification, sentiment analysis, and more.']
['Natural', 'language', 'processing', 'is', 'fascinating', '.', 'It', 'involves', 'many', 'tasks', 'such', 'as', 'text', 'classification', ',', 'sentiment', 'analysis', ',', 'and', 'more', '.']

The sent_tokenize function splits the text into sentences, and the word_tokenize function splits the text into words. As you can see, punctuation is also treated as a separate token.

 

Stopwords removal

In natural language processing, stopwords are words that you want to ignore, so you filter them out when you’re processing your text.

These are usually words that occur very frequently in any text and do not convey much meaning, such as “is”, “an”, “the”, “in”, etc.
NLTK comes with a predefined list of stopwords in several languages, including English.
Let’s use NLTK to filter out stopwords from our list of tokenized words:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = "Natural language processing is fascinating. It involves many tasks such as text classification, sentiment analysis, and more."
stop_words = set(stopwords.words('english'))
words = word_tokenize(text)
filtered_words = [word for word in words if word.casefold() not in stop_words]
print(filtered_words)

Output:

['Natural', 'language', 'processing', 'fascinating', '.', 'involves', 'many', 'tasks', 'text', 'classification', ',', 'sentiment', 'analysis', ',', '.']

In this piece of code, we first import the stopwords from NLTK, tokenize the text, and then filter out the stopwords. The casefold() method is used to ignore the case while comparing words to the stop words list.
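
If you are curious what the list actually contains, you can inspect it directly. Here is a small sketch (the exact words and counts vary slightly between NLTK versions):

from nltk.corpus import stopwords

# Languages that ship with a predefined stopword list
print(stopwords.fileids())

# A sample of the English stopwords and the size of the list
english_stops = stopwords.words('english')
print(english_stops[:10])
print(len(english_stops))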

 

Stemming

Stemming is the process of reducing inflected words (like running, runs) to their root form (e.g., run). The ‘root’ in this case may not actually be a real dictionary word, but just a canonical form of the original word. NLTK provides interfaces to several well-known stemmers, such as PorterStemmer.
Here’s how to use NLTK’s PorterStemmer:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
text = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
porter_stemmer = PorterStemmer()
words = word_tokenize(text)
stemmed_words = [porter_stemmer.stem(word) for word in words]
print(stemmed_words)

Output:

['he', 'wa', 'run', 'and', 'eat', 'at', 'same', 'time', '.', 'he', 'ha', 'bad', 'habit', 'of', 'swim', 'after', 'play', 'long', 'hour', 'in', 'the', 'sun', '.']

In this piece of code, we first tokenize the text, and then we pass each word into the stem function of our stemmer.

Note how the words “running”, “eating”, “swimming”, and “playing” have been reduced to their root form: “run”, “eat”, “swim”, and “play”, respectively.
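
PorterStemmer is not the only option: NLTK also ships SnowballStemmer and LancasterStemmer, and Lancaster in particular is more aggressive. Here is a small comparison sketch; try it on your own words to see how the results differ:

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Compare the three stemmers side by side
for word in ["running", "generously", "organization"]:
    print(word, "->", porter.stem(word), snowball.stem(word), lancaster.stem(word))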

 

Lemmatization

Lemmatization takes the morphological analysis of a word into account and reduces it to its base or dictionary form.

Unlike stemming, it reduces inflected words properly, ensuring that the root word, also known as the lemma, is an actual word of the language.
We’ll use the WordNet lexical database for lemmatization. WordNetLemmatizer is a class that returns the lemma of a word.
Here’s an example:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
text = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
lemmatizer = WordNetLemmatizer()
words = word_tokenize(text)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)

Output:

['He', 'wa', 'running', 'and', 'eating', 'at', 'same', 'time', '.', 'He', 'ha', 'bad', 'habit', 'of', 'swimming', 'after', 'playing', 'long', 'hour', 'in', 'the', 'Sun', '.']

In this code, we first tokenize the text, and then we pass each word into the lemmatize function of our lemmatizer.

Notice that “was” and “has” came out as “wa” and “ha” rather than “be” and “have”. That happens because lemmatize treats every word as a noun by default, so the trailing “s” is stripped as if it were a plural ending. To get proper lemmas for verbs, pass the part of speech explicitly.
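
Here is a minimal sketch showing the effect of the pos argument; passing "v" tells the lemmatizer to treat the word as a verb:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# With the default POS (noun), 'was' is treated like a plural noun
print(lemmatizer.lemmatize("was"))               # wa
# With pos="v", the verb is reduced to its true lemma
print(lemmatizer.lemmatize("was", pos="v"))      # be
print(lemmatizer.lemmatize("running", pos="v"))  # run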

 

Part of Speech tagging

Part of speech (POS) tagging is the process of marking a word in a text as corresponding to a particular part of speech (noun, verb, adjective, etc.), based on both its definition and its context.

The NLTK library has a function called pos_tag to label words with a part of speech descriptor.
Let’s see it in action:

from nltk.tokenize import word_tokenize
from nltk import pos_tag
text = "Natural language processing is fascinating. It involves many tasks such as text classification, sentiment analysis, and more."
words = word_tokenize(text)
tagged_words = pos_tag(words)
print(tagged_words)

Output:

[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('is', 'VBZ'), ('fascinating', 'VBG'), ('.', '.'), ('It', 'PRP'), ('involves', 'VBZ'), ('many', 'JJ'), ('tasks', 'NNS'), ('such', 'JJ'), ('as', 'IN'), ('text', 'NN'), ('classification', 'NN'), (',', ','), ('sentiment', 'NN'), ('analysis', 'NN'), (',', ','), ('and', 'CC'), ('more', 'JJR'), ('.', '.')]

The pos_tag function returns a list of (word, tag) tuples, where the tag represents the part of speech. For instance, ‘NN’ stands for a noun, ‘JJ’ for an adjective, ‘VBZ’ for a verb in the third person singular present, and so on.

Here’s a list of some common POS (Part of Speech) tags used in NLTK, along with their meaning:

Tag Meaning
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb

These tags are part of the Penn Treebank tagset.
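
You do not have to memorize this table. NLTK can describe any tag for you via nltk.help.upenn_tagset, assuming the 'tagsets' resource has been downloaded:

import nltk
nltk.download('tagsets')

# Definition and examples for a single tag
nltk.help.upenn_tagset('VBZ')

# A regular expression works too, e.g. all noun tags
nltk.help.upenn_tagset('NN.*')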

 

Named Entity Recognition

Named Entity Recognition (NER) is the process of locating and classifying named entities present in your text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Let’s try out a simple example:

from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
text = "John works at Google in Mountain View, California."
words = word_tokenize(text)
tagged_words = pos_tag(words)
named_entities = ne_chunk(tagged_words)
print(named_entities)

The output will be a tree with the named entities as subtrees. The label of each subtree indicates the type of the entity (e.g., PERSON, ORGANIZATION, GPE). For instance:

(S
  (PERSON John/NNP)
  works/VBZ
  at/IN
  (ORGANIZATION Google/NNP)
  in/IN
  (GPE Mountain/NNP View/NNP)
  ,/,
  (GPE California/NNP)
  ./.)

In this code, we first tokenize the text and then tag each word with its part of speech.

The ne_chunk function then identifies the named entities. In the result, ‘John’ is recognized as a person, ‘Google’ as an organization, and ‘Mountain View’ and ‘California’ as geographical locations.
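
If you want the entities as plain strings rather than a printed tree, you can walk the tree and collect the labelled subtrees. Here is a minimal, self-contained sketch (the entity labels are the ones shown in the output above):

from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
from nltk.tree import Tree

text = "John works at Google in Mountain View, California."
named_entities = ne_chunk(pos_tag(word_tokenize(text)))

entities = []
for subtree in named_entities:
    # Labelled subtrees are named entities; plain (word, tag) tuples are ordinary tokens
    if isinstance(subtree, Tree):
        entities.append((" ".join(word for word, tag in subtree.leaves()), subtree.label()))

print(entities)  # e.g. [('John', 'PERSON'), ('Google', 'ORGANIZATION'), ...]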

 

Understanding synsets

A synset (or synonym set) is a collection of synonyms that are interchangeable in some contexts.

These are a very useful resource for building knowledge graphs, semantic links, or for finding the meaning of a word in a context.

NLTK provides an interface to the WordNet API, which can be used to look up words and their synonyms, definitions, and examples.
Let’s demonstrate how to use this:

from nltk.corpus import wordnet
syn = wordnet.synsets("dog")[0]
print(f"Synset name: {syn.name()}")
print(f"Lemma names: {syn.lemma_names()}")
print(f"Definition: {syn.definition()}")
print(f"Examples: {syn.examples()}")

Output:

Synset name: dog.n.01
Lemma names: ['dog', 'domestic_dog', 'Canis_familiaris']
Definition: a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
Examples: ['the dog barked all night']

In this code, wordnet.synsets("dog")[0] gives us the first synset of the word “dog”. The name method returns the name of the synset, lemma_names gives all synonyms, definition provides a brief definition, and examples provide usage examples.
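
A word usually has more than one synset, and you can restrict the lookup to a part of speech. Here is a short sketch (the exact senses returned depend on the WordNet version bundled with your NLTK data):

from nltk.corpus import wordnet

# Every sense of "dog", with its definition
for syn in wordnet.synsets("dog"):
    print(syn.name(), "-", syn.definition())

# Only the verb senses of "dog"
print(wordnet.synsets("dog", pos=wordnet.VERB))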

 

Semantic relationships

Semantic relationships between words are an integral part of natural language understanding tasks. NLTK provides easy-to-use interfaces to explore these relationships:

  • Hyponyms: More specific terms. For example, ‘poodle’ is a hyponym of ‘dog’.
  • Hypernyms: More general terms. For example, ‘dog’ is a hypernym of ‘poodle’.
  • Antonyms: Opposite terms. For example, ‘good’ is an antonym of ‘evil’.

Here’s how to explore these relationships using NLTK:

from nltk.corpus import wordnet
syn = wordnet.synsets('dog')[0]

# Get hyponyms for dog
hyponyms = syn.hyponyms()
print("Hyponyms of 'dog': ", [h.lemmas()[0].name() for h in hyponyms])

syn = wordnet.synsets('poodle')[0]

# Get hypernyms for poodle
hypernyms = syn.hypernyms()
print("Hypernyms of 'dog': ", [h.lemmas()[0].name() for h in hypernyms])

# Get Antonym
synsets = wordnet.synsets('good')
antonym = None

# Search for an antonym in all synsets/lemmas
for syn in synsets:
    for lemma in syn.lemmas():
        if lemma.antonyms():
            antonym = lemma.antonyms()[0].name()
            break
    if antonym:
        break
if antonym:
    print("Antonym of 'good': ", antonym)
else:
    print("No antonym found for 'good'")

Output:

Hyponyms of 'dog': ['basenji', 'corgi', 'cur', 'dalmatian', 'Great_Pyrenees', 'griffon', 'hunting_dog', 'lapdog', 'Leonberg', 'Mexican_hairless', 'Newfoundland', 'pooch', 'poodle', 'pug', 'puppy', 'spitz', 'toy_dog', 'working_dog']
Hypernyms of 'poodle': ['dog']
Antonym of 'good':  evil

The hyponyms method gives a list of more specific terms (hyponyms), while hypernyms gives a list of more general terms (hypernyms).

For antonyms, we first iterate over all synsets of “good”, and then over all lemmas of each synset. If we find an antonym, we break out of both loops and print it.

If no antonym is found after checking all synsets and lemmas, it prints a message to indicate that no antonym was found.

 

Measuring semantic similarity

We can also measure the semantic similarity between two words based on the distance between these words in the hypernym tree.
Here is an example:

from nltk.corpus import wordnet

# Get the first synset for each word
dog = wordnet.synsets('dog')[0]
cat = wordnet.synsets('cat')[0]

# Get the similarity value
similarity = dog.path_similarity(cat)
print("Semantic similarity between 'dog' and 'cat': ", similarity)

Output:

Semantic similarity between 'dog' and 'cat': 0.2

In this code, we first get the first synset of each word using wordnet.synsets(). Then we measure the semantic similarity between these synsets using path_similarity().

In this example, we are only comparing the first sense of each word. If you want a more comprehensive measure of similarity, you may need to compare all senses of the words and possibly aggregate the similarity scores in some way.

Here’s an example of how to do this:

from nltk.corpus import wordnet

# Get all synsets for each word
synsets_dog = wordnet.synsets('dog')
synsets_cat = wordnet.synsets('cat')

# Initialize max similarity
max_similarity = 0

# Compare all pairs of synsets
for synset_dog in synsets_dog:
    for synset_cat in synsets_cat:
        similarity = synset_dog.path_similarity(synset_cat)
        if similarity is not None:  # If the words are connected in the hypernym/hyponym taxonomy
            max_similarity = max(max_similarity, similarity)
print("Comprehensive semantic similarity between 'dog' and 'cat': ", max_similarity)

Output:

Comprehensive semantic similarity between 'dog' and 'cat': 0.2

In this script, we first get all synsets of each word using wordnet.synsets(). Then we initialize the max similarity to 0.

We compare all pairs of synsets and update the max similarity each time we find a higher similarity. Finally, we print the max similarity.
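
path_similarity is only one of the measures WordNet offers. Wu-Palmer similarity, for example, also takes into account how deep the synsets and their common ancestor sit in the taxonomy. A brief sketch (the exact score depends on your WordNet data):

from nltk.corpus import wordnet

dog = wordnet.synsets('dog')[0]
cat = wordnet.synsets('cat')[0]

# Wu-Palmer similarity, based on the depth of the least common subsumer
print(dog.wup_similarity(cat))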

 

Context-free grammar

In natural language processing, a context-free grammar (CFG) is a formal grammar consisting of production rules that describe all the sentences that can be generated in a given formal language.
Here’s how you can define a CFG in NLTK and generate sentences from it:

from nltk import CFG
from nltk.parse.generate import generate
grammar = CFG.fromstring("""
    S -> NP VP
    VP -> V NP | V NP PP
    PP -> P NP
    V -> "saw" | "ate"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my"
    N -> "dog" | "cat" | "cookie" | "park"
    P -> "in" | "on" | "by" | "with"
""")

for sentence in generate(grammar, n=10):  # generating only 10 sentences
    print(' '.join(sentence))

Output:

John saw John
John saw Mary
John saw Bob
John saw a dog
John saw a cat
John saw a cookie
John saw a park
John saw an dog
John saw an cat
John saw an cookie

In this code, we first define a context-free grammar in NLTK using the CFG.fromstring method.

The string contains the rules of the CFG in the format "LHS -> RHS", where LHS is a single non-terminal symbol, and RHS is a sequence of terminal and non-terminal symbols.

Then we generate sentences from the CFG using the nltk.parse.generate.generate function. Notice that the grammar happily produces “John saw an dog”: a CFG only encodes structure, so it will over-generate unless constraints such as article agreement are built into the rules.

 

Parse trees

A parse tree or parsing tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar.
Here’s how you can generate a parse tree for a sentence given a context-free grammar:

from nltk import CFG
from nltk.parse import RecursiveDescentParser

grammar = CFG.fromstring("""
    S -> NP VP
    VP -> V NP | V NP PP
    PP -> P NP
    V -> "saw" | "ate"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my"
    N -> "dog" | "cat" | "cookie" | "park"
    P -> "in" | "on" | "by" | "with"
""")

rd_parser = RecursiveDescentParser(grammar)

sentence = 'John saw a cat'.split()

for tree in rd_parser.parse(sentence):
    print(tree)

Output:

(S (NP John) (VP (V saw) (NP (Det a) (N cat))))

In this code, we first define a context-free grammar in NLTK using the CFG.fromstring method. Then we create a RecursiveDescentParser instance with the given grammar.

After that, we provide a sentence as a list of words to the parse method of the RecursiveDescentParser instance. This method returns a generator which generates all possible parse trees for the given sentence.
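
RecursiveDescentParser is easy to follow but can be slow and does not cope with left-recursive rules; ChartParser is a common alternative, and a Tree can be drawn as ASCII art with pretty_print. Here is a sketch using the same grammar as above:

from nltk import CFG
from nltk.parse import ChartParser

grammar = CFG.fromstring("""
    S -> NP VP
    VP -> V NP | V NP PP
    PP -> P NP
    V -> "saw" | "ate"
    NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
    Det -> "a" | "an" | "the" | "my"
    N -> "dog" | "cat" | "cookie" | "park"
    P -> "in" | "on" | "by" | "with"
""")

chart_parser = ChartParser(grammar)

for tree in chart_parser.parse('John saw a cat'.split()):
    tree.pretty_print()  # draws the tree as ASCII art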

 

Chunking

Chunking is the process of extracting phrases from unstructured text. Rather than working only with individual tokens, which may not capture the actual meaning of the text, it is often better to treat a phrase such as “South Africa” as a single unit instead of the two separate words ‘South’ and ‘Africa’.
Here’s how you can do noun phrase chunking in NLTK:

import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.chunk import RegexpParser
sentence = "The big cat ate the little mouse who was after fresh cheese"

# PoS tagging
tagged = pos_tag(word_tokenize(sentence))

# Define your grammar using regular expressions
grammar = ('''
    NP: {<DT>?<JJ>*<NN>}  # NP
''')
chunk_parser = RegexpParser(grammar)
result = chunk_parser.parse(tagged)
print(result)

Output:

(S
  (NP The/DT big/JJ cat/NN)
  ate/VBD
  (NP the/DT little/JJ mouse/NN)
  who/WP
  was/VBD
  after/IN
  (NP fresh/JJ cheese/NN))

In this code, we first tokenized and PoS tagged our sentence. Then we defined a grammar for a noun phrase (NP) to be any optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN).

Then we created a chunk parser with this grammar using RegexpParser, and finally parsed our tagged sentence.
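
The result is an nltk Tree, so you can pull out just the noun phrases by filtering the subtrees on their label. A minimal, self-contained sketch:

from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.chunk import RegexpParser

sentence = "The big cat ate the little mouse who was after fresh cheese"
result = RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(pos_tag(word_tokenize(sentence)))

# Keep only the subtrees labelled NP and join their words back into phrases
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    print(" ".join(word for word, tag in subtree.leaves()))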

 

Chinking

Chinking is the process of removing a sequence of tokens from a chunk. If the matching sequence of tokens spans an entire chunk, then the entire chunk is removed; if the sequence of tokens appears in the middle of the chunk, these tokens are removed, leaving two chunks where there was only one before.

If the sequence is at the periphery of the chunk, these tokens are removed, and a smaller chunk remains.
Here’s how you can do chinking in NLTK:

from nltk import pos_tag, RegexpParser
from nltk.tokenize import word_tokenize
sentence = "The big cat ate the little mouse who was after fresh cheese"

# PoS tagging
tagged = pos_tag(word_tokenize(sentence))

# We chink (remove from the chunk) one or more verbs, prepositions, determiners, or the word 'to'.

grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN|DT|TO>+{      # Chink sequences of VBD, IN, DT, TO
"""

chunk_parser = RegexpParser(grammar)
result = chunk_parser.parse(tagged)
print(result)

Output:

(S
  The/DT
  (NP big/JJ cat/NN)
  ate/VBD
  the/DT
  (NP little/JJ mouse/NN who/WP)
  was/VBD
  after/IN
  (NP fresh/JJ cheese/NN))

In this code, we first tokenize and PoS tag our sentence. Then we define a grammar for chinking: everything is chunked first, and then sequences of verbs, prepositions, determiners, and the word ‘to’ are chinked back out of the chunks.

Then we create a chunk parser with this grammar using RegexpParser, and finally parse our tagged sentence. Notice that, unlike in the chunking example, the determiners and verbs now sit outside the NP chunks, while ‘who’ remains chunked with ‘little mouse’ because WP is not in the chink pattern.

 

N-Grams

N-grams are used extensively in text mining and natural language processing tasks.

An n-gram is simply a contiguous sequence of n co-occurring words within a sliding window; when computing n-grams, the window typically moves forward one word at a time.
Here’s how to generate bigrams using NLTK:

from nltk import ngrams
from nltk.tokenize import word_tokenize

sentence = "The big cat ate the little mouse who was after fresh cheese"

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Generate bigrams
bigrams = list(ngrams(tokens, 2))

print(bigrams)

Output:

[('The', 'big'), ('big', 'cat'), ('cat', 'ate'), ('ate', 'the'), ('the', 'little'), ('little', 'mouse'), ('mouse', 'who'), ('who', 'was'), ('was', 'after'), ('after', 'fresh'), ('fresh', 'cheese')]

In the code above, we first tokenize our sentence, then generate bigrams using the ngrams function from NLTK.

The second argument to the ngrams function is the number of grams, in this case, 2. Hence, we get pairs of consecutive words.
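
The same function works for any n: passing 3 gives trigrams, and combining ngrams with FreqDist is a quick way to find the most frequent word pairs in a larger text. A short sketch:

from nltk import ngrams, FreqDist
from nltk.tokenize import word_tokenize

sentence = "The big cat ate the little mouse who was after fresh cheese"
tokens = word_tokenize(sentence)

# Trigrams: windows of three consecutive tokens
print(list(ngrams(tokens, 3))[:3])

# Count bigrams (on this tiny sentence every bigram occurs exactly once)
bigram_freq = FreqDist(ngrams(tokens, 2))
print(bigram_freq.most_common(3))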

 

Sentiment Analysis

For sentiment analysis, NLTK has a built-in module, nltk.sentiment.vader, which implements VADER, a lexicon- and rule-based sentiment analyzer whose word valence scores were derived from human ratings.

Here’s a basic example of how you can perform sentiment analysis using NLTK:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # the VADER lexicon needs to be downloaded once

sia = SentimentIntensityAnalyzer()
text = "Python is an awesome programming language."
print(sia.polarity_scores(text))

Output:

{'neg': 0.0, 'neu': 0.439, 'pos': 0.561, 'compound': 0.6249}

In the code above, we first create a SentimentIntensityAnalyzer object. Then we feed a piece of text to the analyzer and print the resulting sentiment scores.

The output is a dictionary that contains four keys: neg, neu, pos, and compound. The neg, neu, and pos values represent the proportions of negative, neutral, and positive sentiment in the text, respectively.

The compound score is a normalized summary metric that represents the overall sentiment of the text, ranging from -1 (most negative) to +1 (most positive).
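
A common rule of thumb from the VADER documentation is to treat a compound score of at least 0.05 as positive, at most -0.05 as negative, and anything in between as neutral. Here is a small sketch applying that rule (it assumes the vader_lexicon resource has already been downloaded, as in the example above):

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

sentences = [
    "Python is an awesome programming language.",
    "I really dislike debugging race conditions.",
    "The script finished running.",
]

for sentence in sentences:
    compound = sia.polarity_scores(sentence)["compound"]
    # +/- 0.05 are the conventional VADER thresholds
    if compound >= 0.05:
        label = "positive"
    elif compound <= -0.05:
        label = "negative"
    else:
        label = "neutral"
    print(label, compound, sentence)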

 

Information Retrieval

Information retrieval is the activity of finding, within a collection of resources, the ones that are relevant to an information need.

In terms of NLP and text mining, information retrieval is a critical component.
Here’s an example of how you can retrieve information about specific tokens using NLTK:

from nltk.text import Text
from nltk.tokenize import word_tokenize
text = "Python is a high-level programming language. Python is an interpreted language. Python is interactive."
tokens = word_tokenize(text)
text = Text(tokens)

# Concordance shows every occurrence of a word together with its surrounding context
text.concordance("Python")

Output:

Displaying 3 of 3 matches:
 Python is a high-level programming languag
a high-level programming language . Python is an interpreted language . Python
Python is an interpreted language . Python is interactive .

In this code, we first tokenize our text and create a Text object from our tokens. Then we call the concordance method on the Text object with the word “Python”. The concordance method prints every occurrence of the word along with its surrounding context; it prints its results and returns None, which is why we call it directly rather than wrapping it in print.
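
If what you actually want is words that appear in similar contexts, the Text class has a separate similar method (and common_contexts for the contexts shared by two words). A quick sketch; like concordance, these methods print their results rather than returning them, and on a text this small they may find little or nothing:

from nltk.text import Text
from nltk.tokenize import word_tokenize

text = Text(word_tokenize("Python is a high-level programming language. Python is an interpreted language. Python is interactive."))

# Words used in contexts similar to "Python"
text.similar("Python")

# Contexts shared by "Python" and "language"
text.common_contexts(["Python", "language"])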

 

Frequency Distribution

Frequency Distribution is used to count the frequency of each word in a text. It is a distribution because it tells us how the total number of word tokens in the text are distributed across the types of words.

Here’s how to calculate frequency distribution using NLTK:

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

text = "Python is an interpreted, high-level, general-purpose programming language."

# Tokenize the sentence
tokens = word_tokenize(text)

# Create frequency distribution
fdist = FreqDist(tokens)

# Print the frequency of each word
for word, freq in fdist.items():
    print(f'{word}: {freq}')

Output:

Python: 1
is: 1
an: 1
interpreted: 1
,: 2
high-level: 1
general-purpose: 1
programming: 1
language: 1
.: 1

In the code above, we first tokenize our text and then create a frequency distribution with the FreqDist class from NLTK. We pass our tokens to the FreqDist class.

The FreqDist object behaves like a dictionary: its items method yields (word, frequency) pairs, one for each distinct token in the text, and we print each word with its frequency.
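
FreqDist also has convenience methods such as most_common, and in practice you usually drop punctuation and stopwords before counting. A short sketch (it assumes the stopwords corpus downloaded earlier):

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = "Python is an interpreted, high-level, general-purpose programming language."
stop_words = set(stopwords.words('english'))

# Keep only alphabetic, non-stopword tokens before counting
tokens = [t for t in word_tokenize(text) if t.isalpha() and t.lower() not in stop_words]

fdist = FreqDist(tokens)
print(fdist.most_common(3))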

 

Further Reading

https://www.nltk.org/book/
