NLP Tutorial Using Python NLTK (Simple Examples)
In this post, we will talk about natural language processing (NLP) using Python. This NLP tutorial uses NLTK, a popular Python library for NLP.
So what is NLP? And what are the benefits of learning NLP?
What is NLP?
Simply and in short, natural language processing (NLP) is about developing applications and services that can understand human languages.
We are talking here about practical examples of natural language processing (NLP) like speech recognition, speech translation, understanding complete sentences, understanding synonyms of matching words, and writing complete grammatically correct sentences and paragraphs.
And that’s not all; think about the industrial applications of these ideas and their benefits.
Benefits of NLP
As you know, millions of gigabytes of data are generated every day by blogs, social websites, and web pages.
Many companies gather all of this data to understand users and their interests, then report back so those companies can adjust their plans.
This data could show that people in Brazil are happy with product A, which could be a movie or anything else, while people in the US are happy with product B. And the result could be instant (in real time), which is what search engines do: they give the appropriate results to the right people at the right time.
Search engines are not the only implementation of natural language processing (NLP); there are a lot of other awesome implementations out there.
NLP implementations
These are some of the successful implementations of Natural Language Processing (NLP):
- Search engines like Google, Yahoo, etc. The Google search engine understands that you are a tech person, so it shows you results relevant to you.
- Social website feeds like the Facebook news feed. The news feed algorithm understands your interests using natural language processing and shows you related ads and posts ahead of other posts.
- Speech engines like Apple Siri.
- Spam filters like Google spam filters. It’s not just about the usual spam filtering; spam filters now understand what’s inside the email content and decide whether it’s spam or not.
NLP libraries
There are many open-source Natural Language Processing (NLP) libraries, and these are some of them:
- Natural language toolkit (NLTK).
- Apache OpenNLP.
- Stanford NLP suite.
- Gate NLP library.
The Natural Language Toolkit (NLTK) is the most popular library for natural language processing (NLP). It is written in Python and has a big community behind it.
NLTK is also very easy to learn; it’s the easiest natural language processing (NLP) library you’ll use.
In this NLP Tutorial, we will use the Python NLTK library.
Before we install NLTK, I assume that you know some Python basics to get started.
Install NLTK
Whether you are using Windows, Linux, or Mac, you can install NLTK using pip:
$ pip install nltk
At the time of writing this post, you can use NLTK with Python 2.7, 3.4, and 3.5.
Alternatively, you can install it from source from this tar.
To check if NLTK has been installed correctly, you can open the Python terminal and type the following:
import nltk
If everything goes fine, that means you’ve successfully installed the NLTK library.
Once you’ve installed NLTK, you should install the NLTK packages by running the following code:
import nltk
nltk.download()
This will show the NLTK downloader to choose what packages you need to install.
You can install all the packages, since they are small, so that’s no problem. Now let’s start the show.
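If you’d rather skip the interactive downloader, you can fetch just the packages this tutorial relies on. A minimal sketch (the package names are the identifiers used by the NLTK downloader):

import nltk

# Download only the data packages used in this tutorial.
nltk.download('punkt')      # models for the sentence and word tokenizers
nltk.download('stopwords')  # stop word lists for many languages
nltk.download('wordnet')    # the WordNet database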
Tokenize text using pure Python
First, we will grab the content of a web page, then we will analyze the text to see what the page is about.
We will use the urllib module to crawl the web page:
import urllib.request

response = urllib.request.urlopen('http://php.net/')
html = response.read()
print(html)
As you can see from the printed output, the result contains a lot of HTML tags that need to be cleaned.
We can use BeautifulSoup to clean the grabbed text like this:
from bs4 import BeautifulSoup
import urllib.request

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)
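Note that BeautifulSoup and the html5lib parser we pass to it are separate packages, not part of NLTK. If the import fails, you can install both with pip:

$ pip install beautifulsoup4 html5lib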
Now we have a clean text from the crawled web page.
Awesome, right?
Finally, let’s convert that text into tokens by splitting the text like this:
from bs4 import BeautifulSoup
import urllib.request

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = [t for t in text.split()]
print(tokens)
Count word frequency
The text is much better now. Let’s calculate the frequency distribution of those tokens using Python NLTK.
NLTK has a function called FreqDist() that does the job:
from bs4 import BeautifulSoup
import urllib.request
import nltk

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = [t for t in text.split()]
freq = nltk.FreqDist(tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
If you search the output, you’ll find that the most frequent token is PHP.
You can plot a graph for those tokens using the plot function like this:
freq.plot(20, cumulative=False)
From the graph, you can be sure that this article is talking about PHP.
Great!!
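By the way, the plot function needs the matplotlib library installed. If you just want the numbers instead of a graph, FreqDist also has a most_common method; a small sketch that reuses the freq object from above:

# Print the 20 most frequent tokens without drawing a graph.
for token, count in freq.most_common(20):
    print(token, count)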
There are some words like The, Of, a, an, and so on. These words are stop words. Generally, you should remove stop words to prevent them from affecting the results.
Remove stop words using NLTK
NLTK comes with stop words lists for most languages. To get English stop words, you can use this code:
from nltk.corpus import stopwords
stopwords.words('english')
Now, let’s modify our code and clean the tokens before plotting the graph.
First, we will make a copy of the list; then we will iterate over the tokens and remove the stop words:
clean_tokens = tokens[:]
sr = stopwords.words('english')
for token in tokens:
    if token in sr:
        clean_tokens.remove(token)
You can review the Python list functions to know how to process lists.
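As a side note, a shorter and faster alternative is to build a set of stop words once and filter with a list comprehension. A sketch (the lower() call is my addition here, since the NLTK stop word list is all lowercase):

from nltk.corpus import stopwords

# Set membership tests are much faster than searching a list.
sw = set(stopwords.words('english'))
clean_tokens = [token for token in tokens if token.lower() not in sw]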
So the final code should be like this:
from bs4 import BeautifulSoup
import urllib.request
import nltk
from nltk.corpus import stopwords

response = urllib.request.urlopen('http://php.net/')
html = response.read()
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
tokens = [t for t in text.split()]
clean_tokens = tokens[:]
sr = stopwords.words('english')
for token in tokens:
    if token in sr:
        clean_tokens.remove(token)
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
If you check the graph now, it looks better than before, since the stop words are no longer counted.
freq.plot(20,cumulative=False)
Tokenize text using NLTK
We saw how to split the text into tokens using the split function. Now we will see how to tokenize the text using NLTK.
Tokenizing text is important since text can’t be processed without tokenization. Tokenization means splitting bigger parts into smaller parts.
You can tokenize paragraphs into sentences and tokenize sentences into words according to your needs. NLTK comes with a sentence tokenizer and a word tokenizer.
Let’s assume that we have a sample text like the following:
Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude.
To tokenize this text into sentences, we will use the sentence tokenizer:
from nltk.tokenize import sent_tokenize

mytext = "Hello Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))
The output is the following:
['Hello Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']
You may say that this is an easy job, that you don’t need NLTK tokenization and could split sentences with regular expressions, since every sentence is followed by punctuation and a space.
Well, take a look at the following text:
Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude.
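For example, here is what a naive regex split does with that text (a hypothetical sketch, not part of NLTK; the pattern splits after any ., ?, or ! followed by a space):

import re

mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
# Split after any sentence-ending punctuation followed by whitespace.
print(re.split(r'(?<=[.?!])\s', mytext))

The first “sentence” in the output ends right after Mr., which is wrong.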
Uh! The abbreviation Mr. breaks that rule, since its period doesn’t end a sentence. OK, let’s try NLTK:
from nltk.tokenize import sent_tokenize

mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(sent_tokenize(mytext))
The output looks like this:
['Hello Mr. Adam, how are you?', 'I hope everything is going well.', 'Today is a good day, see you dude.']
Great! It works like a charm.
OK, let’s try the word tokenizer to see how it works.
from nltk.tokenize import word_tokenize

mytext = "Hello Mr. Adam, how are you? I hope everything is going well. Today is a good day, see you dude."
print(word_tokenize(mytext))
The output is:
['Hello', 'Mr.', 'Adam', ',', 'how', 'are', 'you', '?', 'I', 'hope', 'everything', 'is', 'going', 'well', '.', 'Today', 'is', 'a', 'good', 'day', ',', 'see', 'you', 'dude', '.']
The word Mr. is one word as expected.
NLTK uses PunktSentenceTokenizer, which is part of the nltk.tokenize.punkt module.
This tokenizer is trained to work with many languages.
Tokenize non-English text
To tokenize other languages, you can specify the language like this:
from nltk.tokenize import sent_tokenize

mytext = "Bonjour M. Adam, comment allez-vous? J'espère que tout va bien. Aujourd'hui est un bon jour."
print(sent_tokenize(mytext, "french"))
The result will be like this:
['Bonjour M. Adam, comment allez-vous?', "J'espère que tout va bien.", "Aujourd'hui est un bon jour."]
We are doing well.
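The word tokenizer accepts the same language argument if you need it; a minimal sketch:

from nltk.tokenize import word_tokenize

mytext = "Bonjour M. Adam, comment allez-vous?"
print(word_tokenize(mytext, language="french"))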
Get synonyms from WordNet
If you remember, we installed NLTK packages using nltk.download(). One of the packages was WordNet.
WordNet is a database built for natural language processing. It includes groups of synonyms along with brief definitions.
You can get these definitions and examples for a given word like this:
from nltk.corpus import wordnet

syn = wordnet.synsets("pain")
print(syn[0].definition())
print(syn[0].examples())
The result is:
a symptom of some physical hurt or disorder
['the patient developed severe pain and distension']
WordNet includes a lot of definitions:
from nltk.corpus import wordnet

syn = wordnet.synsets("NLP")
print(syn[0].definition())
syn = wordnet.synsets("Python")
print(syn[0].definition())
The result is:
the branch of information science that deals with natural language information
large Old World boas
You can use WordNet to get synonymous words like this:
from nltk.corpus import wordnet

synonyms = []
for syn in wordnet.synsets('Computer'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
print(synonyms)
The output is:
['computer', 'computing_machine', 'computing_device', 'data_processor', 'electronic_computer', 'information_processing_system', 'calculator', 'reckoner', 'figurer', 'estimator', 'computer']
Cool!!
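Notice that computer appears twice in the list, because the same lemma name can belong to more than one synset. If you want unique synonyms only, wrap the list in a set:

# Reuses the synonyms list from above; a set drops the duplicates.
print(set(synonyms))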
Get antonyms from WordNet
You can get antonyms the same way; all you have to do is check each lemma before adding it to the list, and only add it if it has antonyms.
from nltk.corpus import wordnet

antonyms = []
for syn in wordnet.synsets("small"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
print(antonyms)
The output is:
['large', 'big', 'big']
This is the power of NLTK in natural language processing.
NLTK word stemming
Word stemming means removing affixes from words to return the root word. For example, the stem of the word working is work.
Search engines use this technique when indexing pages; people write many different versions of the same word, and all of them reduce to the same root word.
There are many algorithms for stemming, but the most used algorithm is the Porter stemming algorithm.
NLTK has a class called PorterStemmer, which uses the Porter stemming algorithm.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('working'))
The result is:
work
Clear enough.
There are other stemming algorithms, like the Lancaster stemming algorithm, whose output differs a bit for a few words. You can try both of them and compare the results, as in the sketch below.
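Here is a small comparison sketch; LancasterStemmer lives in nltk.stem just like PorterStemmer (the sample words are my own picks):

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# The two algorithms can disagree on some words.
for word in ['working', 'maximum', 'presumably']:
    print(word, '->', porter.stem(word), '/', lancaster.stem(word))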
Stemming non-English words
SnowballStemmer can stem 13 languages besides the English language.
The supported languages are:
from nltk.stem import SnowballStemmer
print(SnowballStemmer.languages)
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
You can use the stem function of the SnowballStemmer class to stem non-English words like this:
from nltk.stem import SnowballStemmer

french_stemmer = SnowballStemmer('french')
print(french_stemmer.stem("French word"))
The French people can tell us about the results :).
Lemmatizing words using WordNet
Word lemmatizing is similar to stemming, but the difference is that the result of lemmatizing is a real word.
Stemming, by contrast, can produce something that isn’t a word at all. For example, if you try to stem certain words, you get results like this:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('increases'))
The result is:
increas
Now, if we try to lemmatize the same word using NLTK WordNet, the result is correct:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('increases'))
The result is:
increase
The result might be a synonym or a different word with the same meaning.
Sometimes, if you try to lemmatize a word like playing, it will end up as the same word.
This is because the default part of speech is noun. To get the verb, you should specify the part of speech like this:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
The result is:
play
This is a very good level of text compression; you end up with about 50% to 60% compression.
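You can get a rough feel for that reduction by comparing vocabulary sizes before and after lemmatizing. A toy sketch (the sentence is made up, and the exact ratio depends entirely on your text):

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
text = "He plays well, she played well, they play well, playing is fun."
tokens = word_tokenize(text.lower())

# Count distinct tokens before and after lemmatizing as verbs.
lemmas = [lemmatizer.lemmatize(t, pos="v") for t in tokens]
print(len(set(tokens)), len(set(lemmas)))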
The result could be a verb, noun, adjective, or adverb:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('playing', pos="v"))
print(lemmatizer.lemmatize('playing', pos="n"))
print(lemmatizer.lemmatize('playing', pos="a"))
print(lemmatizer.lemmatize('playing', pos="r"))
The result is:
play
playing
playing
playing
Stemming and lemmatization difference
OK, let’s try stemming and lemmatization for some words:
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('stones'))
print(stemmer.stem('speaking'))
print(stemmer.stem('bedroom'))
print(stemmer.stem('jokes'))
print(stemmer.stem('lisa'))
print(stemmer.stem('purple'))
print('----------------------')
print(lemmatizer.lemmatize('stones'))
print(lemmatizer.lemmatize('speaking'))
print(lemmatizer.lemmatize('bedroom'))
print(lemmatizer.lemmatize('jokes'))
print(lemmatizer.lemmatize('lisa'))
print(lemmatizer.lemmatize('purple'))
The result is:
stone
speak
bedroom
joke
lisa
purpl
----------------------
stone
speaking
bedroom
joke
lisa
purple
Stemming works on words without knowing their context, which is why stemming has lower accuracy but is faster than lemmatization.
In my opinion, lemmatizing is better than stemming. Word lemmatizing returns a real word; even if it’s not the same word, it could be a synonym, but at least it’s a real word.
Sometimes you don’t care about this level of accuracy and all you need is speed; in this case, stemming is better.
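If you want to see the speed difference yourself, here is a rough micro-benchmark sketch using timeit (exact numbers depend on your machine; the warm-up call in setup makes sure the one-time WordNet loading cost isn’t counted):

import timeit

setup = """
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('warm')  # load WordNet once, outside the timing
"""
# Time 10,000 calls of each; stemming is usually much faster.
print(timeit.timeit("stemmer.stem('speaking')", setup=setup, number=10000))
print(timeit.timeit("lemmatizer.lemmatize('speaking', pos='v')", setup=setup, number=10000))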
All the steps we discussed in this NLP tutorial were text preprocessing. In future posts, we will discuss text analysis using Python NLTK.
I’ve done my best to make the article easy and as simple as possible. I hope you find it useful.
Keep coming back. Thank you.