Remove punctuation using Python

If you have ever processed a large amount of textual data, you know the pain of finding and removing irrelevant words or characters from the text.
Doing this job manually, even with the help of modern word processors, can be time-consuming and frustrating.
Fortunately, programming languages such as Python support powerful text processing libraries that help us do such clean-up jobs efficiently.
In this tutorial, we will look at various ways of removing punctuation from a text in Python.

 

 

Why Remove punctuation?

Removing punctuation is a common preprocessing step in many data analysis and machine learning tasks.
For example, if you’re building a text classification model or constructing a word cloud from a given text corpus, punctuation is of no use, so we remove it at the pre-processing step.
If you’re working on user-generated text data such as social media posts, you’ll often encounter excessive punctuation that isn’t useful for the task at hand, so removing it becomes an essential pre-processing step.

 

Using replace method

Python strings come with many useful methods. One such method is the replace method.
Using this method, you can replace a specific character or substring in a given string with another character or substring.
Let us look at an example.

s = "Hello World, Welcome to my blog."

print(s)

s1 = s.replace('W', 'V')

print(s1)

Output:

Basic use of replace method in Python

This method, by default, replaces all occurrences of a given character or substring in the given string.
We can limit the number of occurrences to replace by passing a ‘count’ value as the 3rd parameter to the replace method.

Here’s an example where we first use the default value of count (-1, meaning no limit) and then pass a custom value for it.

s = "Hello world, Welcome to my blog."

print(s)

s1 = s.replace('o', 'a')

print(f"After replacing all o's with a's: {s1}")

# replace only first 2 o's
s2 = s.replace('o', 'a', 2)

print(f"After replacing first two o's: {s2}")

Output:

Use of replace method with count param

It is important to note that in all our usages of the replace method, we’ve stored the result string in a new variable.
This is because strings are immutable. Unlike lists, we cannot modify them in place.
Hence, all string modification methods return a new, modified string that we store in a new variable.
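To make the point about immutability concrete, here is a minimal sketch (the variable names are only illustrative). Assigning to a position in a string raises a TypeError, whereas replace leaves the original untouched and hands back a new string.

```python
s = "Hello World"

try:
    s[0] = "J"  # strings do not support item assignment
except TypeError as err:
    error_message = str(err)
    print(f"Cannot modify in place: {error_message}")

# replace() returns a new string; the original is unchanged
s_new = s.replace("H", "J")

print(s)      # Hello World
print(s_new)  # Jello World
```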

Now let’s figure out how we should use this method to replace all occurrences of punctuation in a string.

We must first define a list of all the punctuation marks that we are not interested in and want to get rid of.
We then iterate over each of these punctuation marks and pass it to the replace method called on the input string.
Also, since we want to remove the punctuation, we pass an empty string as the 2nd parameter, i.e., as the replacement.

user_comment = "NGL, i just loved the moviee...... excellent work !!!"

print(f"input string: {user_comment}")

clean_comment = user_comment #copy the string in new variable, we'll store the result in this variable

# define list of punctuation to be removed
punctuation = [',','.','!']

# iteratively remove all occurrences of each punctuation in the input
for p in punctuation:

    clean_comment = clean_comment.replace(p,'') #not specifying 3rd param, since we want to remove all occurrences

print(f"clean string: {clean_comment}")

Output:

Using replace method to remove all punctuation

Since it was a short text, we could anticipate what kind of punctuation we would encounter.
But real-world inputs could span thousands of lines of text, and it would be difficult to figure out which punctuation marks are present and need to be eliminated.
However, if we know all the punctuation we may encounter in an English text, our task becomes easy.
Python’s string module provides the ASCII punctuation characters in the constant string.punctuation. It’s a plain string containing these characters.

import string

all_punctuation = string.punctuation

print(f"All punctuation: {all_punctuation}")

Output:

Using string library to list all punctuation

Once we have all the punctuation as a sequence of characters, we can run the previous for loop on any text input, however large, and the output will be free of punctuation.
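The loop described above could look roughly like the following sketch. The function name remove_punctuation_with_replace is made up for illustration; the only library piece used is string.punctuation.

```python
import string


def remove_punctuation_with_replace(text):
    """Remove every character listed in string.punctuation from text."""
    for p in string.punctuation:
        # each call returns a new string with all occurrences of p removed
        text = text.replace(p, "")
    return text


user_comment = "NGL, i just loved the moviee...... excellent work !!!"

clean_comment = remove_punctuation_with_replace(user_comment)

print(f"clean string: {clean_comment}")
```

Keep in mind that string.punctuation only contains ASCII punctuation, so characters such as curly quotes or ellipsis symbols would pass through this loop unchanged.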

 

Using maketrans and translate

There is another way in Python to replace all occurrences of a set of characters in a string with their desired equivalents.
In this method, we first create a ‘translation table’ using str.maketrans. This table specifies a one-to-one mapping between characters.
We then pass this translation table to the translate method called on the input string.
This method returns a modified string where original characters are replaced by their replacements as defined in the translation table.

Let’s understand this through a simple example. We will replace all occurrences of ‘a’ with ‘e’, ‘o’ with ‘u’, and ‘i’ with ‘y’.

tr_table = str.maketrans('aoi', 'euy') #defining the translation table: a=>e, o=>u, i=>y

s = "i absolutely love the american ice-cream!"

print(f"Original string: {s}")

s1 = s.translate(tr_table) #or str.translate(s, tr_table)

print(f"Translated string: {s1}")

Output:

Basic use of maketrans and translate methods

In the maketrans method, the first two strings need to be of equal length, as each character in the 1st string corresponds to its replacement/translation in the 2nd string.
The method accepts an optional 3rd string parameter specifying characters that need to be mapped to None, meaning they don’t have replacements and hence will be removed (this is the functionality we need to remove punctuation).
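As a small sketch of the 3-parameter form, the earlier vowel mapping can be combined with a set of characters to delete. The choice of ‘!’ and ‘-’ as the characters to drop is just an assumption for this example.

```python
# a=>e, o=>u, i=>y, while '!' and '-' are mapped to None (deleted)
tr_table = str.maketrans("aoi", "euy", "!-")

s = "i absolutely love the american ice-cream!"

s1 = s.translate(tr_table)

print(f"Translated string: {s1}")
```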

We can also create the translation table using a dictionary of mappings instead of the two string parameters.

This additionally allows us to create character-to-string mappings, which let us replace a single character with a whole string (which is impossible with the two string parameters).
The dictionary approach also helps us explicitly map any character(s) to None, indicating those characters need to be removed.

Let us use the previous example and create the mapping using a dictionary.
Now, we will also map ‘!’ to None, which will result in the removal of the punctuation from the input string.

mappings = {
    'a':'e',
    'o':'u',
    'i':'eye',
    '!': None
}

tr_table = str.maketrans(mappings) 

s = "i absolutely love the american ice-cream!"

print(f"Original string: {s}")

print(f"translation table: {tr_table}")

s1 = s.translate(tr_table) #or str.translate(s, tr_table)

print(f"Translated string: {s1}")

Output:

Use of maketrans with dictionary parameter

Note that when we print the translation table, the keys are integers instead of characters. These are the Unicode values of the characters we had defined when creating the table.
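If the integer keys are hard to read, the built-in chr function turns them back into characters. The following is only a convenience sketch for inspecting a table; the name readable is illustrative.

```python
tr_table = str.maketrans({"a": "e", "!": None})

print(tr_table)  # {97: 'e', 33: None}

# convert the code-point keys back to characters for inspection
readable = {chr(code): replacement for code, replacement in tr_table.items()}

print(readable)  # {'a': 'e', '!': None}
```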

Finally, let’s use this approach to remove all punctuation occurrences from a given input text.

import string

s = """I reached at the front of the billing queue. The cashier started scanning my items, one after the other. 
Off went from my cart the almonds, the butter, the sugar, the coffee.... when suddenly I heard an old lady, the 3rd in queue behind me, scream at me, "What y'all taking all day for ! are you hoarding for the whole year !".
The cashier looked tensed, she dashed all the remaining products as fast as she could, and then squeaked in a nervous tone, "That would be 298.5, sir !"."""

print(f"input string:\n{s}\n")

tr_table = str.maketrans("","", string.punctuation)

s1 = s.translate(tr_table)

print(f"translated string:\n{s1}\n")

Output:

Use of maketrans and translate to remove all the punctuation

 

Using RegEx

RegEx, or Regular Expression, is a sequence of characters representing a string pattern.
In text-processing, it is used to find, replace, or delete all such substrings that match the pattern defined by the regular expression.
For example, the regex “\d{10}” represents 10-digit numbers, and the regex “[A-Z]{3}” represents any 3-letter uppercase code. Let us use the latter to find country codes in a sentence.

import re 

# define regex pattern for 3-lettered country codes.
c_pattern = re.compile("[A-Z]{3}")

s = "At the Olympics, the code for Japan is JPN, and that of Brazil is BRA. RSA stands for the 'Republic of South Africa' while ARG for Argentina."

print(f"Input: {s}")

# find all substrings matching the above regex
countries = re.findall(c_pattern, s)

print(f"Countries fetched: {countries}")

Output:

Basic use of regex

All occurrences of 3-lettered uppercase codes have been identified with the help of the regex we defined.
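One caveat: the pattern [A-Z]{3} also matches the first three letters of a longer uppercase word. A possible refinement, shown here as a sketch on a made-up sentence, is to add word boundaries (\b) around the pattern.

```python
import re

s = "NASA listed JPN and BRA as partners."

# without boundaries, the first three letters of "NASA" also match
loose = re.findall(r"[A-Z]{3}", s)

# \b requires a word boundary on both sides of the 3-letter code
strict = re.findall(r"\b[A-Z]{3}\b", s)

print(f"Without word boundaries: {loose}")   # ['NAS', 'JPN', 'BRA']
print(f"With word boundaries: {strict}")     # ['JPN', 'BRA']
```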

If we want to replace all the matching patterns in the string with something, we can do so using the re.sub method.
Let us try replacing all occurrences of the country codes with a default code “DEF” in the earlier example.

c_pattern = re.compile("[A-Z]{3}")

s = "At the Olympics, the code for Japan is JPN, and that of Brazil is BRA. RSA stands for the 'Republic of South Africa' while ARG for Argentina.\n"

print(f"Input:\n{s}")

new_s = re.sub(c_pattern, "DEF", s)

print(f"After replacement:\n{new_s}")

Output:

Replacing regex matching strings

We can use the same method to replace all occurrences of the punctuation with an empty string. This would effectively remove all the punctuation from the input string.
But first, we need to define a regex pattern that would represent all the punctuation.
While there is no special character class for punctuation, like \d for digits, we can either explicitly list all the punctuation that we’d like to replace,
or we can define a regex that excludes all the characters we would like to retain.

For example, if we know that we can expect only the English alphabet, digits, and whitespace, then we can exclude them all in our regex using the caret symbol ^.
Everything else by default will be matched and replaced.

Let’s define it both ways.

import string, re

p_punct1 = re.compile(f"[{re.escape(string.punctuation)}]") #explicit list; re.escape keeps \, ], ^ and - literal inside the brackets

print(f"regex 1 for punctuation: {p_punct1}")

p_punct2 = re.compile(r"[^\w\s]") #definition by exclusion (raw string, so \w and \s reach the regex engine unchanged)

print(f"regex 2 for punctuation: {p_punct2}")

Output:

Defining regex for punctuation

Now let us use both of them to replace all the punctuation from a sentence. We’ll use an earlier sentence that contains various punctuation.

import re, string

s = """I reached at the front of the billing queue. The cashier started scanning my items, one after the other. 
Off went from my cart the almonds, the butter, the sugar, the coffee.... when suddenly I heard an old lady, the 3rd in queue behind me, scream at me, "What y'all taking all day for ! are you hoarding for the whole year !".
The cashier looked tensed, she dashed all the remaining products as fast as she could, and then squeaked in a nervous tone, "That would be 298.5, sir !"."""

print(f"input string:\n{s}\n")

s1 = re.sub(p_punct1, "", s)

print(f"after removing punctuation using 1st regex:\n{s1}\n")

s2 = re.sub(p_punct2, "", s)

print(f"after removing punctuation using 2nd regex:\n{s2}\n")

Output:

Using regex to replace punctuation

For this input, both of them produced results identical to each other and to the maketrans method we used earlier.
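The two patterns are not equivalent for every input, though. The sketch below uses a made-up sentence to show where they differ: \w includes the underscore, so the exclusion pattern keeps it, while the explicit list only covers ASCII punctuation and leaves symbols such as ‘€’ or ‘…’ in place.

```python
import re
import string

p_listed = re.compile(f"[{re.escape(string.punctuation)}]")  # explicit ASCII list
p_excluded = re.compile(r"[^\w\s]")                          # everything except \w and \s

s = "snake_case costs €5… really?"

by_list = p_listed.sub("", s)
by_exclusion = p_excluded.sub("", s)

print(f"explicit list: {by_list}")
print(f"by exclusion:  {by_exclusion}")
```

Which behaviour is preferable depends on the data: the explicit list is predictable for plain ASCII text, while the exclusion pattern is the more thorough choice for text with typographic or non-English symbols.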

 

Using nltk

Python’s nltk is a popular, open-source NLP library. It offers a large range of language datasets, text-processing modules, and a host of other features required in NLP.
nltk has a method called word_tokenize, which is used to break the input sentence into a list of words. This is one of the first steps in any NLP pipeline.
Let’s look at an example.

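import nltk

# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing, download it once,
# e.g. nltk.download('punkt') (newer NLTK releases ask for 'punkt_tab' instead)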

s = "We can't lose this game so easily, not without putting up a fight!"

tokens = nltk.word_tokenize(s)

print(f"input: {s}")

print(f"tokens: {tokens}")

Output:

Using nltk to tokenize given string

The default tokenizer being used by nltk retains punctuation and splits the tokens based on whitespace and punctuation.

We can use nltk’s RegexpTokenizer to specify token patterns using regex.

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"\w+") #\w+ matches runs of word characters: letters, digits and _

s = "We can't lose this game so easily, not without putting up a fight!"

tokens = tokenizer.tokenize(s)

print(f"input: {s}\n")

print(f"tokens: {tokens}\n")

new_s = " ".join(tokens)

print(f"New string: {new_s}\n")

Output:

Using nltk to remove punctuation
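If nltk is not installed, Python’s built-in re module can be used in much the same way; for a token pattern like \w+, re.findall should yield the same tokens. The sketch below also shows a variant pattern, [\w']+, which is an assumption of this example (not part of the code above) and keeps contractions such as “can't” in one piece.

```python
import re

s = "We can't lose this game so easily, not without putting up a fight!"

# runs of word characters; the apostrophe splits "can't" into "can" and "t"
tokens = re.findall(r"\w+", s)

# allowing apostrophes inside tokens keeps contractions intact
tokens_with_apostrophes = re.findall(r"[\w']+", s)

print(f"tokens: {tokens}")
print(f"tokens keeping apostrophes: {tokens_with_apostrophes}")

new_s = " ".join(tokens_with_apostrophes)

print(f"New string: {new_s}")
```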

 

Remove punctuation from start and end only

If we want to remove the punctuation only from the start and end of the sentence, and not those between, we can define a regex representing such a pattern and use it to remove the leading and the trailing punctuation.

Let’s first use one such regular expression in an example, and then we will dive deeper into that regex.

import re

pattern = re.compile(r"(^[^\w\s]+)|([^\w\s]+$)")

sentence = '"I am going to be the best player in history!"'

print(sentence)

print(re.sub(pattern,"", sentence))

Output:

Using regex to remove punctuation from start and end

The output shows the quotes (“) at the beginning and end, as well as the exclamation mark (!) at the second-to-last position, have been removed.
The punctuation occurring between the words, on the other hand, is retained.

The regex being used to achieve this is (^[^\w\s]+)|([^\w\s]+$)

There are two different patterns in this regex, each enclosed in parentheses and separated by an OR sign (|). That means that if either of the two patterns exists in the string, it will be matched by the given regex.
The first part of the regex is “^[^\w\s]+”. There are two caret signs (^) here, one inside the square brackets and the other outside.
The first caret, i.e., the one preceding the opening square bracket, tells the regex compiler to “match any substring that occurs at the BEGINNING of the string and matches the following pattern”.
The square brackets define a set of characters to match.
The caret inside the square brackets tells the compiler to “match everything EXCEPT \w and \s”. \w represents alphanumeric characters and the underscore, and \s represents whitespace.
Thus, everything at the beginning other than alphanumeric characters, underscores, and whitespace (which would essentially be the punctuation) will be matched by the first part of the regex. The trailing + means one or more such characters.

The second component is almost identical to the first one, except that it matches the specified set of characters occurring AT THE END of the string. This is denoted by the trailing character $.
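For plain ASCII punctuation, a simpler alternative to the regex is the strip method, which accepts a string of characters to remove from both ends. The following sketch assumes the punctuation sits at the very start and end of the string; strip stops at the first character that is not in the given set, so surrounding whitespace would block it.

```python
import string

sentence = '"I am going to be the best player in history!"'

# strip removes any leading/trailing characters found in string.punctuation
trimmed = sentence.strip(string.punctuation)

print(trimmed)

# punctuation between the words is left untouched
other = "...Wait, what?!".strip(string.punctuation)

print(other)
```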

 

Remove punctuation and extra spaces

In addition to removing punctuation, removing extra spaces is a common preprocessing step.
Removing extra spaces doesn’t require the use of any regex or nltk method. Python string’s strip method is used to remove any leading or trailing whitespace characters.

s = " I have an idea! \t "

print(f"input string with white spaces = {s}, length = {len(s)}\n")

s1 = s.strip()

print(f"after removing spaces from both ends: {s1}, length = {len(s1)}")

Output:

Removing white spaces from both ends

The strip method removes white spaces only at the beginning and end of the string.
We would also like to remove the extra spaces between the words.
Both of these can be achieved by splitting the string with the split method and then joining the pieces with a single space " ".

Let us combine the removal of punctuation and extra spaces in an example.

import string

tr_table = str.maketrans("","", string.punctuation) # for removing punctuation

s = '   "   I am going to be     the best,\t  the most-loved, and...    the richest player in history!  " '

print(f"Original string:\n{s},length = {len(s)}\n")

s = s.translate(tr_table)

print(f"After removing punctuation:\n{s},length = {len(s)}\n")

s = " ".join(s.split())

print(f"After removing extra spaces:\n{s},length = {len(s)}")

Output:

Removing punctuations and extra white spaces
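The two steps can also be wrapped in a small helper. The sketch below is one possible arrangement; the name clean_text is made up, and it collapses whitespace with re.sub instead of split and join, which gives the same result here.

```python
import re
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Drop ASCII punctuation, then collapse runs of whitespace to one space."""
    no_punct = text.translate(PUNCT_TABLE)
    return re.sub(r"\s+", " ", no_punct).strip()


s = '   "   I am going to be     the best,\t  the most-loved, and...    the richest player in history!  " '

cleaned = clean_text(s)

print(f"Cleaned string:\n{cleaned},length = {len(cleaned)}")
```

Note that removing the hyphen joins “most-loved” into “mostloved”. If that is undesirable, one option is to replace hyphens with spaces before removing the rest of the punctuation.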

 

Remove punctuation from a text file

So far, we have been working on short strings that were stored in variables of type str and were no longer than 2-3 sentences.
But in the real world, the actual data may be stored in large files on the disk.
In this section, we will look at how to remove punctuation from a text file.

First, let’s read the whole content of the file in a string variable and use one of our earlier methods to remove the punctuation from this content string before writing it into a new file.

import re

punct = re.compile(r"[^\w\s]")

input_file = "short_sample.txt"

output_file = "short_sample_processed.txt"

# the with block closes the file automatically once reading is done
with open(input_file) as f:

    file_content = f.read() #reading entire file content as string

print(f"File content: {file_content}\n")

new_file_content = re.sub(punct, "", file_content)

print(f"New file content: {new_file_content}\n")

# writing it to new file
with open(output_file, "w") as fw:

    fw.write(new_file_content)

Output:

Removing punctuation from a file by reading entire file at once

We read the entire file at once in the above example. The text file, however, may span millions of lines, amounting to a few hundred MBs or a few GBs.
In such a case, it doesn’t make sense to read the entire file at once, as that could exhaust the available memory.

So, we will read the text file one line at a time, process it, and write it to the new file.
Doing this iteratively will not cause memory overload; however, it may add some overhead because repeated input/output operations are costlier.

In the following example, we will remove punctuation from a text file containing the story ‘The Devil With Three Golden Hairs’.

import re

punct = re.compile(r"[^\w\s]")

input_file = "the devil with three golden hairs.txt"

output_file = "the devil with three golden hairs_processed.txt"

# reading the input line by line and writing the result to the new file; both files are closed automatically
with open(input_file) as f_reader, open(output_file, "w") as f_writer:

    for line in f_reader:

        line = line.strip() #removing whitespace at ends

        line = re.sub(punct, "",line) #removing punctuation

        line += "\n"

        f_writer.write(line)
        
print(f"First 10 lines of original file:")

with open(input_file) as f:

    i = 0

    for line in f:

        print(line,end="")

        i+=1

        if i==10:

            break
            
print(f"\nFirst 10 lines of output file:")

with open(output_file) as f:

    i = 0

    for line in f:

        print(line,end="")

        i+=1

        if i==10:

            break

Output:

Removing punctuation from a file by reading one line at a time

As seen from the first 10 lines, the punctuation has been removed from the input file, and the result is stored in the output file.
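The line-by-line approach can be packaged as a reusable function. The sketch below makes a few assumptions of its own: the function name strip_punctuation_from_file is made up, it uses str.translate rather than the regex, it assumes UTF-8 files, and it demonstrates itself on a small temporary file instead of the story file.

```python
import os
import string
import tempfile

PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation_from_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path line by line, dropping ASCII punctuation."""
    with open(src_path, encoding=encoding) as reader, \
         open(dst_path, "w", encoding=encoding) as writer:
        for line in reader:
            # the newline is not punctuation, so line breaks are preserved
            writer.write(line.translate(PUNCT_TABLE))


# small demonstration inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, "sample.txt")
    dst = os.path.join(tmp_dir, "sample_processed.txt")

    with open(src, "w", encoding="utf-8") as f:
        f.write('"Hello," she said.\nIt\'s late!\n')

    strip_punctuation_from_file(src, dst)

    with open(dst, encoding="utf-8") as f:
        processed = f.read()

print(processed)
```

Only one line is held in memory at a time, so the same function should work for files far larger than the available RAM.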

 

Remove all punctuation except apostrophe

Apostrophes, in the English language, carry meaning. They are used to form possessive nouns and to shorten words by omitting letters (e.g., cannot = can’t, will not = won’t).

So it becomes important to retain the apostrophe characters while processing texts to avoid losing this meaning.

Let us remove all the punctuation but the apostrophes from a text.

import string

s=""""I should like to have three golden hairs from the devil's head",
answered he, "else I cannot keep my wife".
No sooner had he entered than he noticed that the air was not pure. "I smell man's
flesh", said he, "all is not right here".
The queen, when she had received the letter and read it, did as was written in it, and had a splendid wedding-feast
prepared, and the king's daughter was married to the child of good fortune, and as the youth was handsome and friendly she lived
with him in joy and contentment."""

print(f"Input text:\n{s}\n")

tr_table = str.maketrans("","", string.punctuation)

del tr_table[ord("'")] #deleting ' from translation table

print(f"Removing punctuation except apostrophe:\n{s.translate(tr_table)}\n")

Output:

Removing all punctuation except apostrophe

A translation table is a dictionary whose keys are integers: the Unicode code points of the characters.
The built-in ord function returns the Unicode code point of a character. We use it to delete the apostrophe's entry from the translation table.
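Instead of deleting an entry after the table is built, the apostrophe can be left out from the start. The sketch below shows two possible variants on a shortened sample sentence: building the table from string.punctuation minus the apostrophe, and a regex that adds the apostrophe to the set of characters to keep.

```python
import re
import string

# every ASCII punctuation character except the apostrophe
punct_to_remove = string.punctuation.replace("'", "")

tr_table = str.maketrans("", "", punct_to_remove)

s = '"I smell man\'s flesh", said he, "all is not right here".'

via_translate = s.translate(tr_table)

# regex variant: keep word characters, whitespace and apostrophes
via_regex = re.sub(r"[^\w\s']", "", s)

print(via_translate)
print(via_regex)
```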

 

Performance Comparison

Now that we have seen so many different ways for removing punctuation in Python, let us compare them in terms of their time consumption.

We will compare the performances of replace, maketrans, regex, and nltk.

We will use the tqdm module, which shows a progress bar along with the elapsed time, to measure the performance of each method.
We will run each method 100,000 times.
Each time, we generate a random string of 1000 characters (a-z, A-Z, 0-9, and punctuation) and use our methods to remove punctuation from it.

Output:

Comparing performances of different approaches removing punctuation

The str.maketrans method, in combination with str.translate, is the fastest of all; it took 26 seconds to finish 100,000 iterations.
str.replace came a close second, taking 28 seconds to finish the task.
The slowest approach is the use of nltk’s tokenizers.
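Absolute timings like these depend on the machine and the Python version, so it can be worth repeating the measurement locally. The following is a minimal sketch of such a comparison and not the benchmark used for the numbers above: it relies on the standard timeit module instead of tqdm, times a single fixed random string rather than a fresh one per iteration, runs only 1000 repetitions, and leaves nltk out. All function names are illustrative.

```python
import random
import re
import string
import timeit

ALPHABET = string.ascii_letters + string.digits + string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)
PUNCT_REGEX = re.compile(f"[{re.escape(string.punctuation)}]")


def with_replace(text):
    for p in string.punctuation:
        text = text.replace(p, "")
    return text


def with_translate(text):
    return text.translate(PUNCT_TABLE)


def with_regex(text):
    return PUNCT_REGEX.sub("", text)


random.seed(0)  # fixed seed so every run times the same input
sample = "".join(random.choices(ALPHABET, k=1000))

results = {}
for func in (with_replace, with_translate, with_regex):
    # total time in seconds for 1000 calls on the same 1000-character string
    results[func.__name__] = timeit.timeit(lambda: func(sample), number=1000)

for name, seconds in results.items():
    print(f"{name}: {seconds:.4f} s for 1000 runs")
```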

 

Conclusion

In this tutorial, we looked at and analyzed various methods of removing punctuation from text data.

We began by looking at the str.replace method. Then, we saw the use of translation tables to replace certain characters with other characters or None.

We then used regular expressions to match all punctuation in the string and remove it.
Next, we looked at a popular NLP library called nltk and used one of its text preprocessing methods called word_tokenize with the default tokenizer to fetch tokens from an input string. We also used the RegexpTokenizer for our specific use case.

We also saw how we can remove punctuation only from the start and end of the string.
We removed not only the punctuation but also the extra spaces at the two ends as well as between the words in the given text.
We also saw how we can retain the apostrophes while removing every other punctuation from the input text.

We saw how we can remove punctuation from any length of text stored in an external text file, and write the processed text in another text file.

Finally, we compared the performances of the 4 prominent methods we saw for removing punctuation from a string.
