Remove Outliers from Histogram Visualizations in Python

In this tutorial, you’ll learn various methods to identify and remove outliers from your Seaborn histogram data.

We’ll explore methods like the Z-score method, Interquartile Range (IQR) method, and Standard Deviation method.

 

 

Impact of Outliers on Data Visualization

We’ll start by generating a sample dataset with outliers.

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
data = np.random.normal(50, 15, 1000)

# Introducing outliers
data_with_outliers = np.append(data, [200, 205, 210])
plt.hist(data_with_outliers, bins=30, color='skyblue', edgecolor='black')
plt.title('Histogram with Outliers')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Code Output:

Histogram with outliers

In this histogram, you can see how the outliers (values around 200) create a skewed view of the data distribution.

The majority of the data is centered around 50, but the presence of extreme values distorts the scale, making it harder to analyze the bulk of the data effectively.

 

Using Z-score Method

The Z-score method measures how many standard deviations away a data point is from the mean. Typically, data points with a Z-score greater than 3 or less than -3 are considered outliers.

We’ll calculate the Z-scores for each data point and then filter out the outliers.

from scipy import stats

# Calculating Z-scores
z_scores = stats.zscore(data_with_outliers)

# Filtering data without outliers
data_without_outliers = data_with_outliers[(z_scores > -3) & (z_scores < 3)]
plt.hist(data_without_outliers, bins=30, color='lightgreen', edgecolor='black')
plt.title('Histogram without Outliers')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Code Output:

Using Z-score Method

The data is now more centrally clustered, offering a clearer view of the true distribution.

 

 

Using IQR (Interquartile Range) Method

Another method for identifying and removing outliers is the Interquartile Range (IQR) method.

The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data.

Outliers are typically defined as observations that fall below Q1 – 1.5_IQR or above Q3 + 1.5_IQR.

# Calculating Q1, Q3, and IQR
Q1 = np.percentile(data_with_outliers, 25)
Q3 = np.percentile(data_with_outliers, 75)
IQR = Q3 - Q1

# Filtering data based on IQR
data_iqr_filtered = data_with_outliers[(data_with_outliers >= (Q1 - 1.5 * IQR)) & (data_with_outliers <= (Q3 + 1.5 * IQR))]
plt.hist(data_iqr_filtered, bins=30, color='lightcoral', edgecolor='black')
plt.title('Histogram after IQR Filtering')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Code Output:

Using IQR Method

This approach using the IQR method has eliminated extreme values.

 

Using Standard Deviation Method

In this method, we consider data points that lie beyond a certain number of standard deviations from the mean as outliers.

A common practice is to exclude data points that are more than two or three standard deviations away from the mean.

# Calculating mean and standard deviation
mean = np.mean(data_with_outliers)
std_deviation = np.std(data_with_outliers)

# Defining the threshold for outliers
threshold = 3 * std_deviation

# Filtering out outliers
data_std_filtered = data_with_outliers[(data_with_outliers > (mean - threshold)) & (data_with_outliers < (mean + threshold))]
plt.hist(data_std_filtered, bins=30, color='goldenrod', edgecolor='black')
plt.title('Histogram after Standard Deviation Filtering')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Code Output:

Using Standard Deviation

 

Ensuring Data Integrity Post Outlier Removal

After removing outliers from your dataset, it’s important to ensure that the integrity of your data remains intact.

This means verifying that the data still accurately reflects the case you’re analyzing.

Here are steps to validate data integrity post outlier removal:

  1. Statistical Summary: Examine basic statistical measures before and after outlier removal.
  2. Distribution Analysis: Compare the distributions pre and post outlier removal.
  3. Consistency Check: Ensure that the data still aligns with expected patterns or known benchmarks.

Let’s apply these steps to our dataset:

# Statistical Summary before outlier removal
original_stats = {
    "Mean": np.mean(data_with_outliers),
    "Standard Deviation": np.std(data_with_outliers),
    "Min": np.min(data_with_outliers),
    "Max": np.max(data_with_outliers)
}

# Statistical Summary after outlier removal using Standard Deviation method
filtered_stats = {
    "Mean": np.mean(data_std_filtered),
    "Standard Deviation": np.std(data_std_filtered),
    "Min": np.min(data_std_filtered),
    "Max": np.max(data_std_filtered)
}

print("Original Data Statistics:", original_stats)
print("Filtered Data Statistics:", filtered_stats)

Code Output:

Original Data Statistics: {'Mean': 49.78678902058531, 'Standard Deviation': 17.054922584770093, 'Min': 4.307854178001101, 'Max': 210.0}
Filtered Data Statistics: {'Mean': 49.32114938764706, 'Standard Deviation': 14.805497380035387, 'Min': 4.307854178001101, 'Max': 91.39032671032373}

The output shows that the mean and standard deviation have become more representative of the central data points after outlier removal.

Leave a Reply

Your email address will not be published. Required fields are marked *