Customizing Bins in Seaborn Histogram (histplot)

In this tutorial, we’ll learn various methods to customize bins in Seaborn histplot.

You’ll learn how to dynamically adjust bin sizes, from determining the optimal number of bins to implementing advanced strategies like logarithmic binning and quantile-based binning.

Whether your data is evenly distributed, skewed, or includes outliers, at least one of the methods we explore should suit it.

 

 

Determining the Number of Bins

By default, Seaborn chooses a bin number that generally represents your data well, but there are cases where you need to customize this number.

First, let’s import the necessary libraries and prepare a sample dataset:

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
np.random.seed(0)
data = pd.DataFrame({
    'MonthlyCharges': np.random.normal(70, 30, 1000)
})

Now, let’s visualize the distribution of MonthlyCharges using the default bin setting in Seaborn:

sns.histplot(data['MonthlyCharges'])
plt.show()

Code Output:

Default bin setting

To customize the bin number, let’s apply Sturges’ formula, 1 + log2(n), written below as 1 + 3.322 * log10(n). It is a popular choice for determining the bin count:

bin_count_sturges = int(1 + 3.322 * np.log10(len(data)))
sns.histplot(data['MonthlyCharges'], bins=bin_count_sturges)
plt.show()

Code Output:

Sturges formula

For comparison, let’s apply the square-root rule:

bin_count_sqrt = int(np.sqrt(len(data)))
sns.histplot(data['MonthlyCharges'], bins=bin_count_sqrt)
plt.show()

Code Output:

square-root rule

Lastly, let’s look at the Rice Rule:

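# Round up before converting: 1000 ** (1/3) evaluates to 9.999999999999998,
# so int() alone would truncate the result to 19 instead of 20
bin_count_rice = int(np.ceil(2 * len(data) ** (1/3)))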
sns.histplot(data['MonthlyCharges'], bins=bin_count_rice)
plt.show()

Code Output:

Rice Rule
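
The three rules above are also available as named estimators in NumPy's `np.histogram_bin_edges`. Here is a minimal sketch that compares the bin counts they produce, assuming a synthetic sample shaped like the tutorial's data (`sample`, `rng`, and `bin_counts` are illustrative names, not part of the original code):

```python
import numpy as np

# Illustrative stand-in for the MonthlyCharges column
# (assumption: same distribution as the tutorial, different random generator)
rng = np.random.default_rng(0)
sample = rng.normal(70, 30, 1000)

# Ask NumPy for the edges that each named rule would produce
bin_counts = {}
for rule in ("sturges", "sqrt", "rice"):
    edges = np.histogram_bin_edges(sample, bins=rule)
    bin_counts[rule] = len(edges) - 1  # n edges describe n - 1 bins

print(bin_counts)
```

Seaborn's `bins` parameter accepts the same rule names as strings, so `sns.histplot(data['MonthlyCharges'], bins='rice')` should give a comparable result without computing the count by hand.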

 

Set Fixed Bin Widths

Setting fixed bin widths in Seaborn’s histplot allows you to maintain consistent bin sizes across different plots, making it easier to draw comparisons.

Assuming we still have our dataset data loaded, we first decide on the fixed width for each bin. For instance, suppose we want each bin to represent a range of $10 in monthly charges:

fixed_bin_width = 10
bins = np.arange(start=data['MonthlyCharges'].min(), 
                 stop=data['MonthlyCharges'].max() + fixed_bin_width, 
                 step=fixed_bin_width)
sns.histplot(data['MonthlyCharges'], bins=bins)
plt.show()

Code Output:

Setting Fixed Bin Widths

Now, let’s see what happens when we compare two distributions with the same fixed bin width. Suppose we have another set of monthly charges from a different year in our dataset:

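data['MonthlyCharges_2ndYear'] = np.random.normal(75, 35, 1000)

# Rebuild the edges over both columns so that no value falls outside the bins
both_years = data[['MonthlyCharges', 'MonthlyCharges_2ndYear']]
bins = np.arange(both_years.min().min(),
                 both_years.max().max() + fixed_bin_width,
                 fixed_bin_width)

# Plot both histograms with the same fixed-width bins
sns.histplot(data['MonthlyCharges'], bins=bins, color='blue', alpha=0.5)
sns.histplot(data['MonthlyCharges_2ndYear'], bins=bins, color='red', alpha=0.5)
plt.show()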

Code Output:

compare two distributions

This direct comparison is possible because both histograms share the same bin edges.
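
When the same bins have to be reused across plots, it can help to anchor the edges at round multiples of the bin width instead of at the data minimum. The sketch below shows one way to do this with NumPy alone; the arrays `year1` and `year2` are illustrative stand-ins for the two columns, and `shared_edges` is a hypothetical name:

```python
import numpy as np

# Illustrative stand-ins for the two MonthlyCharges columns (assumption)
rng = np.random.default_rng(0)
year1 = rng.normal(70, 30, 1000)
year2 = rng.normal(75, 35, 1000)

width = 10

# Snap the overall range outward to multiples of the bin width
lo = np.floor(min(year1.min(), year2.min()) / width) * width
hi = np.ceil(max(year1.max(), year2.max()) / width) * width
shared_edges = np.arange(lo, hi + width, width)

# Both samples are counted against the same edges, so no value is dropped
counts1, _ = np.histogram(year1, bins=shared_edges)
counts2, _ = np.histogram(year2, bins=shared_edges)
print(shared_edges[0], shared_edges[-1], counts1.sum(), counts2.sum())
```

The resulting array can be passed as `bins=shared_edges` to each `sns.histplot` call. If a shared array is not needed, `sns.histplot` also accepts a `binwidth` argument that sets the width directly.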

 

Define Custom Bin Edges (Range)

Defining custom bin edges in Seaborn histplot allows you to tailor the histogram to specific data ranges or to highlight particular aspects of your dataset.

This method is useful when the data includes outliers or when you want to focus on a specific range of values.

First, suppose we want bins that correspond to spending bands, such as low, medium, high, and very high. Note that values outside the first and last edge (here, below 0 or above 200) are left out of the plot:

custom_bins = [0, 50, 100, 150, 200]
sns.histplot(data['MonthlyCharges'], bins=custom_bins)
plt.show()

Code Output:

Define Custom Bin

Now, let’s say we want to explore the higher end of the spending scale in more detail. We can adjust our custom bins to focus on the upper range of monthly charges:

# Adjusting custom bins to focus on higher charges
custom_bins_high_end = [100, 125, 150, 175, 200, 225, 250]

# Plotting the histogram with the new set of custom bins
sns.histplot(data['MonthlyCharges'], bins=custom_bins_high_end)
plt.show()

Code Output:

focus on the upper range
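
Explicit edges also act as a filter, which is easy to overlook. The sketch below uses `np.histogram`, which follows the same edge conventions, to count how many values fall outside a set of custom edges. The `charges` array is an illustrative stand-in for the tutorial's column:

```python
import numpy as np

# Illustrative stand-in for the MonthlyCharges column (assumption)
rng = np.random.default_rng(0)
charges = rng.normal(70, 30, 1000)

custom_edges = [0, 50, 100, 150, 200]
counts, _ = np.histogram(charges, bins=custom_edges)

# Values below the first edge or above the last edge belong to no bin
outside = int(((charges < custom_edges[0]) | (charges > custom_edges[-1])).sum())
print(f"inside the edges: {counts.sum()}, outside: {outside}")
```

If the number of excluded points is large, the histogram shows only part of the distribution, so it is worth stating the plotted range in the figure title or caption.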

 

Logarithmic Binning for Skewed Data

Logarithmic binning is useful for datasets that span a wide range of values, where the majority of data points are clustered in one part of the range.

First, ensure that there are no non-positive values in your data, as the logarithm of zero or negative numbers is undefined. Then, proceed with the logarithmic binning:

# Filtering out non-positive values
data_filtered = data[data['MonthlyCharges'] > 0]

# Applying logarithmic binning
bin_edges = np.logspace(start=np.log10(data_filtered['MonthlyCharges'].min()),
                        stop=np.log10(data_filtered['MonthlyCharges'].max()),
                        num=20)  # 20 edges give 19 bins; adjust as needed

sns.histplot(data_filtered['MonthlyCharges'], bins=bin_edges)
plt.xscale('log') # Setting the x-axis to a logarithmic scale
plt.show()

Code Output:

Logarithmic Binning
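
To see what makes these edges logarithmic, here is a small sketch using `np.geomspace`, which builds the same kind of edges without taking logarithms by hand. The right-skewed `skewed` sample is an assumption made for illustration, since the tutorial's data is normally distributed:

```python
import numpy as np

# Illustrative right-skewed, strictly positive sample (assumption)
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

n_bins = 20
# n_bins + 1 edges, spaced evenly on a log scale
log_edges = np.geomspace(skewed.min(), skewed.max(), n_bins + 1)

# Each edge is a constant multiple of the previous one
ratios = log_edges[1:] / log_edges[:-1]
print(ratios.min(), ratios.max())
```

Because the ratio between neighbouring edges is constant, the bars appear equally wide once the x-axis is on a log scale. `sns.histplot` also has a `log_scale` parameter that bins in log space directly, which may be simpler when custom edges are not required.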

 

Adaptive Binning Strategies

Adaptive binning strategies in histogram plotting provide a dynamic way to visualize data, where the bin widths are not uniform but instead vary according to the density of the data points.

This method is useful for datasets with unevenly distributed values. It allows for a more detailed representation in areas with higher data concentration and a broader view in sparser areas.

To implement adaptive binning, you can use kernel density estimation (KDE) to identify dense regions of data and adjust bin sizes accordingly.

from scipy.stats import gaussian_kde

# Estimate the density of the data on a regular grid
values = data['MonthlyCharges'].to_numpy()
kde = gaussian_kde(values)
grid = np.linspace(values.min(), values.max(), 512)

# Turn the estimated density into a cumulative distribution scaled to [0, 1]
cdf = np.cumsum(kde(grid))
cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])

# Adaptive binning: place edges at equal steps of estimated probability
# (20 edges give 19 bins), so dense regions get narrower bins
adaptive_bins = np.interp(np.linspace(0, 1, 20), cdf, grid)
sns.histplot(values, bins=adaptive_bins)
plt.show()

Code Output:

Adaptive Binning

Areas with a high concentration of data points have narrower bins, whereas sparser areas are represented with wider bins.

 

Binning based on Data Distribution

Quantile-based binning places the bin edges at quantiles of the data, so each bin holds an equal number of data points.

This method is effective for datasets with a skewed distribution, as it helps in visualizing the spread and concentration of data across different segments.

num_quantile_bins = 4 

# Calculate quantiles
quantile_bins = data['MonthlyCharges'].quantile(np.linspace(0, 1, num_quantile_bins + 1))
sns.histplot(data['MonthlyCharges'], bins=quantile_bins)
plt.xticks(quantile_bins)
plt.show()

Code Output:

quantile bins
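
One consequence of quantile bins is that a count histogram shows bars of nearly equal height, so the shape of the distribution is carried by the bin widths. The sketch below illustrates this with NumPy, comparing raw counts against density-normalized heights. The `charges` array is an illustrative stand-in for the tutorial's column:

```python
import numpy as np

# Illustrative stand-in for the MonthlyCharges column (assumption)
rng = np.random.default_rng(0)
charges = rng.normal(70, 30, 1000)

# Edges at the 0%, 25%, 50%, 75%, and 100% quantiles
quartile_edges = np.quantile(charges, np.linspace(0, 1, 5))

# Raw counts are (nearly) identical in every bin
counts, _ = np.histogram(charges, bins=quartile_edges)

# Dividing by bin width recovers the shape: narrow central bins are taller
density, _ = np.histogram(charges, bins=quartile_edges, density=True)
print(counts, density)
```

In Seaborn, passing `stat='density'` to `sns.histplot` gives the width-normalized view, which is usually easier to read when bin widths differ.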

 

Underflow and Overflow Bins

Underflow and overflow bins are used to capture the data points that fall below or above the specified range of the histogram.

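Seaborn’s histplot has no dedicated underflow or overflow bins: values outside the bin edges are simply left out of the plot. To keep them, we can set explicit lower and upper bounds and clip the data to those bounds.

Data points falling outside the bounds then accumulate in the first and last bins, which act as the underflow and overflow bins.

lower_bound = 20
upper_bound = 180
bins = np.arange(lower_bound, upper_bound + 10, 10)  # edges from 20 to 180 inclusive
# Clip so that values below 20 or above 180 land in the first and last bins
clipped_charges = data['MonthlyCharges'].clip(lower=lower_bound, upper=upper_bound)
sns.histplot(clipped_charges, bins=bins)
plt.show()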

Code Output:

Underflow and Overflow Bins

It’s important to note that the choice of bounds should be based on your understanding of the dataset and the specific requirements of your analysis.

For instance, if you know that certain extreme values are anomalies or errors, you might want to set your bounds to exclude these points.
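
Before choosing the bounds, it can be useful to count how many points they would push into the edge bins. A minimal sketch with NumPy, assuming the same bounds as above and an illustrative `charges` array in place of the tutorial's column:

```python
import numpy as np

# Illustrative stand-in for the MonthlyCharges column (assumption)
rng = np.random.default_rng(0)
charges = rng.normal(70, 30, 1000)

lower, upper, width = 20, 180, 10
edges = np.arange(lower, upper + width, width)  # 20, 30, ..., 180

# Points that lie outside the bounds
underflow = int((charges < lower).sum())
overflow = int((charges > upper).sum())

# Histogram of the in-range values only, then of the clipped values
in_range_counts, _ = np.histogram(charges, bins=edges)
clipped_counts, _ = np.histogram(np.clip(charges, lower, upper), bins=edges)

print(f"underflow: {underflow}, overflow: {overflow}")
print(f"first bin: {in_range_counts[0]} -> {clipped_counts[0]}")
print(f"last bin: {in_range_counts[-1]} -> {clipped_counts[-1]}")
```

If the underflow or overflow count dominates its bin, the edge bars no longer describe the interval they are drawn over, so labelling them (for example "< 20" and "> 180") helps avoid misreading the plot.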
