Python standard deviation tutorial

Standard deviation is a metric that quantifies how much the values in a dataset vary around their mean. A small standard deviation means that the data points tend to be near the average value, whereas a large standard deviation indicates that the data points are widely dispersed.

In the following sections of this tutorial, we will look at how you can calculate and interpret standard deviations using Python.

 

 

Types of Standard Deviation

There are two types of standard deviation: population standard deviation and sample standard deviation.

The population standard deviation is used when your data includes all the members of the population. It simply calculates the root mean square deviation of the values from the mean.

The sample standard deviation is used when your data is a subset or a sample of the entire population. It assumes that your data does not include the entire set of possible data.

However, a sample should not only be smaller than the population; it should also be spread across the range of the population data rather than clustered at one end.

Assume your population data is [2, 4, 6, 8, 10]. Samples that span most or all of that range could be any of the following:

[2, 6, 8]
[2, 8, 10]
[2, 8]
[2,10]

For any single dataset, the population standard deviation is lower than or equal to the sample standard deviation, because the population formula divides by N while the sample formula divides by N-1. In this example, the sample standard deviation of each sample above is also higher than the population standard deviation of the full data (about 2.83).

If the sample data does not cover the range of the population data, the sample standard deviation tends to come out lower than the true population value.

As a result, any statistics calculated from the sample (like the mean or standard deviation) may not accurately reflect the true population values.

Examples of samples with poor coverage:

[2, 4]
[2, 4, 6]
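
To make the effect of coverage concrete, the short sketch below compares the population standard deviation of [2, 4, 6, 8, 10] with the sample standard deviation of each sample listed above. The variable names wide_samples and narrow_samples are only illustrative labels, and six tiny samples are an illustration rather than a statistical proof.

```python
import statistics

population = [2, 4, 6, 8, 10]
population_sd = statistics.pstdev(population)
print(f"population {population}: {population_sd:.2f}")

# Samples taken from the two lists above (labels are illustrative)
wide_samples = [[2, 6, 8], [2, 8, 10], [2, 8], [2, 10]]
narrow_samples = [[2, 4], [2, 4, 6]]

for sample in wide_samples + narrow_samples:
    # stdev() uses the sample formula (divides by N-1)
    sample_sd = statistics.stdev(sample)
    print(f"sample {sample}: {sample_sd:.2f}")
```

In this small example, the samples clustered at the low end give values below the population figure, while the more widely spread samples give values above it.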

 

Calculating Standard Deviation Manually

If you want to understand the underlying computations of standard deviation, you should implement it manually.

The formula for standard deviation is the square root of the variance, where variance is the average of the squared deviations from the mean.
Here is how we can do it:

  1. Calculate the mean of the dataset.
  2. For each number in the dataset, subtract the mean and square the result.
  3. Calculate the average of these squared deviations.
  4. Take the square root of the result from step 3.

Let’s implement this using the math module:

import math
data = [2, 4, 6, 8, 10]

# Step 1: Calculate the mean
mean = sum(data) / len(data)

# Step 2 and 3: Calculate the average of the squared deviations
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Step 4: Calculate the standard deviation
std_dev = math.sqrt(variance)
print("Population Standard Deviation:", std_dev)

Output:

Population Standard Deviation: 2.8284271247461903

Calculating the sample standard deviation is slightly different from the population standard deviation.

The formula to compute the sample standard deviation is also the square root of the variance, but instead of dividing by the number of data points (N), we divide by N-1.

import math
data = [2, 8, 10]

# Step 1: Calculate the mean
mean = sum(data) / len(data)

# Step 2 and 3: Sum the squared deviations and divide by N-1
variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)

# Step 4: Calculate the standard deviation
std_dev = math.sqrt(variance)

print("Sample Standard Deviation:", std_dev)

Output:

Sample Standard Deviation: 4.163331998932266
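
Since the two manual calculations differ only in the divisor, they could be folded into a single helper. The sketch below is one possible way to do that; the function name manual_std_dev is hypothetical, and the ddof parameter name is borrowed from NumPy's convention (0 for population, 1 for sample).

```python
import math

def manual_std_dev(values, ddof=0):
    """Standard deviation with divisor N - ddof (hypothetical helper)."""
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more than ddof data points")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - ddof))

print("Population:", manual_std_dev([2, 4, 6, 8, 10]))
print("Sample:", manual_std_dev([2, 8, 10], ddof=1))
```

The guard against a non-positive divisor matters for the sample case: a single value has no sample standard deviation.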

 

Using statistics.stdev()

The Python standard library’s statistics module provides a convenient function stdev() to calculate the sample standard deviation. Here’s an example:

import statistics
data = [2, 8, 10]

# Calculate standard deviation
std_dev = statistics.stdev(data)
print("Standard Deviation:", std_dev)

Output:

Standard Deviation: 4.163331998932265

In this example, we first import the statistics module and then use its stdev() function to calculate the standard deviation of the data list.

The result (4.16) indicates that the data points typically lie about 4.16 units from the mean. Strictly speaking, it is the square root of the average squared deviation (with N-1 as the divisor), not the plain average distance.
This is the standard deviation calculated for a sample. If you’re working with the entire population, use statistics.pstdev() instead.
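
Two properties of stdev() are worth checking for yourself: it is the square root of statistics.variance(), and it needs at least two data points. The sketch below demonstrates both; the variable error_message is only there to capture the exception text.

```python
import math
import statistics

data = [2, 8, 10]

# stdev() is the square root of the sample variance
sample_variance = statistics.variance(data)
matches = math.isclose(statistics.stdev(data), math.sqrt(sample_variance))
print("stdev equals sqrt(variance):", matches)

# A single observation has no sample standard deviation
error_message = None
try:
    statistics.stdev([5])
except statistics.StatisticsError as exc:
    error_message = str(exc)
print("Error for one data point:", error_message)
```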

 

Using statistics.pstdev()

To calculate the population standard deviation using Python’s built-in statistics module, you can use the pstdev() function.

This function works similarly to stdev(), but it calculates the standard deviation assuming that the provided data represents the entire population.
Here’s an example:

import statistics

# Population data
data = [2, 4, 6, 8, 10]

# Calculate population standard deviation
std_dev = statistics.pstdev(data)
print("Standard Deviation:", std_dev)

Output:

Standard Deviation: 2.8284271247461903

In this example, we’re computing the standard deviation of data using statistics.pstdev(). The resulting standard deviation (2.83) tells us that the data points typically deviate from the mean by about 2.83.
Remember, this function should be used when your data represents the entire population, not just a sample.

 

Using numpy.std()

The np.std() function in NumPy allows you to compute the standard deviation.

This function is versatile as it not only calculates the standard deviation of a list or a NumPy array, but it also allows calculations along a specified axis for multi-dimensional arrays.
Here’s how you can use np.std():

import numpy as np

# Data as a numpy array
data = np.array([2, 4, 6, 8, 10])

# Calculate standard deviation
std_dev = np.std(data)
print("Standard Deviation:", std_dev)

Output:

Standard Deviation: 2.8284271247461903

By default, np.std() calculates the population standard deviation. If you want to calculate the sample standard deviation instead, you can set the ddof (Delta Degrees of Freedom) parameter to 1:

# Calculate sample standard deviation
data = np.array([2, 8, 10])
sample_std_dev = np.std(data, ddof=1)
print("Sample Standard Deviation:", sample_std_dev)

Output:

Sample Standard Deviation: 4.163331998932266

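Here the result is larger than in the previous example, partly because the data is different and partly because ddof=1 divides by N-1 instead of N. On the same data, ddof=1 always gives a value at least as large as the default, which compensates for the tendency of a sample to underestimate the population spread.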
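
The axis parameter mentioned at the start of this section is easiest to see on a small two-dimensional array. The sketch below uses a made-up 2x3 array (not data from this tutorial) and assumes NumPy is installed.

```python
import numpy as np

# Hypothetical 2-D data: 2 rows, 3 columns
matrix = np.array([[2, 4, 6],
                   [8, 10, 12]])

overall = np.std(matrix)             # all six values, flattened
per_column = np.std(matrix, axis=0)  # one value per column
per_row = np.std(matrix, axis=1)     # one value per row

print("Overall:", overall)
print("Per column:", per_column)
print("Per row:", per_row)
```

With axis=0 the calculation runs down the rows and yields one result per column; with axis=1 it runs across the columns and yields one result per row.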

 

Using pandas.DataFrame.std()

The pandas.DataFrame.std() function is built to handle standard deviation calculations for large DataFrames, column by column.
Here’s a simple example:

import pandas as pd
data = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 4, 6, 8, 10],
    'C': [3, 6, 9, 12, 15]
})

# Calculate standard deviation
std_dev = data.std()
print("Standard Deviation: ")
print(std_dev)

Output:

Standard Deviation: 
A    1.581139
B    3.162278
C    4.743416
dtype: float64

In this example, pandas.DataFrame.std() computes the standard deviation for each column in the DataFrame and returns a Series with these values, indexed by column name.

By default, this function calculates the sample standard deviation.

If you want to compute the population standard deviation, you can set the ddof parameter to 0:

# Calculate population standard deviation
pop_std_dev = data.std(ddof=0)
print("Population Standard Deviation: ")
print(pop_std_dev)

Output:

Population Standard Deviation: 
A    1.414214
B    2.828427
C    4.242641
dtype: float64

As with the NumPy function, setting ddof to 0 calculates the population standard deviation, resulting in slightly smaller values.
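
Because pandas and NumPy use different defaults for ddof, the same numbers can produce two different results if you switch libraries without noticing. The sketch below is a quick cross-check on the values from column B, assuming both libraries are installed.

```python
import numpy as np
import pandas as pd

column = pd.Series([2, 4, 6, 8, 10])
values = column.to_numpy()

pandas_default = column.std()   # pandas default: ddof=1 (sample)
numpy_default = np.std(values)  # NumPy default: ddof=0 (population)

print("pandas default:", pandas_default)
print("NumPy default:", numpy_default)

# Passing ddof explicitly makes the two libraries agree
print("Both sample:", np.isclose(np.std(values, ddof=1), pandas_default))
print("Both population:", np.isclose(column.std(ddof=0), numpy_default))
```

Passing ddof explicitly in both libraries is a simple way to avoid relying on either default.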

 

Standard Deviation For the Friendship

It was a text from a friend, Adam, who was now a small-time investor.

The message read, “Hey, could you help me with something? I’ve been dabbling with some stocks, can you use your Python wizardry to check the volatility of the closing prices for one of my stocks over the last 10 days?”

Even though I wasn’t a financial guru, my skills in Python programming could indeed help.

“Sure,” I texted back, “Just send over the data.”

A minute later, another text from Adam popped up with the closing prices for his stock: 100, 105, 102, 106, 102, 107, 104, 103, 105, 106.

First, calculate the returns for each period. For stock prices, a simple return for each period can be calculated as (price_now - price_previous) / price_previous.

The return represents the percentage change in price from one period to the next.

Then, calculate the standard deviation of these returns.

import pandas as pd
import numpy as np
prices = pd.Series([100, 105, 102, 106, 102, 107, 104, 103, 105, 106])

# Calculate returns; the first element is NaN (no prior price), so drop it
returns = prices.pct_change().dropna()

# Calculate volatility
volatility = np.std(returns, ddof=1)  # ddof=1 for sample standard deviation
print(f"Volatility: {volatility*100:.2f}%")

The pct_change() function calculates the fractional change between each element and the one before it; the first element has no predecessor, so its result is NaN. The np.std() function then calculates the standard deviation of these returns.

We can achieve the same result without the pct_change() function:

import numpy as np
prices = np.array([100, 105, 102, 106, 102, 107, 104, 103, 105, 106])

# Calculate returns
returns = np.diff(prices) / prices[:-1]

# Calculate volatility
volatility = np.std(returns, ddof=1)
print(f"Volatility: {volatility*100:.2f}%")

The output was 3.46%, a concise number representing the volatility of Adam’s chosen stock. The higher the standard deviation, the higher the volatility, and vice versa.

I quickly texted Adam: “Done. The volatility is 3.46%.”

A moment later, my phone buzzed with Adam’s reply: “You’re a lifesaver! This really helps!”
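
If pandas and NumPy are not available, the same volatility figure can be cross-checked with the standard library alone. This is a minimal sketch using Adam's ten closing prices and statistics.stdev(), which applies the sample formula.

```python
import statistics

prices = [100, 105, 102, 106, 102, 107, 104, 103, 105, 106]

# Simple return for each period: (price_now - price_previous) / price_previous
returns = [(now - prev) / prev for prev, now in zip(prices, prices[1:])]

# Sample standard deviation of the nine returns
volatility = statistics.stdev(returns)
print(f"Volatility: {volatility * 100:.2f}%")
```

Ten prices yield nine returns, so the standard deviation here is computed over nine values.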

 

F.A.Q

Q: What is the Python statistics module?

A: The Python statistics module is a built-in library in Python used for mathematical statistics tasks.

It provides functions to calculate mathematical statistics of sample data, such as mean, median, mode, variance, and standard deviation, among others.
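
As a quick illustration of the functions named above, the sketch below runs each of them on a small made-up list (not data from this tutorial).

```python
import statistics

# Hypothetical data chosen so the results are easy to check by hand
data = [2, 4, 4, 6, 9]

print("Mean:", statistics.mean(data))
print("Median:", statistics.median(data))
print("Mode:", statistics.mode(data))
print("Sample variance:", statistics.variance(data))
print("Sample standard deviation:", statistics.stdev(data))
```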

 

Q: How can I calculate the standard deviation of a list in Python?

A: Using the statistics module, pass the list to the stdev() function, which returns the sample standard deviation of the list. If the list represents the entire population, pass it to pstdev() instead.

 

Q: How do I get the NumPy standard deviation of an array?

A: NumPy provides the numpy.std() function for calculating the standard deviation of an array.

By default, the standard deviation of the flattened array is computed, but you can also specify the axis parameter to compute the standard deviation along a specified axis.

 

Q: What does it mean to have a small standard deviation or a large standard deviation?

A: A small standard deviation indicates that values are close to the mean of the data set, whereas a large standard deviation suggests that values are spread out wide from the mean. It reflects the degree of dispersion in the data.

 

Q: Does the NumPy standard deviation function calculate both the sample and population standard deviation?

A: Yes, you can calculate both the sample and population standard deviation using NumPy. By default, the numpy.std() function calculates the population standard deviation.

To calculate the sample standard deviation, you would need to set the optional parameter, ddof (delta degrees of freedom), to 1.

 
