Python standard deviation tutorial
Standard deviation is a metric that quantifies the level of diversity or scattering within a dataset. A small standard deviation means that the data points tend to be near the average value, whereas a large standard deviation indicates that the data points are widely dispersed.
In the following sections, we look at how to calculate and interpret standard deviations using Python.
Types of Standard Deviation
There are two types of standard deviation: population standard deviation and sample standard deviation.
The population standard deviation is used when your data includes all the members of the population. It simply calculates the root mean square deviation of the values from the mean.
The sample standard deviation is used when your data is a subset or a sample of the entire population. It assumes that your data does not include the entire set of possible data.
A sample is smaller than the population, but it should still be spread across the range of the population data.
Assume your population data is [2, 4, 6, 8, 10]. Samples that are spread across that range could be any of the following:
[2, 6, 8], [2, 8, 10], [2, 8], [2, 10]
When a sample is spread across the population's range like this, its standard deviation tends to be equal to or higher than the population standard deviation.
If the sample does not cover the full range of the population data, its standard deviation tends to come out lower than the population's.
As a result, any statistics calculated from the sample (like the mean or standard deviation) may not accurately reflect the true population values.
Examples of samples with poor coverage:
[2, 4], [2, 4, 6]
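To make the comparison concrete, here is a small sketch that computes the standard deviation of each example sample and compares it with the population value. The variable names (well_spread, poorly_spread) are only illustrative, and the sample lists are the ones given in this section.

```python
import statistics

population = [2, 4, 6, 8, 10]
population_sd = statistics.pstdev(population)

# Example samples taken from this section (names are illustrative)
well_spread = [[2, 6, 8], [2, 8, 10], [2, 8], [2, 10]]
poorly_spread = [[2, 4], [2, 4, 6]]

print(f"Population: {population_sd:.3f}")
for sample in well_spread + poorly_spread:
    sample_sd = statistics.stdev(sample)
    relation = "above" if sample_sd > population_sd else "below"
    print(f"{sample}: {sample_sd:.3f} ({relation} the population value)")
```

This only illustrates the pattern for these particular samples; it is not a general guarantee for every possible sample.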
Calculating Standard Deviation Manually
If you want to understand the underlying computations of standard deviation, you should implement it manually.
The formula for standard deviation is the square root of the variance, where variance is the average of the squared deviations from the mean.
Here is how we can do it:
1. Calculate the mean of the dataset.
2. For each number in the dataset, subtract the mean and square the result.
3. Calculate the average of these squared deviations.
4. Take the square root of the result from step 3.
Let’s implement this using the math module:
import math

data = [2, 4, 6, 8, 10]

# Step 1: Calculate the mean
mean = sum(data) / len(data)

# Step 2 and 3: Calculate the average of the squared deviations
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Step 4: Calculate the standard deviation
std_dev = math.sqrt(variance)
print("Population Standard Deviation:", std_dev)
Output:
Population Standard Deviation: 2.8284271247461903
Calculating the sample standard deviation is slightly different from the population standard deviation.
The formula for the sample standard deviation is also the square root of the variance, but instead of dividing by the number of data points (N), we divide by N-1.
import math

data = [2, 8, 10]

# Step 1: Calculate the mean
mean = sum(data) / len(data)

# Step 2 and 3: Sum the squared deviations and divide by N - 1
variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)

# Step 4: Calculate the standard deviation
std_dev = math.sqrt(variance)
print("Sample Standard Deviation:", std_dev)
Output:
Sample Standard Deviation: 4.163331998932266
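The two manual versions differ only in the divisor, so they can be folded into one helper. The sketch below is one possible way to do that; the function name std_dev and its ddof argument are hypothetical choices for this example (ddof mirrors the NumPy parameter name discussed later in this tutorial).

```python
import math
import statistics


def std_dev(values, ddof=0):
    """Standard deviation with a divisor of N - ddof (0 = population, 1 = sample)."""
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more data points than ddof")
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - ddof))


data = [2, 8, 10]
print("Population formula:", std_dev(data))
print("Sample formula:", std_dev(data, ddof=1))

# Cross-check against the standard library
print(math.isclose(std_dev(data, ddof=1), statistics.stdev(data)))
```

Checking the result against statistics.stdev() is a cheap way to confirm that a hand-written formula divides by the intended value.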
Using statistics.stdev()
The Python standard library’s statistics module provides a convenient function stdev() to calculate the sample standard deviation. Here’s an example:
import statistics

data = [2, 8, 10]

# Calculate the sample standard deviation
std_dev = statistics.stdev(data)
print("Standard Deviation:", std_dev)
Output:
Standard Deviation: 4.163331998932265
In this example, we first import the statistics module and then use its stdev() function to calculate the standard deviation of the data list.
The result (4.16) is a measure of the typical distance between the data points and the mean. Strictly speaking, it is the square root of the summed squared deviations divided by N-1, not the plain average distance.
This is the standard deviation calculated for a sample. If you’re working with the entire population, use statistics.pstdev() instead.
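Two properties of stdev() are worth seeing in code: it equals the square root of statistics.variance(), and it needs at least two data points because the divisor is N-1. The following is a minimal sketch; the variable single_point_error is just an illustrative name.

```python
import math
import statistics

data = [2, 8, 10]

# stdev() is the square root of the sample variance
matches_variance = math.isclose(
    statistics.stdev(data), math.sqrt(statistics.variance(data))
)
print("Matches sqrt(variance):", matches_variance)

# A single value has no sample standard deviation (N - 1 would be zero)
single_point_error = None
try:
    statistics.stdev([5])
except statistics.StatisticsError as error:
    single_point_error = error
    print("Cannot compute:", error)
```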
Using statistics.pstdev()
To calculate the population standard deviation using Python’s built-in statistics module, you can use the pstdev() function.
This function works similarly to stdev(), but it calculates the standard deviation assuming that the provided data represents the entire population.
Here’s an example:
import statistics

# Population data
data = [2, 4, 6, 8, 10]

# Calculate population standard deviation
std_dev = statistics.pstdev(data)
print("Standard Deviation:", std_dev)
Output:
Standard Deviation: 2.8284271247461903
In this example, we’re computing the standard deviation of data using statistics.pstdev(). The resulting standard deviation (2.83) tells us that the values in data typically deviate from the mean by about 2.83.
Remember, this function should be used when your data represents the entire population, not just a sample.
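Because pstdev() divides by N and stdev() divides by N-1, the two results for the same data differ by a fixed factor of sqrt(N / (N - 1)). The sketch below checks that relationship; the variable names are illustrative.

```python
import math
import statistics

data = [2, 4, 6, 8, 10]
n = len(data)

population_sd = statistics.pstdev(data)
sample_sd = statistics.stdev(data)

# Rescaling the population result by sqrt(N / (N - 1)) gives the sample result
rescaled = population_sd * math.sqrt(n / (n - 1))

print("Population:", population_sd)
print("Sample:", sample_sd)
print("Rescaled population:", rescaled)
```

The factor shrinks towards 1 as N grows, which is why the choice between the two functions matters most for small datasets.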
Using numpy.std()
The np.std() function in NumPy allows you to compute the standard deviation.
This function is versatile: it not only calculates the standard deviation of a list or a NumPy array, but also allows calculations along a specified axis for multi-dimensional arrays.
Here’s how you can use np.std():
import numpy as np

# Data as a NumPy array
data = np.array([2, 4, 6, 8, 10])

# Calculate standard deviation (population formula by default)
std_dev = np.std(data)
print("Standard Deviation:", std_dev)
Output:
Standard Deviation: 2.8284271247461903
By default, np.std() calculates the population standard deviation. If you want to calculate the sample standard deviation instead, you can set the ddof (Delta Degrees of Freedom) parameter to 1:
import numpy as np

# Calculate sample standard deviation
data = np.array([2, 8, 10])
sample_std_dev = np.std(data, ddof=1)
print("Sample Standard Deviation:", sample_std_dev)
Output:
Sample Standard Deviation: 4.163331998932266
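Here the result is larger than in the previous example, partly because the data is different. For the same data, dividing by N-1 instead of N always gives a larger value, a correction that accounts for the extra uncertainty of estimating from a sample.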
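The axis parameter mentioned earlier can be illustrated with a small two-dimensional array. This is a minimal sketch with made-up data; the array values and variable names are not from the original examples.

```python
import numpy as np

# Hypothetical 2-D data: 2 rows, 3 columns
data = np.array([[2, 4, 6],
                 [8, 10, 12]])

overall = np.std(data)              # all six values treated as one dataset
per_column = np.std(data, axis=0)   # one value per column
per_row = np.std(data, axis=1)      # one value per row

print("Overall:", overall)
print("Per column:", per_column)
print("Per row:", per_row)
```

With axis=0 the calculation runs down each column, and with axis=1 it runs across each row; ddof can be combined with either.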
Using pandas.DataFrame.std()
The pandas.DataFrame.std() function is built to handle standard deviation calculations for large DataFrames, column by column.
Here’s a simple example:
import pandas as pd

data = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 4, 6, 8, 10],
    'C': [3, 6, 9, 12, 15]
})

# Calculate standard deviation (sample formula by default)
std_dev = data.std()
print("Standard Deviation: ")
print(std_dev)
Output:
Standard Deviation: 
A    1.581139
B    3.162278
C    4.743416
dtype: float64
In this example, pandas.DataFrame.std() computes the standard deviation for each column in the DataFrame and returns a pandas Series with one value per column.
By default, this function calculates the sample standard deviation.
If you want to compute the population standard deviation, you can set the ddof parameter to 0:
# Calculate population standard deviation
# (continues with the DataFrame from the previous example)
pop_std_dev = data.std(ddof=0)
print("Population Standard Deviation: ")
print(pop_std_dev)
Output:
Population Standard Deviation: 
A    1.414214
B    2.828427
C    4.242641
dtype: float64
As with the NumPy function, setting ddof to 0 calculates the population standard deviation, resulting in slightly smaller values.
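Besides the different ddof default, pandas and NumPy also treat missing values differently: pandas skips NaN by default, while np.std() lets it propagate. The sketch below uses a made-up Series to show this; np.nanstd() is NumPy's NaN-ignoring counterpart.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
values = pd.Series([2.0, 4.0, np.nan, 8.0, 10.0])

pandas_result = values.std()                       # skips NaN, ddof=1 by default
numpy_result = np.std(values.to_numpy(), ddof=1)   # NaN propagates to the result
nan_aware = np.nanstd(values.to_numpy(), ddof=1)   # ignores NaN explicitly

print("pandas:", pandas_result)
print("np.std:", numpy_result)
print("np.nanstd:", nan_aware)
```

If the two libraries disagree on real data, checking ddof and missing values first is usually a good starting point.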
Standard Deviation For the Friendship
It was a text from a friend, Adam, who was now a small-time investor.
The message read, “Hey, could you help me with something? I’ve been dabbling with some stocks, can you use your Python wizardry to check the volatility of the closing prices for one of my stocks over the last 10 days?”
Even though I wasn’t a financial guru, my skills in Python programming could indeed help.
“Sure,” I texted back, “Just send over the data.”
A minute later, another text from Adam popped up with the closing prices for his stock: 100, 105, 102, 106, 102, 107, 104, 103, 105, 106.
First, calculate the returns for each period. For stock prices, a simple return for each period can be calculated as (price_now - price_previous) / price_previous.
The return represents the percentage change in price from one period to the next.
Then, calculate the standard deviation of these returns.
import pandas as pd
import numpy as np

prices = pd.Series([100, 105, 102, 106, 102, 107, 104, 103, 105, 106])

# Calculate returns; the first entry is NaN (no previous price), so drop it
returns = prices.pct_change().dropna()

# Calculate volatility
volatility = np.std(returns, ddof=1)  # ddof=1 for sample standard deviation
print(f"Volatility: {volatility*100:.2f}%")
The pct_change() function calculates the fractional change between the current and a prior element. Its first value is NaN because the first price has no predecessor. The np.std() function then calculates the standard deviation of these returns.
We can achieve the same result without the pct_change() function:
import numpy as np

prices = np.array([100, 105, 102, 106, 102, 107, 104, 103, 105, 106])

# Calculate returns: each price change divided by the previous price
returns = np.diff(prices) / prices[:-1]

# Calculate volatility
volatility = np.std(returns, ddof=1)
print(f"Volatility: {volatility*100:.2f}%")
The output was 3.46%, a concise number representing the volatility of Adam’s chosen stock. The higher the standard deviation, the higher the volatility, and vice versa.
I quickly texted Adam: “Done. The volatility is 3.46%.”
A moment later, my phone buzzed with Adam’s reply: “You’re a lifesaver! This really helps!”
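For completeness, the same volatility figure can be reproduced with only the standard library. This is a minimal sketch that assumes the simple-return formula given above and uses statistics.stdev() for the sample standard deviation.

```python
import statistics

prices = [100, 105, 102, 106, 102, 107, 104, 103, 105, 106]

# Pair each price with the one before it to get simple returns
returns = [
    (now - previous) / previous
    for previous, now in zip(prices, prices[1:])
]

# Sample standard deviation of the nine returns
volatility = statistics.stdev(returns)
print(f"Volatility: {volatility*100:.2f}%")
```

Ten prices yield nine returns, so the sample formula divides by eight here.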
F.A.Q
Q: What is the Python statistics module?
A: The Python statistics module is a built-in library in Python used for mathematical statistics tasks.
It provides functions to calculate mathematical statistics of sample data, such as mean, median, mode, variance, and standard deviation, among others.
Q: How can I calculate the standard deviation of a list in Python?
A: To calculate the standard deviation of a list in Python using the statistics module, you would pass the list as a parameter to the stdev() function, which then returns the standard deviation of the list.
Q: How to get NumPy standard deviation of an array?
A: NumPy provides the numpy.std() function for calculating the standard deviation of an array.
By default, the standard deviation of the flattened array is computed, but you can also specify the axis parameter to compute the standard deviation along a specified axis.
Q: What does it mean to have a small standard deviation or a large standard deviation?
A: A small standard deviation indicates that values are close to the mean of the data set, whereas a large standard deviation suggests that values are spread out wide from the mean. It reflects the degree of dispersion in the data.
Q: Does NumPy standard deviation function calculate both the sample and population standard deviation?
A: Yes, you can calculate both the sample and population standard deviation using NumPy. By default, numpy.std() function calculates the population standard deviation.
To calculate the sample standard deviation, you would need to set the optional parameter, ddof (delta degrees of freedom), to 1.
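As a quick check of that answer, the sketch below compares numpy.std() with the two statistics functions on the same data. The variable names are illustrative only.

```python
import statistics

import numpy as np

data = [2, 4, 6, 8, 10]

# Default ddof=0 corresponds to the population formula
population_match = np.isclose(np.std(data), statistics.pstdev(data))

# ddof=1 corresponds to the sample formula
sample_match = np.isclose(np.std(data, ddof=1), statistics.stdev(data))

print("np.std(data) matches pstdev():", population_match)
print("np.std(data, ddof=1) matches stdev():", sample_match)
```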
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.