Python standard deviation tutorial
The standard deviation measures how spread out the numbers in a data set are. A large standard deviation shows that the values lie far from the mean of the data set, while a small one shows that they don’t deviate significantly from it. In this tutorial, we will calculate the standard deviation using Python.
Terminology
There are two standard deviation notions in statistics.
One is the population standard deviation. It computes the spread directly from all values in a population. You use it when the values you have at hand represent the entire population.
Another one is the sample standard deviation. It tries to estimate the population spread by using only a sample subset of values. You use it when the values you have at hand represent just a subset of the entire population.
The sample standard deviation is an approximate measure. It’s useful because the population is frequently too big to measure in full, and all we can directly measure is a random sample of it.
Population and sample standard deviations are calculated using slightly different formulas: the population version divides the sum of squared deviations from the mean by the number of values n, while the sample version divides it by n - 1. Hence, when programming, you should always keep in mind which one you want to compute and call the appropriate API.
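To make the difference concrete, here is a minimal sketch that computes both values by hand and compares them with the standard library. The variable names are only illustrative, and the data is an arbitrary list of numbers.

```python
import math
import statistics

values = [5.12, -34.11, 32.43, -1.3, 7.83, -0.32]

n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the mean, shared by both formulas.
squared_deviations = sum((x - mean) ** 2 for x in values)

# Population: divide by n. Sample: divide by n - 1.
population_std = math.sqrt(squared_deviations / n)
sample_std = math.sqrt(squared_deviations / (n - 1))

print(population_std, statistics.pstdev(values))
print(sample_std, statistics.stdev(values))
```

Each printed pair should agree up to floating-point rounding, and the sample value is always the larger of the two because it uses the smaller divisor.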
Standard deviation in Python
Since version 3.4, Python has included a lightweight statistics module in its standard library. This module provides many useful functions for statistical computations.
There is also NumPy, a full-featured numerical computing package that is especially popular among data scientists.
NumPy has more features, but it is also a heavier dependency for your code.
Calculate for a list
You can compute the sample standard deviation of a list of values in Python with the statistics.stdev() function.
import statistics
statistics.stdev([5.12, -34.11, 32.43, -1.3, 7.83, -0.32])
The population standard deviation is computed with a slightly different function, statistics.pstdev().
import statistics
statistics.pstdev([5.12, -34.11, 32.43, -1.3, 7.83, -0.32])
In the examples that follow, we’ll show how to apply the statistics.stdev() function to different Python data types. If you need the population standard deviation, use statistics.pstdev() instead. The rest of the code stays the same.
Another option for computing the standard deviation of a list of values in Python is the NumPy package.
It doesn’t come with Python by default, so you need to install it separately. The usual way of installing third-party packages in Python is to use pip, the Python package installer.
pip3 install numpy
Once NumPy is installed, computing the standard deviation is trivial. Note that numpy.std computes the population standard deviation by default.
import numpy
numpy.std([5.12, -34.11, 32.43, -1.3, 7.83, -0.32])
If you want to compute the sample standard deviation with NumPy, you have to pass the additional argument ddof with a value of 1. ddof stands for delta degrees of freedom: NumPy divides the sum of squared deviations by N - ddof, where N is the number of values, so ddof=1 gives the n - 1 divisor used when estimating a population’s spread from a sample.
import numpy
numpy.std([5.12, -34.11, 32.43, -1.3, 7.83, -0.32], ddof=1)
Calculate for an array
If you work with large data sets, Python arrays (the array type from the standard array module) may be more convenient than the more popular lists, because they store values of a single numeric type compactly.
Keep in mind that element-wise arithmetic, where an operation is applied to each value independently, is a feature of NumPy arrays. The standard library array only supports list-like operations such as concatenation and repetition.
In the example below, we pass the 'd' type code to the array constructor to indicate that our values are of type double.
import statistics
from array import array
statistics.pstdev(array('d', [5.12, -34.11, 32.43, -1.3, 7.83, -0.32]))
numpy.std works on array values as well.
import numpy
from array import array
numpy.std(array('d', [5.12, -34.11, 32.43, -1.3, 7.83, -0.32]), ddof=1)
Calculate for dictionary values
Sometimes your data is stored in a key-value data structure like the Python dict, rather than a sequential data structure like a list.
For example, you can have a data structure that maps students to their test scores, as in the example below.
If you want to compute the standard deviation of the test scores across all students, you can do it by calling statistics.pstdev on the dictionary values, without the keys. For that, call the dictionary’s values() method.
import statistics
scores = {'Kate': 73, 'Alex': 56, 'Cindy': 98}
statistics.pstdev(scores.values())
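One possible use of this value, sketched below, is to express each student’s score as a distance from the mean in units of standard deviation. The z_scores name is just an illustrative choice, not something the statistics module defines.

```python
import statistics

scores = {'Kate': 73, 'Alex': 56, 'Cindy': 98}

mean = statistics.mean(scores.values())
spread = statistics.pstdev(scores.values())

# How many standard deviations each score lies above (+) or below (-) the mean.
z_scores = {name: (score - mean) / spread for name, score in scores.items()}

for name, z in z_scores.items():
    print(f"{name}: {z:+.2f}")
```

The keys are used here only for labelling the output; the standard deviation itself still depends on the values alone.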
Calculate for a matrix
For dealing with matrices, it is best to resort to the NumPy package. NumPy provides a numpy.matrix data type designed for working with matrices. Note that the NumPy documentation no longer recommends this class for new code and suggests regular arrays (numpy.array) instead; std() accepts the same axis argument on both.
Let’s generate a square 4×4 matrix. In the string form, rows are separated by semicolons.
import numpy
m = numpy.matrix('4 7 2 6; 3 6 2 6; 0 0 1 3; 4 6 1 3')
With matrices, there are three ways to compute standard deviations.
You can compute standard deviations by column (m.std(0)), by row (m.std(1)), or for all elements, as if the matrix were a vector (m.std()).
import numpy
m = numpy.matrix('4 7 2 6; 3 6 2 6; 0 0 1 3; 4 6 1 3')
m.std(0)  # by column
m.std(1)  # by row
m.std()   # for all elements
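If the meaning of "by column" and "by row" is unclear, the following standard-library sketch reproduces the three calculations on a plain list of lists. It uses statistics.pstdev because NumPy’s std computes the population standard deviation by default; the variable names are illustrative.

```python
import statistics

rows = [
    [4, 7, 2, 6],
    [3, 6, 2, 6],
    [0, 0, 1, 3],
    [4, 6, 1, 3],
]

# zip(*rows) regroups the values so that each tuple holds one column.
by_column = [statistics.pstdev(column) for column in zip(*rows)]

# Each inner list is already one row.
by_row = [statistics.pstdev(row) for row in rows]

# Flatten the matrix to treat it as a single vector.
overall = statistics.pstdev([value for row in rows for value in row])

print(by_column)
print(by_row)
print(overall)
```

The by-column and by-row results each contain four numbers, one per column or row, while the last calculation returns a single number for all sixteen values.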
Calculate for Pandas Series
pandas.Series is a one-dimensional array with axis labels. It builds on top of numpy.ndarray.
One of its applications is for working with time-series data.
Calculating the sample standard deviation of a pandas.Series is easy, because std() uses ddof=1 by default.
import pandas
s = pandas.Series([12, 43, 12, 53])
s.std()
If you need the population standard deviation, pass ddof=0 as shown below.
import pandas
s = pandas.Series([12, 43, 12, 53])
s.std(ddof=0)
Calculate for Pandas DataFrame
pandas.DataFrame is a two-dimensional tabular data structure, which allows us to easily perform arithmetic operations on both rows and columns.
Its closest analogy in pure Python is the dict data type.
Let’s create a DataFrame object that represents students’ test scores, as we did in the dict example above.
import pandas
scores = {
    'Name': ['Kate', 'Alex', 'Cindy'],
    'Math Score': [73, 56, 98],
    'History Score': [84, 99, 95],
}
df = pandas.DataFrame(scores)
Now we can calculate the sample standard deviation for each subject, namely Math and History. By default, std() uses axis=0: it aggregates down the rows and returns one value per column. Passing numeric_only=True skips the non-numeric Name column, which recent pandas versions (2.0 and later) would otherwise reject with an error.
import pandas
scores = {
    'Name': ['Kate', 'Alex', 'Cindy'],
    'Math Score': [73, 56, 98],
    'History Score': [84, 99, 95],
}
df = pandas.DataFrame(scores)
df.std(numeric_only=True)
Alternatively, we can compute sample standard deviations by person. For that, we pass an additional axis argument with a value of 1. In this case, std() aggregates across the columns and returns one value per row.
import pandas
scores = {
    'Name': ['Kate', 'Alex', 'Cindy'],
    'Math Score': [73, 56, 98],
    'History Score': [84, 99, 95],
}
df = pandas.DataFrame(scores)
df.std(axis=1, numeric_only=True)
In the output, you can see that Alex (row 1) has the highest standard deviation, about 30.4. It makes sense because the spread in his scores is much bigger than the spread in Kate’s and Cindy’s.
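The per-student figures can be cross-checked without pandas. The sketch below assumes the same scores, split into two hypothetical dictionaries, and uses statistics.stdev because the numbers above are sample standard deviations.

```python
import statistics

math_scores = {'Kate': 73, 'Alex': 56, 'Cindy': 98}
history_scores = {'Kate': 84, 'Alex': 99, 'Cindy': 95}

# Sample standard deviation of each student's two scores.
per_student = {
    name: statistics.stdev([math_scores[name], history_scores[name]])
    for name in math_scores
}

for name, value in per_student.items():
    print(f"{name}: {value:.1f}")
```

With only two scores per student, the sample standard deviation is simply the gap between the scores divided by the square root of 2, which is why Alex’s 43-point gap produces the largest value.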
All of the above were sample standard deviations. To compute a population standard deviation, pass an additional ddof argument with a value of 0, as before.
import pandas
scores = {
    'Name': ['Kate', 'Alex', 'Cindy'],
    'Math Score': [73, 56, 98],
    'History Score': [84, 99, 95],
}
df = pandas.DataFrame(scores)
df.std(ddof=0, numeric_only=True)
In the following two sections, we will focus on the differences between the standard deviation and other statistical aggregate measures, namely the mean (average) and the median.
Standard deviation vs. mean (average)
As mentioned above, the standard deviation is a measure of how spread out the numbers in a data set are. Another way to read it is as the typical distance between the elements of a data set and its mean value.
What is the mean? The mean is the arithmetic average of a data set. It’s obtained by summing up all numbers in the data set and dividing the result by the quantity of these numbers (i.e., the size of the data set).
Below is an example of how you would obtain the mean of a data set. You can also see that the standard deviation of this data set is quite different from its mean, because the two measure different things: the mean describes the center, and the standard deviation describes the spread around it.
import statistics
dataset = [2, 4, 5, 1, 6]
mean = sum(dataset) / len(dataset)
print(mean)
std_dev = statistics.stdev(dataset)
print(std_dev)
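To see that the two measures are independent of each other, consider this small sketch with two made-up data sets that share the same mean but have very different spreads.

```python
import statistics

tight = [4, 5, 6]   # values close to the mean
wide = [0, 5, 10]   # values far from the mean

# Both data sets have the same mean but different standard deviations.
print(statistics.mean(tight), statistics.stdev(tight))
print(statistics.mean(wide), statistics.stdev(wide))
```

Knowing the mean alone tells you nothing about the spread, which is why the two are usually reported together.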
Standard deviation vs. median
The median is another aggregate measure in statistics. Like the mean, it expresses the notion of a typical value, but it is computed differently.
Imagine that you have a data set, and you arranged all numbers in this data set in non-decreasing order. For example [1, 2, 4, 5, 6].
You can see that 4 falls right in the middle of this sorted data set. The number that stands in the middle of a data set after it has been arranged in non-decreasing order is called the median value of this data set.
If the size of the data set is even, as in [1, 2, 4, 5, 6, 7], you will end up having two numbers in the middle, in this case, 4 and 5. In such a case, you compute the median value as the mean value of these two numbers, i.e., 4.5 in this example.
Below is an example of calculating the median value for a data set. Note that it’s again quite different from the standard deviation.
import statistics
odd_dataset = [2, 4, 5, 1, 6]
odd_median = statistics.median(odd_dataset)
print(odd_median)
even_dataset = [2, 4, 5, 1, 6, 7]
even_median = statistics.median(even_dataset)
print(even_median)
odd_std_dev = statistics.stdev(odd_dataset)
print(odd_std_dev)
even_std_dev = statistics.stdev(even_dataset)
print(even_std_dev)
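The sort-and-pick-the-middle procedure described above can be sketched by hand as follows. The median_by_hand function is a hypothetical helper written for illustration; in real code, statistics.median is the better choice.

```python
import statistics


def median_by_hand(values):
    """Hypothetical helper: median via sorting and picking the middle."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("median of an empty data set is undefined")
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd size: a single value sits in the middle.
        return ordered[middle]
    # Even size: take the mean of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_by_hand([2, 4, 5, 1, 6]), statistics.median([2, 4, 5, 1, 6]))
print(median_by_hand([2, 4, 5, 1, 6, 7]), statistics.median([2, 4, 5, 1, 6, 7]))
```

Both lines should print matching pairs, 4 for the odd-sized data set and 4.5 for the even-sized one.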
Pooled standard deviation
Sometimes, when you have multiple samples of your data, you will want to estimate the standard deviation of your population using all of those samples. This is the scenario where the pooled standard deviation comes in handy.
The pooled standard deviation is the square root of a weighted average of the sample variances (the squared sample standard deviations). Each sample is weighted by its size minus one, so the more items there are in a sample, the more weight it gets in the computation. The estimate assumes that all samples come from populations with the same spread.
Below is an example of how one can compute the pooled standard deviation.
import math
import statistics
sample1 = [1, 2, 3]
sample2 = [1, 2, 3, 10, 20, 30]
s1 = statistics.stdev(sample1)
print(s1)
s2 = statistics.stdev(sample2)
print(s2)
pooled_std = math.sqrt(((len(sample1) - 1) * (s1 ** 2) + (len(sample2) - 1) * (s2 ** 2)) / (len(sample1) - 1 + len(sample2) - 1))
print(pooled_std)
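The same calculation extends to any number of samples. Here is a minimal sketch, assuming a hypothetical pooled_stdev helper that is not part of the statistics module.

```python
import math
import statistics


def pooled_stdev(*samples):
    """Hypothetical helper: pooled standard deviation of one or more samples."""
    if not samples or any(len(sample) < 2 for sample in samples):
        raise ValueError("need at least one sample, each with two or more values")
    # Weight each sample variance by its degrees of freedom (size - 1).
    weighted_variances = sum(
        (len(sample) - 1) * statistics.variance(sample) for sample in samples
    )
    degrees_of_freedom = sum(len(sample) - 1 for sample in samples)
    return math.sqrt(weighted_variances / degrees_of_freedom)


print(pooled_stdev([1, 2, 3], [1, 2, 3, 10, 20, 30]))
```

Called with the two samples from the example above, it should print the same value as pooled_std. Working with variances directly avoids squaring the standard deviations again.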
Plot standard deviation and error bars
If you want to plot statistical data in Python, you can use the matplotlib 2D plotting library.
You can install matplotlib by running the following command in your terminal.
pip3 install matplotlib
Let’s create a plot of the mean values of students’ scores by subject and use the standard deviation to draw error bars that show the spread of scores around each mean.
First, prepare the data as in the example below: compute the means and standard deviations of the scores by subject.
import statistics
math_scores = [73, 56, 98, 23, 14]
history_scores = [84, 99, 95, 34, 10]
english_scores = [89, 98, 99, 67, 56]
math_mean = statistics.mean(math_scores)
history_mean = statistics.mean(history_scores)
english_mean = statistics.mean(english_scores)
math_stdev = statistics.stdev(math_scores)
history_stdev = statistics.stdev(history_scores)
english_stdev = statistics.stdev(english_scores)
x = [0, 1, 2]
y = [math_mean, history_mean, english_mean]
yerr = [math_stdev, history_stdev, english_stdev]
Then pass x, y, and yerr as inputs to the matplotlib.pyplot.errorbar() function. matplotlib.pyplot.show() will then display the error bar chart.
import matplotlib.pyplot as plot
plot.errorbar(x, y, yerr=yerr, linestyle='None', marker='^')
plot.show()
I hope you find the tutorial useful. Keep coming back.
Mokhtar is the founder of LikeGeeks.com. He has worked as a Linux system administrator since 2010. He is responsible for maintaining, securing, and troubleshooting Linux servers for multiple clients around the world. He loves writing shell and Python scripts to automate his work.