# Python standard deviation tutorial

The standard deviation measures how spread out the numbers in a data set are. A large standard deviation shows that the elements lie far from their mean value; a small one shows that they don't deviate significantly from the mean.

In this tutorial, we will calculate the standard deviation using Python.


## Terminology

There are two standard deviation notions in statistics.

One is the **population standard deviation**. It computes the spread directly from all values in a population. You use it when the values you have at hand represent the entire population.

Another one is the **sample standard deviation**. It tries to estimate the population spread by using only a sample subset of values. You use it when the values you have at hand represent just a subset of the entire population.

The sample standard deviation is an approximate measure. It's useful because the data population is frequently too big, and we can only directly measure a random sample of it.

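Population and sample standard deviations are calculated using slightly different formulas: the population version divides the sum of squared deviations by the number of values N, while the sample version divides it by N - 1. Hence, when programming, you should always keep in mind which one you want to compute and call the appropriate APIs.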
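Written out, and assuming the usual notation (N values x_1 ... x_N, population mean μ, sample mean x̄, none of which appear in the text above), the two definitions differ only in the divisor:

$$
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}
\qquad
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}
$$

Here σ is the population standard deviation and s is the sample standard deviation.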

## Standard deviation in Python

Since version 3.4, Python has included a lightweight statistics module in its standard library. This module provides many useful functions for statistical computations.

There is also NumPy, a full-featured numerical computing package that is especially popular among data scientists.

NumPy has more features but is also a heavier dependency for your code.

## Calculate for a list

Computing the **sample** standard deviation of a list of values in Python can be accomplished with the statistics.stdev() function.

    import statistics

    statistics.stdev([5.12, -34.11, 32.43, -1.3, 7.83, -0.32])

The **population** standard deviation is computed using a slightly different function, statistics.pstdev().

    import statistics

    statistics.pstdev([5.12, -34.11, 32.43, -1.3, 7.83, -0.32])

In the examples that follow, we'll show how to apply the statistics.stdev() function to different Python data types. If you need to calculate the population standard deviation, use the statistics.pstdev() function instead. The rest of the code stays the same.

Another option for computing the standard deviation of a list of values in Python is to use the NumPy package.

It doesn't come with Python by default, so you need to install it separately. The usual way of installing third-party packages in Python is to use pip, the Python package installer.

    pip3 install numpy

After you have installed NumPy, computing the standard deviation is trivial. Note that numpy.std computes the **population** standard deviation by default.

    import numpy

    numpy.std([5.12, -34.11, 32.43, -1.3, 7.83, -0.32])

If you want to compute a **sample** standard deviation using the NumPy package, you have to pass an additional argument, ddof, with a value of 1. ddof stands for *delta degrees of freedom*: NumPy divides the sum of squared deviations by N - ddof, so ddof=1 gives the sample formula used to estimate a population's spread from a sample.

    import numpy

    numpy.std([5.12, -34.11, 32.43, -1.3, 7.83, -0.32], ddof=1)
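To make the role of ddof concrete, here is a minimal pure-Python sketch of the same idea. The helper name stdev_with_ddof is hypothetical (it is not part of NumPy or the statistics module); it only illustrates how the divisor N - ddof switches between the two formulas.

```python
import math
import statistics


def stdev_with_ddof(values, ddof=0):
    """Standard deviation with divisor N - ddof (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))


data = [5.12, -34.11, 32.43, -1.3, 7.83, -0.32]

# ddof=0 divides by N: the population standard deviation.
print(stdev_with_ddof(data), statistics.pstdev(data))

# ddof=1 divides by N - 1: the sample standard deviation.
print(stdev_with_ddof(data, ddof=1), statistics.stdev(data))
```

Each line should print two values that agree up to floating-point rounding.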

## Calculate for an array

If you work with large numeric data sets, Python's array type (from the standard array module) can be more memory-efficient than the more popular list, because it stores values of a single type compactly.

Note that, unlike NumPy arrays, array objects do not support element-wise arithmetic: multiplying an array by an integer repeats its contents, just as with a list. To apply an operation to each value independently, use a NumPy array or a comprehension.

In the example below, we pass the 'd' type code to the array constructor to indicate that our values are double-precision floats.

    import statistics
    from array import array

    statistics.pstdev(array('d', [5.12, -34.11, 32.43, -1.3, 7.83, -0.32]))

numpy.std works on array values as well.

    import numpy
    from array import array

    numpy.std(array('d', [5.12, -34.11, 32.43, -1.3, 7.83, -0.32]), ddof=1)
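As a quick check of how arithmetic behaves on the standard-library array type, the sketch below shows that the * operator repeats the contents rather than scaling them, and that an explicit generator expression is one way to get element-wise behavior. The variable names are illustrative only.

```python
import statistics
from array import array

values = array('d', [5.12, -34.11, 32.43, -1.3, 7.83, -0.32])

# `*` on array.array repeats the contents (like a list); it does not scale them.
repeated = values * 2
print(len(values), len(repeated))  # 6 12

# Element-wise doubling needs an explicit loop or generator expression.
doubled = array('d', (v * 2 for v in values))

# Doubling every value doubles the standard deviation as well.
print(statistics.pstdev(values), statistics.pstdev(doubled))
```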

## Calculate for dictionary values

Sometimes your data is stored in a key-value data structure like the Python dict, rather than a sequential data structure like a list.

For example, you can have a data structure that maps students to their test scores, as in the example below.

If you want to compute the standard deviation of the test scores across all students, you can do it by calling statistics.pstdev on the dictionary values, without the keys. For that, call the dictionary's values() method.

    import statistics

    scores = {'Kate': 73, 'Alex': 56, 'Cindy': 98}
    statistics.pstdev(scores.values())
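A related situation is a dictionary that maps each key to several values. The sketch below uses made-up data (several hypothetical test scores per student, not taken from the example above) and computes one sample standard deviation per student with a dict comprehension.

```python
import statistics

# Hypothetical data: several test scores per student.
scores_by_student = {
    'Kate': [73, 84, 79],
    'Alex': [56, 99, 70],
    'Cindy': [98, 95, 97],
}

# Sample standard deviation of each student's own scores.
spread_by_student = {
    name: statistics.stdev(scores)
    for name, scores in scores_by_student.items()
}

# The student whose scores vary the most.
most_variable = max(spread_by_student, key=spread_by_student.get)
print(most_variable)  # Alex
```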

## Calculate for a matrix

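For dealing with matrices, it is best to resort to the NumPy package. NumPy provides a numpy.matrix data type specifically designed for working with matrices. (The NumPy documentation now recommends regular arrays over numpy.matrix for new code, but the class keeps these examples short.)

Let's generate a square 4×4 matrix. In the string form, values in a row are separated by spaces and rows are separated by semicolons.

    import numpy

    m = numpy.matrix('4 7 2 6; 3 6 2 6; 0 0 1 3; 4 6 1 3')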

With matrices, there are three ways to compute standard deviations.

You can compute standard deviations by column (numpy.matrix.std(0)), by row (numpy.matrix.std(1)), or for all elements, as if the matrix were a vector (numpy.matrix.std()). As with numpy.std, these are population standard deviations unless you pass ddof=1.

    import numpy

    m = numpy.matrix('4 7 2 6; 3 6 2 6; 0 0 1 3; 4 6 1 3')

    m.std(0)  # by column
    m.std(1)  # by row
    m.std()   # for all elements
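To see what the three variants aggregate over, here is a rough pure-Python equivalent that uses a list of lists instead of a NumPy matrix. It is only a sketch for illustration; the names by_column, by_row, and overall are not NumPy APIs.

```python
import statistics

rows = [
    [4, 7, 2, 6],
    [3, 6, 2, 6],
    [0, 0, 1, 3],
    [4, 6, 1, 3],
]

# Axis 0: one value per column (zip(*rows) transposes rows into columns).
by_column = [statistics.pstdev(column) for column in zip(*rows)]

# Axis 1: one value per row.
by_row = [statistics.pstdev(row) for row in rows]

# No axis: flatten the matrix and treat it as a single vector.
overall = statistics.pstdev([value for row in rows for value in row])

print(by_column)
print(by_row)
print(overall)
```

For instance, the third column is 2, 2, 1, 1, so its population standard deviation is 0.5.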

## Calculate for Pandas Series

pandas.Series is a one-dimensional array with axis labels. It builds on top of numpy.ndarray.

One of its applications is for working with time-series data.

Calculating the **sample** standard deviation from pandas.Series is easy.

    import pandas

    s = pandas.Series([12, 43, 12, 53])
    s.std()

If you need to calculate the **population** standard deviation, just pass in an additional ddof argument like below.

    import pandas

    s = pandas.Series([12, 43, 12, 53])
    s.std(ddof=0)

## Calculate for Pandas DataFrame

pandas.DataFrame is a two-dimensional tabular data structure, which allows us to easily perform arithmetic operations on both rows and columns.

Its closest analogy in pure Python is the dict data type.

Let's create a DataFrame object that represents students' test scores, as we did in the dict example above.

    import pandas

    scores = {
        'Name': ['Kate', 'Alex', 'Cindy'],
        'Math Score': [73, 56, 98],
        'History Score': [84, 99, 95]}

    df = pandas.DataFrame(scores)

Now we can calculate **sample** standard deviations for each subject, namely Math and History. By default, std() works along axis 0: it aggregates down the rows and returns one value per column. We pass numeric_only=True so that the non-numeric Name column is skipped; recent pandas versions no longer drop it silently.

    import pandas

    scores = {
        'Name': ['Kate', 'Alex', 'Cindy'],
        'Math Score': [73, 56, 98],
        'History Score': [84, 99, 95]}

    df = pandas.DataFrame(scores)
    df.std(numeric_only=True)

Alternatively, we can compute **sample** standard deviations by person. For that, we'll pass an additional axis argument with a value equal to 1. In this case, the calculation runs across the columns and returns one value per row.

    import pandas

    scores = {
        'Name': ['Kate', 'Alex', 'Cindy'],
        'Math Score': [73, 56, 98],
        'History Score': [84, 99, 95]}

    df = pandas.DataFrame(scores)
    df.std(axis=1, numeric_only=True)

In the output, row 1 (Alex) has the highest standard deviation, about 30.4. That makes sense because the spread in his scores is much bigger than in Kate's and Cindy's.

All of the above were **sample** standard deviations. To compute a **population** standard deviation, pass an additional ddof argument with a value equal to 0 as usual.

    import pandas

    scores = {
        'Name': ['Kate', 'Alex', 'Cindy'],
        'Math Score': [73, 56, 98],
        'History Score': [84, 99, 95]}

    df = pandas.DataFrame(scores)
    df.std(ddof=0, numeric_only=True)

In the following two sections, we will focus on the differences between the standard deviation and other statistical aggregate measures, namely the mean (average) and the median.

## Standard deviation vs. mean (average)

As mentioned above, the standard deviation is a measure of how spread out numbers in a data set are. Another interpretation of standard deviation is how far, typically, the elements in a data set are from the **mean** value of this data set.

What is the **mean**? The mean is a single number that summarizes the average value of a data set. It's obtained by summing up all numbers in a data set and dividing the result by the quantity of these numbers (i.e., the size of the data set).

Below is an example of how you would obtain a mean number for a data set. You can also see that the standard deviation value for this data set is quite different from its mean value.

    import statistics

    dataset = [2, 4, 5, 1, 6]

    mean = sum(dataset) / len(dataset)
    print(mean)

    std_dev = statistics.stdev(dataset)
    print(std_dev)
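Because the two measures answer different questions, two data sets can share the same mean and still have very different standard deviations. The small sketch below uses two made-up data sets (tight and wide are illustrative names) to show this.

```python
import statistics

tight = [4, 5, 6]   # values close to the mean
wide = [0, 5, 10]   # same mean, values far from it

print(statistics.mean(tight), statistics.mean(wide))    # 5 5
print(statistics.stdev(tight), statistics.stdev(wide))  # 1.0 5.0
```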

## Standard deviation vs. median

The **median** is another aggregate measure in statistics. It is meant to express the notion of an *average* number. However, it's different from the *mean*.

Imagine that you have a data set, and you arranged all numbers in this data set in non-decreasing order. For example, [1, 2, 4, 5, 6].

You can see that four falls right into the middle of this sorted data set. Such a number, which stands in the middle of a data set after we have arranged it in non-decreasing order, is called the **median** value of this data set.

If the size of the data set is even, as in [1, 2, 4, 5, 6, 7], you will end up having two numbers in the middle, in this case, 4 and 5. In such a case, you compute the **median** value as the *mean* value of these two numbers, i.e., 4.5 in this example.

Below is an example of calculating the median value for a data set. Note that it's again quite different from the standard deviation.

    import statistics

    odd_dataset = [2, 4, 5, 1, 6]
    odd_median = statistics.median(odd_dataset)
    print(odd_median)

    even_dataset = [2, 4, 5, 1, 6, 7]
    even_median = statistics.median(even_dataset)
    print(even_median)

    odd_std_dev = statistics.stdev(odd_dataset)
    print(odd_std_dev)

    even_std_dev = statistics.stdev(even_dataset)
    print(even_std_dev)
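The procedure described above (sort, take the middle value, or average the two middle values) can be sketched by hand as follows. The function name manual_median is hypothetical; in practice statistics.median already does this for you.

```python
import statistics


def manual_median(values):
    """Middle value of the sorted data; mean of the two middle values if even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(manual_median([2, 4, 5, 1, 6]))     # 4
print(manual_median([2, 4, 5, 1, 6, 7]))  # 4.5
```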

## Pooled standard deviation

Sometimes, when you have multiple samples of your data, you will want to estimate the standard deviation of your population using all of those sample standard deviations. This is the scenario where the **pooled** standard deviation comes in handy.

The **pooled** standard deviation is the square root of a weighted average of your *sample* variances (the squared sample standard deviations). Each sample's variance is weighted by its size minus one, so the more items there are in a sample, the more weight it gets in the computation of the pooled standard deviation.

Below is an example of how one can compute the pooled standard deviation.

    import math
    import statistics

    sample1 = [1, 2, 3]
    sample2 = [1, 2, 3, 10, 20, 30]

    s1 = statistics.stdev(sample1)
    print(s1)

    s2 = statistics.stdev(sample2)
    print(s2)

    # Weight each sample variance by its degrees of freedom (n - 1)
    pooled_std = math.sqrt(
        ((len(sample1) - 1) * (s1 ** 2) + (len(sample2) - 1) * (s2 ** 2))
        / (len(sample1) - 1 + len(sample2) - 1))
    print(pooled_std)
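The same computation extends naturally to more than two samples. The sketch below wraps it in a hypothetical helper, pooled_stdev, which accepts any number of samples; it assumes each sample has at least two values so that its sample variance is defined.

```python
import math
import statistics


def pooled_stdev(*samples):
    """Pooled standard deviation of any number of samples (hypothetical helper)."""
    # Sum of sample variances, each weighted by its degrees of freedom (n - 1).
    weighted_variances = sum(
        (len(sample) - 1) * statistics.variance(sample) for sample in samples
    )
    degrees_of_freedom = sum(len(sample) - 1 for sample in samples)
    return math.sqrt(weighted_variances / degrees_of_freedom)


sample1 = [1, 2, 3]
sample2 = [1, 2, 3, 10, 20, 30]
print(pooled_stdev(sample1, sample2))
```

With a single sample, the helper reduces to the ordinary sample standard deviation.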

## Plot standard deviation and error bars

If you want to plot statistical data in Python, you can use the matplotlib 2D plotting library.

You can install *matplotlib* by running the following command in your terminal.

    pip3 install matplotlib

Let's create a plot of the mean values of students' scores by subject and use the standard deviation to draw *error bars* that show the spread around each mean.

Let's prepare the data as in the example below. Compute means and standard deviations of scores by subject.

    import statistics

    math_scores = [73, 56, 98, 23, 14]
    history_scores = [84, 99, 95, 34, 10]
    english_scores = [89, 98, 99, 67, 56]

    math_mean = statistics.mean(math_scores)
    history_mean = statistics.mean(history_scores)
    english_mean = statistics.mean(english_scores)

    math_stdev = statistics.stdev(math_scores)
    history_stdev = statistics.stdev(history_scores)
    english_stdev = statistics.stdev(english_scores)

    x = [0, 1, 2]
    y = [math_mean, history_mean, english_mean]
    yerr = [math_stdev, history_stdev, english_stdev]

Then plug x, y and yerr as inputs to the matplotlib.pyplot.errorbar() function. matplotlib.pyplot.show() will then display a nice error bar chart.

    import matplotlib.pyplot as plot

    # x, y and yerr are the lists prepared in the previous snippet
    plot.errorbar(x, y, yerr=yerr, linestyle='None', marker='^')
    plot.show()

I hope you find the tutorial useful. Keep coming back.

Mokhtar is the founder of LikeGeeks.com. He has worked as a Linux system administrator since 2010. He is responsible for maintaining, securing, and troubleshooting Linux servers for multiple clients around the world. He loves writing shell and Python scripts to automate his work.