A Gentle Introduction to Statistical Data Distributions
A sample of data will form a distribution, and by far the most well-known distribution is the Gaussian distribution, often called the Normal distribution.
The distribution provides a parameterized mathematical function that can be used to calculate the probability for any individual observation from the sample space. This function describes the grouping or the density of the observations and is called the probability density function. We can also calculate the likelihood of an observation having a value less than or equal to a given value. A summary of these relationships between observations is called a cumulative density function.
In this tutorial, you will discover the Gaussian and related distribution functions and how to calculate probability and cumulative density functions for each.
After completing this tutorial, you will know:
- A gentle introduction to standard distributions to summarize the relationship of observations.
- How to calculate and plot probability density and cumulative density functions for the Gaussian distribution.
- The Student t and Chi-squared distributions related to the Gaussian distribution.
Let’s get started.
Tutorial Overview
This tutorial is divided into 4 parts; they are:
- Distributions
- Gaussian Distribution
- Student’s t-Distribution
- Chi-Squared Distribution
Distributions
From a practical perspective, we can think of a distribution as a function that describes the relationship between observations in a sample space.
For example, we may be interested in the age of humans, with individual ages representing observations in the domain, and ages 0 to 125 the extent of the sample space. The distribution is a mathematical function that describes the relationship of observations of different ages.
A distribution is simply a collection of data, or scores, on a variable. Usually, these scores are arranged in order from smallest to largest and then they can be presented graphically.
Many data conform to well-known and well-understood mathematical functions, such as the Gaussian distribution. A function can fit the data with a modification of the parameters of the function, such as the mean and standard deviation in the case of the Gaussian.
Once a distribution function is known, it can be used as a shorthand for describing and calculating related quantities, such as likelihoods of observations, and plotting the relationship between observations in the domain.
Density Functions
Distributions are often described in terms of their density or density functions.
Density functions describe how the proportion of data, or the likelihood of observations, changes over the range of the distribution.
Two types of density functions are probability density functions and cumulative density functions.
- Probability Density function: calculates the relative likelihood (density) of observing a given value.
- Cumulative Density function: calculates the probability of an observation equal to or less than a value.
A probability density function, or PDF, can be used to calculate the likelihood of a given observation in a distribution. It can also be used to summarize the likelihood of observations across the distribution’s sample space. Plots of the PDF show the familiar shape of a distribution, such as the bell-curve for the Gaussian distribution.
Distributions are often defined in terms of their probability density functions with their associated parameters.
A cumulative density function, or CDF, is a different way of thinking about the likelihood of observed values. Rather than calculating the likelihood of a given observation as with the PDF, the CDF calculates the cumulative likelihood for the observation and all prior observations in the sample space. It allows you to quickly understand and comment on how much of the distribution lies before and after a given value. A CDF is often plotted as a curve from 0 to 1 for the distribution.
Both PDFs and CDFs are continuous functions. The equivalent of a PDF for a discrete distribution is called a probability mass function, or PMF.
Next, let’s look at the Gaussian distribution and two other distributions related to the Gaussian that you will encounter when using statistical methods. We will look at each in turn in terms of their parameters, probability, and cumulative density functions.
Gaussian Distribution
The Gaussian distribution, named for Carl Friedrich Gauss, is the focus of much of the field of statistics.
Data from many fields of study surprisingly can be described using a Gaussian distribution, so much so that the distribution is often called the “normal” distribution because it is so common.
A Gaussian distribution can be described using two parameters:
- mean: Denoted with the Greek lowercase letter mu, is the expected value of the distribution.
- variance: Denoted with the Greek lowercase letter sigma raised to the second power (because the units of the variable are squared), describes the spread of observation from the mean.
It is common to use a normalized calculation of the variance called the standard deviation:
- standard deviation: Denoted with the Greek lowercase letter sigma, describes the normalized spread of observations from the mean.
We can work with the Gaussian distribution via the norm SciPy module. The norm.pdf() function can be used to create a Gaussian probability density function with a given sample space, mean, and standard deviation.
The example below creates a Gaussian PDF with a sample space from -5 to 5, a mean of 0, and a standard deviation of 1. A Gaussian with these values for the mean and standard deviation is called the Standard Gaussian.
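A minimal sketch of such an example is given here, assuming NumPy, SciPy, and Matplotlib are installed. The variable names and the step size of 0.001 are illustrative choices, not taken from the text.

```python
# plot the Gaussian PDF (a sketch; names and step size are assumptions)
from numpy import arange
from matplotlib import pyplot
from scipy.stats import norm

# define the sample space from -5 to 5
sample_space = arange(-5, 5, 0.001)
# parameters of the standard Gaussian
mean = 0.0
stdev = 1.0
# calculate the density for every value in the sample space
pdf = norm.pdf(sample_space, mean, stdev)
# line plot of the density
pyplot.plot(sample_space, pdf)
pyplot.show()
```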
Running the example creates a line plot showing the sample space on the x-axis and the likelihood of each value on the y-axis. The line plot shows the familiar bell-shape for the Gaussian distribution.
The top of the bell shows the most likely value from the distribution, called the expected value or the mean, which in this case is zero, as we specified in creating the distribution.
The norm.cdf() function can be used to create a Gaussian cumulative density function.
The example below creates a Gaussian CDF for the same sample space.
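A sketch of this example follows, again assuming NumPy, SciPy, and Matplotlib. The variable names are illustrative, and norm.cdf() is called with its default mean of 0 and standard deviation of 1.

```python
# plot the Gaussian CDF (a sketch; names and step size are assumptions)
from numpy import arange
from matplotlib import pyplot
from scipy.stats import norm

# define the sample space from -5 to 5
sample_space = arange(-5, 5, 0.001)
# cumulative probability for every value (standard Gaussian by default)
cdf = norm.cdf(sample_space)
# line plot of the cumulative probability
pyplot.plot(sample_space, cdf)
pyplot.show()
# cumulative probability at a few values of interest
print(norm.cdf(0.0), norm.cdf(2.0))
```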
Running the example creates a plot showing an S-shape with the sample space on the x-axis and the cumulative probability on the y-axis.
We can see that a value of 2 covers about 98% of the observations, with only a very thin tail of the distribution beyond that point.
We can also see that the mean value of zero shows 50% of the observations before and after that point.
Student’s t-Distribution
The Student’s t-distribution, or just t-distribution for short, is named for the pseudonym “Student” used by William Sealy Gosset.
It is a distribution that arises when attempting to estimate the mean of a normal distribution with different sized samples. As such, it is a helpful shortcut when describing uncertainty or error related to estimating population statistics for data drawn from Gaussian distributions when the size of the sample must be taken into account.
Although you may not use the Student’s t-distribution directly, you may estimate values from the distribution required as parameters in other statistical methods, such as statistical significance tests.
The distribution can be described using a single parameter:
- number of degrees of freedom: denoted with the lowercase Greek letter nu (v), the number of degrees of freedom.
Key to the use of the t-distribution is knowing the desired number of degrees of freedom.
The number of degrees of freedom describes the number of pieces of information used to describe a population quantity. For example, the mean has n degrees of freedom as all n observations in the sample are used to calculate the estimate of the population mean. A statistical quantity that makes use of another statistical quantity in its calculation must subtract 1 from the degrees of freedom, such as the use of the mean in the calculation of the sample variance.
Observations in a Student’s t-distribution are calculated from observations in a normal distribution in order to describe the interval for the population’s mean in the normal distribution. Observations are calculated as:
t = (mean - mu) / (S / sqrt(n))
Where mean is the average of the observations x drawn from the Gaussian distribution, mu is the population mean being estimated, S is the sample standard deviation, and n is the total number of observations. The resulting observations form the t-observation with (n – 1) degrees of freedom.
In practice, if you require a value from a t-distribution in the calculation of a statistic, then the number of degrees of freedom will likely be n – 1, where n is the size of your sample drawn from a Gaussian distribution.
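To make this concrete, the sketch below draws a small Gaussian sample, computes the standard one-sample t statistic, and looks up values from the t-distribution with n – 1 degrees of freedom. The population mean of 50, standard deviation of 5, sample size of 10, and seed are all assumptions chosen for illustration.

```python
# compute a t statistic and query the t-distribution (illustrative values)
from numpy import mean, std, sqrt
from numpy.random import default_rng
from scipy.stats import t

# assumed population mean and sample size, chosen only for illustration
mu = 50.0
n = 10
# draw a small sample from a Gaussian with the assumed parameters
rng = default_rng(1)
sample = rng.normal(loc=mu, scale=5.0, size=n)
# t statistic: sample mean relative to mu, scaled by the standard error
t_stat = (mean(sample) - mu) / (std(sample, ddof=1) / sqrt(n))
# degrees of freedom for a sample of size n
dof = n - 1
# two-sided p-value and the 95% critical value from the t-distribution
p_value = 2.0 * t.sf(abs(t_stat), dof)
critical = t.ppf(0.975, dof)
print(t_stat, p_value, critical)
```

Note the use of ddof=1 so that the standard deviation is the sample estimate, which is what costs the one degree of freedom.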
Which specific distribution you use for a given problem depends on the size of your sample.
— Page 93, Statistics in Plain English, Third Edition, 2010.
SciPy provides tools for working with the t-distribution in the stats.t module. The t.pdf() function can be used to create a Student t-distribution with the specified degrees of freedom.
The example below creates a t-distribution using the sample space from -5 to 5 and (10,000 – 1) degrees of freedom.
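A sketch of such an example is shown here, assuming NumPy, SciPy, and Matplotlib. The second curve with 2 degrees of freedom is an addition for comparison and is not part of the description above.

```python
# plot the t-distribution PDF (a sketch; names are assumptions)
from numpy import arange
from matplotlib import pyplot
from scipy.stats import t

# define the sample space from -5 to 5
sample_space = arange(-5, 5, 0.001)
# degrees of freedom as described in the text
dof = 10000 - 1
# density for every value in the sample space
pdf = t.pdf(sample_space, dof)
# added for comparison only: a t-distribution with very few degrees of freedom
pdf_small = t.pdf(sample_space, 2)
pyplot.plot(sample_space, pdf, label='dof=9999')
pyplot.plot(sample_space, pdf_small, label='dof=2')
pyplot.legend()
pyplot.show()
```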
Running the example creates and plots the t-distribution PDF.
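We can see the familiar bell-shape to the distribution much like the normal. In general, the key difference is the fatter tails of the t-distribution, highlighting the increased likelihood of observations in the tails compared to that of the Gaussian. The effect is most visible with few degrees of freedom; with (10,000 – 1) degrees of freedom the curve is nearly indistinguishable from the standard Gaussian.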
The t.cdf() function can be used to create the cumulative density function for the t-distribution. The example below creates the CDF over the same range as above.
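A sketch of the CDF example follows, under the same assumptions as before (illustrative names, step size of 0.001).

```python
# plot the t-distribution CDF (a sketch; names are assumptions)
from numpy import arange
from matplotlib import pyplot
from scipy.stats import t

# define the sample space from -5 to 5
sample_space = arange(-5, 5, 0.001)
# degrees of freedom as described in the text
dof = 10000 - 1
# cumulative probability for every value in the sample space
cdf = t.cdf(sample_space, dof)
pyplot.plot(sample_space, cdf)
pyplot.show()
```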
Running the example, we see the familiar S-shaped curve as we see with the Gaussian distribution. With fewer degrees of freedom, the fatter tails give slightly softer transitions from zero-probability to one-probability.
Chi-Squared Distribution
The chi-squared distribution is denoted as the lowercase Greek letter chi (X) raised to the second power (X^2).
Like the Student’s t-distribution, the chi-squared distribution is also used in statistical methods on data drawn from a Gaussian distribution to quantify the uncertainty. For example, the chi-squared distribution is used in the chi-squared statistical tests for independence. In fact, the chi-squared distribution is used in the derivation of the Student’s t-distribution.
The chi-squared distribution has one parameter:
- degrees of freedom, denoted k.
An observation in a chi-squared distribution is calculated as the sum of k squared observations drawn from a Gaussian distribution.
chi = x_1^2 + x_2^2 + ... + x_k^2
Where chi is an observation that has a chi-squared distribution, x_1 to x_k are observations drawn from a standard Gaussian distribution, and k is the number of x observations, which is also the number of degrees of freedom for the chi-squared distribution.
Again, as with the Student’s t-distribution, data does not fit a chi-squared distribution; instead, observations are drawn from this distribution in the calculation of statistical methods for a sample of Gaussian data.
SciPy provides the stats.chi2 module for calculating statistics for the chi-squared distribution. The chi2.pdf() function can be used to calculate the chi-squared distribution for a sample space between 0 and 50 with 20 degrees of freedom. Recall that the sum of squared values must be positive, hence the need for a positive sample space.
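A sketch of such an example is given below, assuming NumPy, SciPy, and Matplotlib. The step size of 0.01 and the variable names are illustrative.

```python
# plot the chi-squared PDF (a sketch; names and step size are assumptions)
from numpy import arange
from matplotlib import pyplot
from scipy.stats import chi2

# define the positive sample space from 0 to 50
sample_space = arange(0, 50, 0.01)
# degrees of freedom
dof = 20
# density for every value in the sample space
pdf = chi2.pdf(sample_space, dof)
pyplot.plot(sample_space, pdf)
pyplot.show()
# location of the peak and the mean of the distribution
print(sample_space[pdf.argmax()], chi2.mean(dof))
```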
Running the example calculates the chi-squared PDF and presents it as a line plot.
With 20 degrees of freedom, we can see that the peak of the distribution is just short of the value 20 on the sample space, at 18 (the mode of a chi-squared distribution is k – 2), while the expected value is exactly 20. This is intuitive: the square of an observation from the standard Gaussian has an expected value of 1, so the sum of 20 squared observations has an expected value of 20, equal to the number of degrees of freedom.
Although the distribution has a bell-like shape, the distribution is not symmetric.
The chi2.cdf() function can be used to calculate the cumulative density function over the same sample space.
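A sketch of the CDF example follows, with the same illustrative sample space and names as the PDF sketch.

```python
# plot the chi-squared CDF (a sketch; names and step size are assumptions)
from numpy import arange
from matplotlib import pyplot
from scipy.stats import chi2

# define the positive sample space from 0 to 50
sample_space = arange(0, 50, 0.01)
# degrees of freedom
dof = 20
# cumulative probability for every value in the sample space
cdf = chi2.cdf(sample_space, dof)
pyplot.plot(sample_space, cdf)
pyplot.show()
```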
Running the example creates a plot of the cumulative density function for the chi-squared distribution.
The plot helps to see the likelihood for chi-squared values around 20, with the fat tail to the right of the distribution continuing on long after the end of the plot.
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
- Recreate the PDF and CDF plots for one distribution with a new sample space.
- Calculate and plot the PDF and CDF for the Cauchy and Laplace distributions.
- Look up and implement the equations for the PDF and CDF for one distribution from scratch.
Articles
- Probability density function on Wikipedia
- Cumulative distribution function on Wikipedia
- Probability mass function on Wikipedia
- Normal distribution on Wikipedia
- Student’s t-distribution on Wikipedia
- Chi-squared distribution on Wikipedia
Summary
In this tutorial, you discovered the Gaussian and related distribution functions and how to calculate probability and cumulative density functions for each.
Specifically, you learned:
- A gentle introduction to standard distributions to summarize the relationship of observations.
- How to calculate and plot probability density and cumulative density functions for the Gaussian distribution.
- The Student t and Chi-squared distributions related to the Gaussian distribution.
Continuous Probability Distributions for Machine Learning
The probability for a continuous random variable can be summarized with a continuous probability distribution.
Continuous probability distributions are encountered in machine learning, most notably in the distribution of numerical input and output variables for models and in the distribution of errors made by models. Knowledge of the normal continuous probability distribution is also required more generally in the density and parameter estimation performed by many machine learning models.
As such, continuous probability distributions play an important role in applied machine learning and there are a few distributions that a practitioner must know about.
In this tutorial, you will discover continuous probability distributions used in machine learning.
After completing this tutorial, you will know:
- The probability of outcomes for continuous random variables can be summarized using continuous probability distributions.
- How to parametrize, define, and randomly sample from common continuous probability distributions.
- How to create probability density and cumulative density plots for common continuous probability distributions.
Tutorial Overview
This tutorial is divided into four parts; they are:
- Continuous Probability Distributions
- Normal Distribution
- Exponential Distribution
- Pareto Distribution
Continuous Probability Distributions
A random variable is a quantity produced by a random process.
A continuous random variable is a random variable that has a real numerical value.
Each numerical outcome of a continuous random variable can be assigned a probability density.
The relationship between the events for a continuous random variable and their probabilities is called the continuous probability distribution and is summarized by a probability density function, or PDF for short.
Unlike a discrete random variable, the probability for a given outcome of a continuous random variable cannot be specified directly; instead, it is calculated as an integral (area under the curve) for a tiny interval around the specific outcome.
The probability of an event equal to or less than a given value is defined by the cumulative distribution function, or CDF for short. The inverse of the CDF is called the percentage-point function and gives the outcome at or below which a given proportion of the probability lies.
- PDF: Probability Density Function, returns the probability density of a given continuous outcome.
- CDF: Cumulative Distribution Function, returns the probability of a value less than or equal to a given outcome.
- PPF: Percent-Point Function, returns the outcome whose cumulative probability equals the given probability.
There are many common continuous probability distributions. The most common is the normal probability distribution. Many continuous probability distributions of interest belong to the so-called exponential family of distributions, which are just a collection of parameterized probability distributions (e.g. distributions that change based on the values of parameters).
Continuous probability distributions play an important role in machine learning from the distribution of input variables to the models, the distribution of errors made by models, and in the models themselves when estimating the mapping between inputs and outputs.
In the following sections, we will take a closer look at some of the more common continuous probability distributions.
Normal Distribution
The normal distribution is also called the Gaussian distribution (named for Carl Friedrich Gauss) or the bell curve distribution.
The distribution covers the probability of real-valued events from many different problem domains, making it a common and well-known distribution, hence the name “normal.” A continuous random variable that has a normal distribution is said to be “normal” or “normally distributed.”
Some examples of domains that have normally distributed events include:
- The heights of people.
- The weights of babies.
- The scores on a test.
The distribution can be defined using two parameters:
- Mean (mu): The expected value.
- Variance (sigma^2): The spread from the mean.
Often, the standard deviation is used instead of the variance. It is calculated as the square root of the variance, which puts it in the same units as the variable.
- Standard Deviation (sigma): The average spread from the mean.
A distribution with a mean of zero and a standard deviation of 1 is called a standard normal distribution, and often data is reduced or “standardized” to this for analysis for ease of interpretation and comparison.
We can define a distribution with a mean of 50 and a standard deviation of 5 and sample random numbers from this distribution. We can achieve this using the normal() NumPy function.
The example below samples and prints 10 numbers from this distribution.
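A minimal sketch of this example, assuming NumPy is installed; the variable names are illustrative.

```python
# sample a normal distribution (a sketch; names are assumptions)
from numpy.random import normal

# define the distribution
mu = 50
sigma = 5
n = 10
# draw n random numbers from the distribution
sample = normal(mu, sigma, n)
print(sample)
```

Because no seed is set, the numbers will differ on every run.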
Running the example prints 10 numbers randomly sampled from the defined normal distribution.
A sample of data can be checked to see if it is normally distributed by plotting it and checking for the familiar normal shape, or by using statistical tests. If the samples of observations of a random variable are normally distributed, then they can be summarized by just the mean and variance, calculated directly on the samples.
We can calculate the probability of each observation using the probability density function. A plot of these values would give us the tell-tale bell shape.
We can define a normal distribution using the norm() SciPy function and then calculate properties such as the moments, PDF, CDF, and more.
The example below calculates the probability for integer values between 30 and 70 in our distribution and plots the result, then does the same for the cumulative probability.
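A sketch of such an example follows, assuming SciPy and Matplotlib. Calling norm() with a mean and standard deviation returns a "frozen" distribution object whose pdf() and cdf() methods use those parameters; the variable names are illustrative.

```python
# pdf and cdf for a normal distribution (a sketch; names are assumptions)
from scipy.stats import norm
from matplotlib import pyplot

# define the distribution parameters
mu = 50
sigma = 5
# create a frozen distribution with these parameters
dist = norm(mu, sigma)
# integer outcomes in the range [30, 70]
values = list(range(30, 71))
# density of each outcome, then a line plot
probabilities = [dist.pdf(value) for value in values]
pyplot.plot(values, probabilities)
pyplot.show()
# cumulative probability of each outcome, then a line plot
cprobs = [dist.cdf(value) for value in values]
pyplot.plot(values, cprobs)
pyplot.show()
```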
Running the example first calculates the probability for integers in the range [30, 70] and creates a line plot of values and probabilities.
The plot shows the Gaussian or bell-shape with the peak of highest probability around the expected value or mean of 50 with a density of about 0.08.
The cumulative probabilities are then calculated for observations over the same range, showing that at the mean, we have covered about 50% of the expected values and very close to 100% after the value of about 65 or 3 standard deviations from the mean (50 + (3 * 5)).
In fact, the normal distribution has a heuristic or rule of thumb that defines the percentage of data covered by a given range by the number of standard deviations from the mean. It is called the 68-95-99.7 rule, which is the approximate percentage of the data covered by ranges defined by 1, 2, and 3 standard deviations from the mean.
For example, in our distribution with a mean of 50 and a standard deviation of 5, we would expect 95% of the data to be covered by values that are 2 standard deviations from the mean, or 50 – (2 * 5) and 50 + (2 * 5) or between 40 and 60.
We can confirm this by calculating the exact values using the percentage-point function.
The middle 95% would be defined by the percentage point function value for 2.5% at the low end and 97.5% at the high end, where 97.5 – 2.5 gives the middle 95%.
The complete example is listed below.
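A sketch of this calculation, assuming SciPy and using the ppf() method of a frozen norm distribution; the variable names are illustrative.

```python
# calculate the values that define the middle 95% (a sketch)
from scipy.stats import norm

# define the distribution parameters
mu = 50
sigma = 5
# create a frozen distribution with these parameters
dist = norm(mu, sigma)
# outcomes at the 2.5% and 97.5% cumulative probabilities
low_end = dist.ppf(0.025)
high_end = dist.ppf(0.975)
print('Middle 95%% between %.1f and %.1f' % (low_end, high_end))
```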
Running the example gives the exact outcomes that define the middle 95% of expected outcomes that are very close to our standard-deviation-based heuristics of 40 and 60.
An important related distribution is the Log-Normal probability distribution.
Exponential Distribution
The exponential distribution is a continuous probability distribution where a few outcomes are the most likely with a rapid decrease in probability to all other outcomes.
It is the continuous random variable equivalent to the geometric probability distribution for discrete random variables.
Some examples of domains that have exponential distribution events include:
- The time between clicks on a Geiger counter.
- The time until the failure of a part.
- The time until the default of a loan.
The distribution can be defined using one parameter:
- Scale (Beta): The mean and standard deviation of the distribution.
Sometimes the distribution is defined more formally with a parameter lambda or rate. The beta parameter is defined as the reciprocal of the lambda parameter (beta = 1/lambda).
- Rate (lambda): The average number of events per unit of the variable, the reciprocal of the mean.
We can define a distribution with a mean of 50 and sample random numbers from this distribution. We can achieve this using the exponential() NumPy function.
The example below samples and prints 10 numbers from this distribution.
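A minimal sketch of this example, assuming NumPy; the first argument of exponential() is the scale (beta), and the variable names are illustrative.

```python
# sample an exponential distribution (a sketch; names are assumptions)
from numpy.random import exponential

# define the distribution: beta is the scale, i.e. the mean
beta = 50
n = 10
# draw n random numbers from the distribution
sample = exponential(beta, n)
print(sample)
```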
Running the example prints 10 numbers randomly sampled from the defined distribution.
We can define an exponential distribution using the expon() SciPy function and then calculate properties such as the moments, PDF, CDF, and more.
The example below defines a range of observations between 50 and 70 and calculates the probability and cumulative probability for each and plots the result.
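A sketch of such an example follows, assuming SciPy and Matplotlib. One point needs care: expon() takes loc and scale arguments, and its first positional argument is loc. A mean of 50 therefore requires scale=50, whereas expon(50) describes a unit-scale distribution shifted to start at 50. Both are built below so the difference can be inspected; the names are illustrative.

```python
# pdf and cdf for an exponential distribution (a sketch; names are assumptions)
from scipy.stats import expon
from matplotlib import pyplot

# scale parameter (the mean of the distribution)
beta = 50
# exponential distribution with a mean of 50
dist = expon(scale=beta)
# for comparison: unit-scale distribution shifted to start at 50,
# which is what the positional call expon(50) would create
shifted = expon(loc=50)
# outcomes between 50 and 70
values = list(range(50, 70))
# density of each outcome, then a line plot
probabilities = [dist.pdf(value) for value in values]
pyplot.plot(values, probabilities)
pyplot.show()
# cumulative probability of each outcome, then a line plot
cprobs = [dist.cdf(value) for value in values]
pyplot.plot(values, cprobs)
pyplot.show()
# cumulative probability at 55 under each definition
print(dist.cdf(55), shifted.cdf(55))
```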
Running the example first creates a line plot of outcomes versus probabilities, showing a familiar exponential probability distribution shape.
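Next, the cumulative probabilities for each outcome are calculated and graphed as a line plot. Reaching almost 100% of the expected values by a value of 55 corresponds to a distribution that starts at 50 with a scale of 1; an exponential distribution with a scale (mean) of 50 has covered only about 67% of the expected values by 55.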
An important related distribution is the double exponential distribution, also called the Laplace distribution.
Pareto Distribution
A Pareto distribution is named after Vilfredo Pareto and may be referred to as a power-law distribution.
It is also related to the Pareto principle (or 80/20 rule) which is a heuristic for continuous random variables that follow a Pareto distribution, where 80% of the events are covered by 20% of the range of outcomes, e.g. most events are drawn from just 20% of the range of the continuous variable.
The Pareto principle is just a heuristic for a specific Pareto distribution, specifically the Pareto Type II distribution, that is perhaps most interesting and on which we will focus.
Some examples of domains that have Pareto distributed events include:
- The income of households in a country.
- The total sales of books.
- The scores by players on a sports team.
The distribution can be defined using one parameter:
- Shape (alpha): The steepness of the decrease in probability.
Values for the shape parameter are often small, such as between 1 and 3, with the Pareto principle given when alpha is set to 1.161.
We can define a distribution with a shape of 1.1 and sample random numbers from this distribution. We can achieve this using the pareto() NumPy function.
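A minimal sketch of this example, assuming NumPy; the variable names are illustrative. Note that NumPy's pareto() draws from the Pareto Type II (Lomax) form, so samples start at 0 rather than 1.

```python
# sample a pareto distribution (a sketch; names are assumptions)
from numpy.random import pareto

# define the distribution
alpha = 1.1
n = 10
# draw n random numbers from the distribution
sample = pareto(alpha, n)
print(sample)
```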
Running the example prints 10 numbers randomly sampled from the defined distribution.
We can define a Pareto distribution using the pareto() SciPy function and then calculate properties, such as the moments, PDF, CDF, and more.
The example below defines a range of observations between 1 and about 10 and calculates the probability and cumulative probability for each and plots the result.
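A sketch of such an example follows, assuming SciPy and Matplotlib. SciPy's pareto() uses the classical form with outcomes starting at 1, which is why the range begins there; the step of 0.1 and the names are illustrative.

```python
# pdf and cdf for a pareto distribution (a sketch; names are assumptions)
from scipy.stats import pareto
from matplotlib import pyplot

# define the distribution
alpha = 1.1
dist = pareto(alpha)
# outcomes from 1.0 up to 9.9 in steps of 0.1
values = [value / 10.0 for value in range(10, 100)]
# density of each outcome, then a line plot
probabilities = [dist.pdf(value) for value in values]
pyplot.plot(values, probabilities)
pyplot.show()
# cumulative probability of each outcome, then a line plot
cprobs = [dist.cdf(value) for value in values]
pyplot.plot(values, cprobs)
pyplot.show()
```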
Running the example first creates a line plot of outcomes versus probabilities, showing a familiar Pareto probability distribution shape.
Next, the cumulative probabilities for each outcome are calculated and graphed as a line plot, showing a rise that is less steep than the exponential distribution seen in the previous section.
Summary
In this tutorial, you discovered continuous probability distributions used in machine learning.
Specifically, you learned:
- The probability of outcomes for continuous random variables can be summarized using continuous probability distributions.
- How to parametrize, define, and randomly sample from common continuous probability distributions.
- How to create probability density and cumulative density plots for common continuous probability distributions.