Data Science: Statistical Learning Measures of central tendency, dispersion, and correlation

Statistical Learning

Measures of central tendency, dispersion, and correlation

Outline

Raw Data
Frequency Distribution - Histograms
Cumulative Frequency Distribution
Measures of Central Tendency
Mean, Median, Mode
Measures of Dispersion
Range, IQR, Standard Deviation, coefficient of variation
Normal distribution, Chebyshev Rule.
Five-number summary, boxplots, Q-Q plots, quantile plot, scatter plot.
Visualization: scatter plot matrix.
Correlation analysis

Data versus Information

When analysts are bewildered by a plethora of data that makes no sense on the surface, they look for methods to classify the data so that it conveys meaning. The aim is to help them draw the right conclusions. Data needs to be arranged into information.

Raw Data

Raw Data represent numbers and facts in the original format in which the data have been collected. We need to convert the raw data into information for decision making.

Frequency distribution focuses on classifying raw data into information. It is a widely used data reduction technique in descriptive statistics.

HISTOGRAM

A histogram (also known as a frequency histogram) is a snapshot of the frequency distribution.

A histogram is a graphical representation of the frequency distribution in which the X-axis represents the classes and the Y-axis represents the frequencies, drawn as bars.

A histogram depicts the pattern of the distribution emerging from the characteristic being measured.

Histogram- Example

The inspection records of a hose assembly operation revealed a high level of rejection. An analysis of the records showed that leaks were a major contributing factor to the problem. It was decided to investigate the hose clamping operation. The hose clamping force (torque) was measured on twenty-five assemblies (figures in foot-pounds). The data are given below. Draw the frequency histogram and comment.

8	13	15	10	16
11	14	11	14	20
15	16	12	15	13
12	13	16	17	17
14	14	14	18	15

Histogram Example Solution
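The frequency distribution behind the histogram can be tabulated with a short Python sketch. It assumes one class per integer torque value, since the class width used in the original solution is not stated; the variable names are illustrative only.

```python
from collections import Counter

# Hose clamping torque (foot-pounds) for the 25 assemblies in the example.
torque = [
    8, 13, 15, 10, 16,
    11, 14, 11, 14, 20,
    15, 16, 12, 15, 13,
    12, 13, 16, 17, 17,
    14, 14, 14, 18, 15,
]

# Frequency of each observed value (assumption: one class per integer value).
frequency = Counter(torque)

# Text histogram: one row per class, one '#' per observation.
for value in range(min(torque), max(torque) + 1):
    count = frequency[value]  # Counter returns 0 for values never observed
    print(f"{value:>2} | {'#' * count} ({count})")
```

Under this assumption the bars rise to a single peak at 14 foot-pounds and fall away on both sides, with the values 8 and 20 sitting at the two ends.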

Cumulative Frequency Distribution

A cumulative frequency distribution is a type of frequency distribution that shows how many observations lie above or below the lower boundaries of the classes. It can be constructed from the previous example of hose clamping force (torque).
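A cumulative table for the torque data might be built as in the sketch below. It assumes one class per observed integer value and a "less than or equal to" accumulation; the original table is not reproduced here, so the class boundaries are an assumption.

```python
from collections import Counter
from itertools import accumulate

# Hose clamping torque (foot-pounds) from the histogram example.
torque = [
    8, 13, 15, 10, 16,
    11, 14, 11, 14, 20,
    15, 16, 12, 15, 13,
    12, 13, 16, 17, 17,
    14, 14, 14, 18, 15,
]

frequency = Counter(torque)
classes = sorted(frequency)                  # observed values, ascending
counts = [frequency[c] for c in classes]     # frequency per class
cumulative = list(accumulate(counts))        # running total: observations <= class
total = len(torque)

for c, f, cf in zip(classes, counts, cumulative):
    print(f"{c:>2}  freq={f}  cum={cf:>2}  cum%={100 * cf / total:5.1f}")
```

The last cumulative value always equals the number of observations, which is a quick check that no observation was dropped.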

What is Central Tendency?

Whenever you measure things of the same kind, a fairly large number of such measurements will tend to cluster around the middle value. Such a value is called a measure of "Central Tendency". The other terms that are used synonymously are "Measures of Location", or "Statistical Averages".

Arithmetic Mean

Arithmetic Mean (called mean) is defined as the sum of all observations in a data set divided by the total number of observations. For example, consider a data set containing the following observations:

In symbolic form, the mean is given by

X̄ = ∑X / n

X̄ = Arithmetic mean

∑X = Sum of all X values in the data set

n = Total number of observations (sample size)

Arithmetic Mean -Example

The inner diameters of a particular grade of tire, based on 5 sample measurements, are as follows (figures in millimeters):

565, 570, 572, 568, 585

Formula: X̄ = ∑X / n

We get mean = (565 + 570 + 572 + 568 + 585)/5 = 2860/5 = 572

Caution: Arithmetic Mean is affected by extreme values or fluctuations in sampling. It is not the best average to use when the data set contains extreme values (Very high or very low values).
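The calculation and the caution can both be checked with a few lines of Python. This is a minimal sketch; the helper name arithmetic_mean is illustrative, and dropping the value 585 is only a way to show how much one large observation moves the mean.

```python
# Inner diameters of the 5 sampled tires (millimeters).
tire_diameters_mm = [565, 570, 572, 568, 585]


def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


mean_all = arithmetic_mean(tire_diameters_mm)

# Recompute without the largest observation to see its influence.
without_largest = [x for x in tire_diameters_mm if x != max(tire_diameters_mm)]
mean_without_largest = arithmetic_mean(without_largest)

print(mean_all)              # 572.0
print(mean_without_largest)  # 568.75
```

A single value of 585 pulls the mean from 568.75 up to 572, which is larger than four of the five measurements.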

Median

Median is the middlemost observation when you arrange data in ascending order of magnitude. Median is such that 50% of the observations are above the median and 50% of the observations are below the median.

Median is a very useful measure for ranked data in the context of consumer preferences and rating. It is not affected by extreme values (greater resistance to outliers)

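Median = ((n + 1)/2)th value of the ranked data. When n is even, this position falls between two observations, and the median is taken as the average of the two middle values.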

n = Number of observations in the sample

Median- Example

Marks obtained by 7 students in Computer Science Exam are given below: Compute the median.

45 40 60 80 90 65 55

Ranking the data (here in descending order) gives

90 80 65 60 55 45 40

Median = ((n + 1)/2)th value in this set = ((7 + 1)/2)th observation = 4th observation = 60. Hence Median = 60 for this problem.
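The same result can be reproduced in Python, both by locating the ((n + 1)/2)th ranked value directly and with the standard library. This is a sketch; the positional shortcut shown applies only when n is odd.

```python
import statistics

# Marks of the 7 students in the example.
marks = [45, 40, 60, 80, 90, 65, 55]

ranked = sorted(marks)            # ascending order; the middle value is the same
n = len(ranked)

# For odd n, the median is the ((n + 1) / 2)th ranked value (1-based position).
position = (n + 1) // 2
median_by_position = ranked[position - 1]

# statistics.median also handles even n by averaging the two middle values.
median_builtin = statistics.median(marks)

print(position, median_by_position, median_builtin)
```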

Mode

Mode is that value which occurs most often. It has the maximum frequency of occurrence. Mode also has resistance to outliers.

The mode is a very useful measure for stocking decisions, for example when you want to keep in inventory the most popular shirt collar size during the festive season.

Mode -Example

The lives, in hours, of 10 flashlight batteries are as follows. Find the mode.

340	350	340	340	320	340	330	330	340	350

340 occurs five times. Hence, mode=340.
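Counting occurrences is easy to sketch in Python. The snippet below also collects every value that ties for the highest count, since a data set can have more than one mode; the variable names are illustrative.

```python
from collections import Counter

# Battery lives in hours from the example.
battery_life_hours = [340, 350, 340, 340, 320, 340, 330, 330, 340, 350]

counts = Counter(battery_life_hours)

# Highest frequency, and every value that reaches it (more than one if multi-modal).
highest_count = max(counts.values())
modes = sorted(value for value, c in counts.items() if c == highest_count)

print(modes, highest_count)
```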

Comparison of Mean, Median, Mode

Mean	Median	Mode
Defined as the arithmetic average of all observations in the data set.	Defined as the middle value in the data set arranged in ascending or descending order.	Defined as the most frequently occurring value in the distribution; it has the largest frequency.
Requires measurement on all observations.	Does not require measurement on all observations.	Does not require measurement on all observations.
Uniquely and comprehensively defined.	Cannot be uniquely determined under all conditions.	Not uniquely defined for multi-modal situations.

Comparison of Mean, Median, Mode Cont.

Mean	Median	Mode
Affected by extreme values.	Not affected by extreme values.	Not affected by extreme values.
Can be treated algebraically: means of several groups can be combined.	Cannot be treated algebraically: medians of several groups cannot be combined.	Cannot be treated algebraically: modes of several groups cannot be combined.
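The statement that means can be combined while medians cannot can be illustrated with a small sketch. The split of the tire diameters into two groups is hypothetical and chosen only for illustration; the combined mean is the size-weighted average of the group means.

```python
import statistics

# Hypothetical split of the tire diameter data (mm) into two groups.
group_a = [565, 570]
group_b = [572, 568, 585]

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)

# Group means combine exactly when weighted by group size.
combined_mean = (len(group_a) * mean_a + len(group_b) * mean_b) / (
    len(group_a) + len(group_b)
)
overall_mean = statistics.mean(group_a + group_b)

# The same weighting applied to group medians does not recover the overall median.
median_a, median_b = statistics.median(group_a), statistics.median(group_b)
weighted_medians = (len(group_a) * median_a + len(group_b) * median_b) / (
    len(group_a) + len(group_b)
)
overall_median = statistics.median(group_a + group_b)

print(combined_mean, overall_mean)       # equal
print(weighted_medians, overall_median)  # differ
```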

Measures of Dispersion

In simple terms, measures of dispersion indicate how large the spread of the distribution is around the central tendency. They answer the question "What is the magnitude of departure from the average value for different groups having identical averages?".

Range

Range is the simplest of all measures of dispersion. It is calculated as the difference between maximum and minimum value in the data set.

Range = X_max − X_min

Range-Example

The following data represent the percentage return on investment for 10 mutual funds per annum. Calculate Range.

12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9

Range = X_max − X_min = 18 − 9 = 9

Caution: If either the maximum or the minimum value is an extreme value, the range should not be used, because it then reflects that single observation rather than the spread of the data as a whole.
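A short sketch of the range calculation follows. The extra return of 45 is a hypothetical extreme value, added only to show how a single observation can dominate the range.

```python
# Annual percentage returns of the 10 mutual funds in the example.
returns_pct = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]


def data_range(values):
    """Difference between the largest and smallest observation."""
    return max(values) - min(values)


range_original = data_range(returns_pct)

# Hypothetical: one additional fund with an extreme return of 45%.
range_with_extreme = data_range(returns_pct + [45])

print(range_original)      # 9
print(range_with_extreme)  # 36
```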

Inter-Quartile Range (IQR)

IQR is the range computed on the middle 50% of the observations, after eliminating the highest 25% and the lowest 25% of observations in a data set arranged in ascending order. IQR is less affected by outliers.

IQR = Q₃ − Q₁

Interquartile Range-Example

The following data represent the annual percentage returns of 9 mutual funds.

Data Set: 12, 14, 11, 18, 10.5, 12, 14, 11, 9

Arranging in ascending order, the data set becomes 9, 10.5, 11, 11, 12, 12, 14, 14, 18

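With n = 9, Q₁ is the (n + 1)/4 = 2.5th value, halfway between 10.5 and 11, so Q₁ = 10.75. Q₃ is the 3(n + 1)/4 = 7.5th value, halfway between 14 and 14, so Q₃ = 14.

IQR = Q₃ − Q₁ = 14 − 10.75 = 3.25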
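The quartiles in this example are consistent with placing the p-th quantile at position p(n + 1) in the sorted data. Python's statistics.quantiles uses that convention when method="exclusive"; other tools may default to different conventions and return slightly different quartiles. A minimal sketch:

```python
import statistics

# Annual percentage returns of the 9 mutual funds in the example.
returns_pct = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

# n=4 asks for the three quartile cut points. method="exclusive" locates the
# p-th quantile at position p * (n + 1) in the sorted data and interpolates
# linearly between neighbouring observations.
q1, q2, q3 = statistics.quantiles(returns_pct, n=4, method="exclusive")

iqr = q3 - q1
print(q1, q2, q3, iqr)
```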

Standard Deviation

To define standard deviation, you need to define another term called variance. In simple terms, standard deviation is the square root of variance.

Example of Standard Deviation

The following data represent the percentage return on investment for 10 mutual funds per annum. Calculate the sample standard deviation.

12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9

Solution for the Example
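The worked solution is not reproduced here. The sketch below assumes the usual sample definitions, variance s² = ∑(X − X̄)² / (n − 1) and standard deviation s = √s², and cross-checks the manual steps against the standard library.

```python
import math
import statistics

# Annual percentage returns of the 10 mutual funds in the example.
returns_pct = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]

n = len(returns_pct)
mean = sum(returns_pct) / n

# Sum of squared deviations from the mean.
sum_sq_dev = sum((x - mean) ** 2 for x in returns_pct)

# Sample variance divides by n - 1; the standard deviation is its square root.
sample_variance = sum_sq_dev / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(round(mean, 2), round(sample_variance, 4), round(sample_sd, 4))

# statistics.stdev uses the same n - 1 definition.
print(round(statistics.stdev(returns_pct), 4))
```

For these data the mean return is 12.28% and the sample standard deviation comes out at roughly 2.5 percentage points.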

Coefficient of Variation (Relative Dispersion)

Coefficient of Variation (CV) is defined as the ratio of Standard Deviation to Mean.

In symbolic form

CV = S / X̄ for the sample data and CV = σ / μ for the population

Coefficient of Variation Example

Consider two salespersons working in the same territory. Their sales performance in selling PCs is given below. Comment on the results.

Sales Person 1

Mean sales (one-year average): 50 units. Standard deviation: 5 units.

Sales Person 2

Mean sales (one-year average): 75 units. Standard deviation: 25 units.

Interpretation for the Example

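The CV is 5/50 = 0.10 or 10% for Sales Person 1 and 25/75 = 0.33 or 33% for Sales Person 2. Sales Person 2 sells more on average, but the sales are far more variable relative to that average.

The moral of the story is: don't get carried away by averages. Consider variation ("risk").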
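The comparison can be expressed as a small function. This is a sketch; the function name is illustrative, and the inputs are the summary figures from the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Relative dispersion: standard deviation as a fraction of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return std_dev / mean


cv_person_1 = coefficient_of_variation(std_dev=5, mean=50)
cv_person_2 = coefficient_of_variation(std_dev=25, mean=75)

print(f"Sales Person 1: {cv_person_1:.0%}")
print(f"Sales Person 2: {cv_person_2:.0%}")
```

Because CV is unit-free, it allows dispersion to be compared across groups whose means differ.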

The Empirical Rule

• The empirical rule approximates the variation of data in a bell-shaped distribution.

• Approximately 68% of the data in a bell-shaped distribution lies within one standard deviation of the mean, or μ ± 1σ.

• Approximately 95% of the data in a bell-shaped distribution lies within two standard deviations of the mean, or μ ± 2σ.

• Approximately 99.7% of the data in a bell-shaped distribution lies within three standard deviations of the mean, or μ ± 3σ.
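The three percentages can be recovered from the normal distribution itself. The sketch below assumes an exactly normal distribution and uses the standard library's NormalDist; real bell-shaped data will only approximate these figures.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

# Probability mass within k standard deviations of the mean, for k = 1, 2, 3.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, p in coverage.items():
    print(f"within {k} standard deviation(s): {p:.2%}")
```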

Chebyshev Rule
Regardless of how the data are distributed, at least (1 − 1/k²) × 100% of the values will fall within k standard deviations of the mean (for k > 1).

For example, when k = 2, at least 75% of the values of any data set will be within μ ± 2σ.
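The bound can be checked against real data. The sketch below reuses the hose clamping torque values from the histogram example and treats them as a complete population (an assumption, so that μ and σ are computed with the population formulas).

```python
import statistics

# Hose clamping torque (foot-pounds) from the histogram example.
torque = [
    8, 13, 15, 10, 16,
    11, 14, 11, 14, 20,
    15, 16, 12, 15, 13,
    12, 13, 16, 17, 17,
    14, 14, 14, 18, 15,
]


def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2


mu = statistics.mean(torque)
sigma = statistics.pstdev(torque)  # population standard deviation

k = 2
inside = [x for x in torque if abs(x - mu) <= k * sigma]
observed_fraction = len(inside) / len(torque)

print(chebyshev_lower_bound(k))  # guaranteed minimum
print(observed_fraction)         # what this data set actually shows
```

The observed fraction is well above the guaranteed minimum, as expected: the Chebyshev rule is a worst-case bound that holds for any distribution.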

The Five Number Summary
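The five-number summary consists of the minimum, Q₁, the median, Q₃, and the maximum. A minimal sketch for the nine mutual fund returns from the IQR example follows; it assumes the same (n + 1)-based quartile positions used there, which Python exposes as method="exclusive".

```python
import statistics

# Annual percentage returns of the 9 mutual funds from the IQR example.
returns_pct = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

# Quartile cut points using (n + 1)-based positions with linear interpolation.
q1, median, q3 = statistics.quantiles(returns_pct, n=4, method="exclusive")

five_number_summary = {
    "min": min(returns_pct),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "max": max(returns_pct),
}

for name, value in five_number_summary.items():
    print(f"{name:>6}: {value}")
```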

Distribution Shape

Graphic Displays of Basic Statistical Descriptions

Boxplot: graphic display of five-number summary

Histogram: x-axis shows the values, y-axis represents the frequencies

Quantile plot: each value x_i is paired with f_i indicating that approximately 100 f_i % of data are ≤ x_i

Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another

Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
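The (f_i, x_i) pairs behind a quantile plot can be computed as in the sketch below. The plotting position f_i = (i − 0.5)/n is an assumption: it is one common convention, and the text above does not say which one is intended. The data are the exam marks from the median example.

```python
# Marks of the 7 students from the median example.
marks = [45, 40, 60, 80, 90, 65, 55]

ranked = sorted(marks)
n = len(ranked)

# Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention),
# so that roughly 100 * f_i percent of the data are <= x_i.
quantile_pairs = [((i - 0.5) / n, x) for i, x in enumerate(ranked, start=1)]

for f, x in quantile_pairs:
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x against f gives the quantile plot; the point at f = 0.5 is the median.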

Histograms Often Tell More than

Data Science

Tuesday, April 21, 2020