Statistical Learning
Measures of central tendency, dispersion, and correlation
Outline
- Raw Data
- Frequency Distribution - Histograms
- Cumulative Frequency Distribution
- Measures of Central Tendency
- Mean, Median, Mode
- Measures of Dispersion
- Range, IQR, Standard Deviation, coefficient of variation
- Normal distribution, Chebyshev Rule.
- Five-number summary, boxplots, Q-Q plots, quantile plot, scatter plot.
- Visualization: scatter plot matrix.
- Correlation analysis
Data versus Information
When analysts are bewildered by a plethora of data that makes no sense on the surface, they look for methods of classifying the data that convey meaning. The idea is to help them draw the right conclusions. Data needs to be arranged into information.
Raw Data
A frequency distribution classifies raw data into information. It is a widely used data reduction technique in descriptive statistics.
A histogram is a graphical display of a frequency distribution. The X-axis represents the classes and the Y-axis represents the frequencies, drawn as bars.
A histogram depicts the pattern of the distribution emerging from the characteristic being measured.
Histogram- Example
The inspection records of a hose assembly operation revealed a high level of rejection. An analysis of the records showed that leaks were a major contributing factor to the problem, so it was decided to investigate the hose clamping operation. The hose clamping force (torque) was measured on twenty-five assemblies. The data, in foot-pounds, are given below. Draw the frequency histogram and comment.
8 | 13 | 15 | 10 | 16
11 | 14 | 11 | 14 | 20
15 | 16 | 12 | 15 | 13
12 | 13 | 16 | 17 | 17
14 | 14 | 14 | 18 | 15
Histogram Example Solution
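The solution figure is not reproduced here. As a rough sketch, the frequency table behind such a histogram can be built as follows. The class width of 2 starting at 8 is an assumption, since the source does not state the classes, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Hose clamping torque readings (foot-pounds) from the example.
torque = [8, 13, 15, 10, 16, 11, 14, 11, 14, 20, 15, 16, 12,
          15, 13, 12, 13, 16, 17, 17, 14, 14, 14, 18, 15]

def frequency_table(values, start, width):
    """Count observations per class [lower, lower + width)."""
    counts = Counter((v - start) // width for v in values)
    top = max(counts)  # index of the highest occupied class
    return [(start + i * width, start + (i + 1) * width, counts.get(i, 0))
            for i in range(top + 1)]

# Assumed class layout: width 2, first class starting at 8.
table = frequency_table(torque, start=8, width=2)
for lower, upper, freq in table:
    # Text histogram: one '#' per observation in the class.
    print(f"{lower:2d}-{upper:<2d} | {'#' * freq} ({freq})")
```

Under this assumed layout the frequencies peak in the 14-16 class and fall away on both sides, with single readings in the lowest and highest classes.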
Cumulative Frequency
Distribution
A cumulative frequency distribution is a type of frequency distribution that shows how many observations are above or below the lower boundaries of the classes. A cumulative frequency distribution can be formulated from the previous example of hose clamping force (torque).
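The cumulative table for the torque data can be sketched as below. The class lower boundaries (8, 10, ..., 22) are an assumption, since the source does not list them.

```python
# Hose clamping torque readings (foot-pounds) from the histogram example.
torque = [8, 13, 15, 10, 16, 11, 14, 11, 14, 20, 15, 16, 12,
          15, 13, 12, 13, 16, 17, 17, 14, 14, 14, 18, 15]

# Assumed class lower boundaries: width 2, starting at 8.
lower_bounds = list(range(8, 24, 2))

# "Less than" cumulative frequency: observations below each lower boundary.
less_than = [sum(v < b for v in torque) for b in lower_bounds]

# "Or more" cumulative frequency: observations at or above each lower boundary.
or_more = [len(torque) - c for c in less_than]

for b, below, above in zip(lower_bounds, less_than, or_more):
    print(f"boundary {b:2d}: {below:2d} below, {above:2d} at or above")
```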
What is Central Tendency?
Whenever you measure things of the same kind, a fairly large number of such measurements will tend to cluster around the middle value. Such a value is called a measure of "Central Tendency". Other terms used synonymously are "Measures of Location" and "Statistical Averages".
Arithmetic Mean
The arithmetic mean (usually just called the mean) is defined as the sum of all observations in a data set divided by the total number of observations.
In symbolic form, the mean is given by
$\bar{X} = \sum X / n$
where
$\bar{X}$ = arithmetic mean
$\sum X$ = sum of all X values in the data set
$n$ = total number of observations (sample size)
Arithmetic Mean -Example
Five sample measurements of the inner diameter of a particular grade of tire are as follows (figures in millimeters):
565, 570, 572, 568, 585
Using the formula $\bar{X} = \sum X / n$, we get mean = (565 + 570 + 572 + 568 + 585)/5 = 2860/5 = 572
Caution:
The arithmetic mean is affected by extreme values and by fluctuations in sampling. It is not the best average to use when the data set contains extreme values (very high or very low values).
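A small sketch of this caution, using the tire diameters from the example. The reading of 850 is a hypothetical extreme value introduced only for illustration.

```python
from statistics import mean

# Tire inner diameters (millimeters) from the example.
diameters = [565, 570, 572, 568, 585]
original_mean = mean(diameters)

# Hypothetical data set: the last reading replaced by an extreme value.
with_extreme = [565, 570, 572, 568, 850]
shifted_mean = mean(with_extreme)

print(original_mean)  # mean of the original measurements
print(shifted_mean)   # mean after one extreme reading is introduced
```

With the single extreme reading, the mean moves from 572 to 625, which is larger than four of the five observations.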
Median
The median is the middlemost observation when the data are arranged in ascending order of magnitude. It is such that 50% of the observations lie above it and 50% lie below it.
The median is a very useful measure for ranked data, for example consumer preferences and ratings. It is not affected by extreme values (it has greater resistance to outliers).
Median = $(n+1)/2$ th value of the ranked data
$n$ = number of observations in the sample
Median- Example
Marks obtained by 7 students in a Computer Science exam are given below. Compute the median.
45 40 60 80 90 65 55
Arranging the data in ascending order gives
40 45 55 60 65 80 90
Median = (n+1)/2 th value in this set = (7+1)/2 th observation = 4th observation = 60
Hence the median is 60 for this problem.
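The ranking procedure can be sketched in a few lines. `median_by_rank` is a hypothetical helper name, and averaging the two middle values for an even number of observations is a common convention assumed here, not something stated in the text.

```python
def median_by_rank(values):
    """Return the middle value of the ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2 th value (1-based) is index n // 2 (0-based).
        return ranked[mid]
    # Even n (assumed convention): average the two middle values.
    return (ranked[mid - 1] + ranked[mid]) / 2

marks = [45, 40, 60, 80, 90, 65, 55]
print(median_by_rank(marks))
```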
Mode
The mode is the value that occurs most often; it has the maximum frequency of occurrence. The mode is also resistant to outliers.
The mode is a very useful measure when, for example, you want to keep the most popular shirt collar size in inventory during the festive season.
Mode -Example
The lives, in hours, of 10 flashlight batteries are as follows. Find the mode.
340 | 350 | 340 | 340 | 320 | 340 | 330 | 330 | 340 | 350
340 occurs five times, more often than any other value. Hence, mode = 340.
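The same count can be reproduced with a frequency tally. Returning a list of modes is a design choice made here so that multi-modal data are handled as well.

```python
from collections import Counter

# Battery lives (hours) from the example.
battery_life = [340, 350, 340, 340, 320, 340, 330, 330, 340, 350]

counts = Counter(battery_life)
top = max(counts.values())  # highest frequency of occurrence
modes = [value for value, c in counts.items() if c == top]

print(modes, top)
```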
Comparison of Mean, Median, Mode
Mean | Median | Mode
Defined as the arithmetic average of all observations in the data set. | Defined as the middle value in the data set arranged in ascending or descending order. | Defined as the most frequently occurring value in the distribution; it has the largest frequency.
Requires measurement on all observations. | Does not require measurement on all observations. | Does not require measurement on all observations.
Uniquely and comprehensively defined. | Cannot be uniquely determined under all conditions. | Not uniquely defined for multi-modal situations.
Comparison of Mean, Median, Mode Cont.
Mean | Median | Mode
Affected by extreme values. | Not affected by extreme values. | Not affected by extreme values.
Can be treated algebraically: means of several groups can be combined. | Cannot be treated algebraically: medians of several groups cannot be combined. | Cannot be treated algebraically: modes of several groups cannot be combined.
Measures of Dispersion
In simple terms, measures of dispersion indicate how large the spread of the distribution is around the central tendency. They answer the question "What is the magnitude of departure from the average value for different groups having identical averages?".
Range
Range is the simplest of all measures of dispersion. It is calculated as the difference between the maximum and minimum values in the data set.
Range = $X_{\text{Maximum}} - X_{\text{Minimum}}$
Range-Example
The following data represent the percentage return on investment per annum for 10 mutual funds. Calculate the range.
12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9
Range = $X_{\text{Maximum}} - X_{\text{Minimum}}$ = 18 - 9 = 9
Caution:
If either component of the range, the maximum value or the minimum value, is an extreme value, then the range should not be used.
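A short sketch of the range calculation and of the caution above. The extra return of 45 is a hypothetical extreme value added only for illustration.

```python
# Annual percentage returns of the 10 mutual funds from the example.
returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]
data_range = max(returns) - min(returns)

# Hypothetical data set with one extreme return appended.
with_extreme = returns + [45]
extreme_range = max(with_extreme) - min(with_extreme)

print(data_range, extreme_range)
```

One extreme observation raises the range from 9 to 36 even though the other ten values are unchanged.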
Inter-Quartile Range(IQR)
The IQR is the range computed on the middle 50% of the observations, after eliminating the highest 25% and the lowest 25% of observations in a data set arranged in ascending order. The IQR is less affected by outliers than the range.
IQR = $Q_3 - Q_1$
Interquartile Range-Example
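The following data represent the annual percentage returns of 9 mutual funds.
Data set: 12, 14, 11, 18, 10.5, 12, 14, 11, 9
Arranged in ascending order, the data set becomes 9, 10.5, 11, 11, 12, 12, 14, 14, 18
Taking quartile positions at (n+1)/4 = 2.5 and 3(n+1)/4 = 7.5, Q1 lies midway between the 2nd and 3rd values (10.75) and Q3 midway between the 7th and 8th values (14).
IQR = $Q_3 - Q_1$ = 14 - 10.75 = 3.25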
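The quartiles in this example match the (n+1)-position convention, which Python's `statistics.quantiles` calls the "exclusive" method. Other quartile conventions exist and can give slightly different values, so the choice of method here is an assumption made to match the worked figures.

```python
from statistics import quantiles

# Annual percentage returns of the 9 mutual funds from the example.
returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

# n=4 asks for quartile cut points; "exclusive" uses (n+1)-based positions.
q1, q2, q3 = quantiles(returns, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q3, iqr)
```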
Standard Deviation
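To define standard deviation, you first need to define another term called variance. The sample variance is the sum of squared deviations from the mean divided by n - 1: $S^2 = \sum (X - \bar{X})^2 / (n - 1)$. In simple terms, the standard deviation is the square root of the variance: $S = \sqrt{S^2}$.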
Example of Standard Deviation
The following data represent the percentage return on investment per annum for 10 mutual funds. Calculate the sample standard deviation.
12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9
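The source does not show the worked solution. A step-by-step sketch, assuming the usual sample formula with an n - 1 divisor, could look like this:

```python
from math import sqrt

# Annual percentage returns of the 10 mutual funds.
returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]

n = len(returns)
x_bar = sum(returns) / n                              # sample mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in returns)   # squared deviations
variance = sum_sq_dev / (n - 1)                       # sample variance
std_dev = sqrt(variance)                              # sample standard deviation

print(round(x_bar, 2), round(variance, 4), round(std_dev, 4))
```

For these data the mean is 12.28 and the sample standard deviation comes out at roughly 2.52 percentage points.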
Coefficient of Variation (CV)
The coefficient of variation is defined as the ratio of the standard deviation to the mean.
In symbolic form,
$CV = S / \bar{X}$ for sample data and $CV = \sigma / \mu$ for the population.
Consider two salespersons working in the same territory. Their performance in selling PCs is given below. Comment on the results.
Salesperson 1: mean sales (one-year average) 50 units, standard deviation 5 units
Salesperson 2: mean sales (one-year average) 75 units, standard deviation 25 units
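Interpretation for the Example
The CV is 5/50 = 0.10 or 10% for Salesperson 1 and 25/75 = 0.33 or 33% for Salesperson 2. Salesperson 2 sells more on average, but Salesperson 1 is the more consistent performer relative to mean sales.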
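The comparison can be reproduced with a one-line helper. `coefficient_of_variation` is a hypothetical function name used only for this sketch.

```python
def coefficient_of_variation(std_dev, mean):
    """Ratio of standard deviation to mean (unitless)."""
    return std_dev / mean

cv_person_1 = coefficient_of_variation(5, 50)
cv_person_2 = coefficient_of_variation(25, 75)

print(f"{cv_person_1:.0%} vs {cv_person_2:.0%}")
```

Because the CV is unitless, it allows the spread of the two salespersons to be compared even though their means differ.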
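The Empirical Rule
• The empirical rule approximates the variation of data in a bell-shaped distribution.
• Approximately 68% of the data in a bell-shaped distribution is within 1 standard deviation of the mean, or μ ± 1σ.
• Approximately 95% of the data is within 2 standard deviations of the mean, or μ ± 2σ.
• Approximately 99.7% of the data is within 3 standard deviations of the mean, or μ ± 3σ.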
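For an exactly normal distribution, the share of data within 1, 2, and 3 standard deviations of the mean can be computed from the cumulative distribution function. This sketch uses the standard normal (mean 0, standard deviation 1); real data that are only roughly bell-shaped will match these figures only approximately.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Probability mass between -k and +k standard deviations for k = 1, 2, 3.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```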
Chebyshev Rule
Regardless of how the data are distributed, at least $(1 - 1/k^2) \times 100\%$ of the values will fall within k standard deviations of the mean (for k > 1).
For example, when k = 2, at least 75% of the values of any distribution fall within 2 standard deviations of the mean.
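As a rough check, the Chebyshev bound can be compared with the share actually observed in a data set. The sketch below reuses the hose clamping torque readings and treats them as a complete population (hence `pstdev`), which is an assumption made for illustration; `chebyshev_lower_bound` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2

# Hose clamping torque readings (foot-pounds).
torque = [8, 13, 15, 10, 16, 11, 14, 11, 14, 20, 15, 16, 12,
          15, 13, 12, 13, 16, 17, 17, 14, 14, 14, 18, 15]

mu, sigma = mean(torque), pstdev(torque)
k = 2
# Observed fraction of readings within k standard deviations of the mean.
within = sum(abs(x - mu) <= k * sigma for x in torque) / len(torque)

print(chebyshev_lower_bound(k), within)
```

The observed share is well above the guaranteed minimum, which is expected: the Chebyshev bound holds for any distribution and is therefore conservative.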
Distribution Shape
Graphic Displays of Basic Statistical Descriptions
Boxplot: graphic display of the five-number summary
Histogram: the x-axis shows the values (classes), the y-axis represents the frequencies
Quantile plot: each value $x_i$ is paired with $f_i$, indicating that approximately $100 f_i$% of the data are ≤ $x_i$
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane
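The pairing used in a quantile plot can be sketched as follows. The plotting position f_i = (i - 0.5)/n is one common convention and is an assumption here, since the text does not say how f_i is computed; `quantile_plot_points` is a hypothetical helper name.

```python
def quantile_plot_points(values):
    """Pair each ranked value x_i with its plotting position f_i."""
    ranked = sorted(values)
    n = len(ranked)
    # Assumed convention: f_i = (i - 0.5) / n for the i-th ranked value.
    return [((i - 0.5) / n, x) for i, x in enumerate(ranked, start=1)]

# Annual percentage returns of 9 mutual funds (from the IQR example).
returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]
points = quantile_plot_points(returns)

for f, x in points:
    print(f"f = {f:.3f}, x = {x}")
```

Plotting x against f gives the quantile plot. The middle point (f = 0.5) corresponds to the median of the data.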
Histograms Often Tell More than