Tuesday, April 21, 2020

Statistical Learning Measures of central tendency, dispersion, and correlation




Outline

  1. Raw Data
  2. Frequency Distribution - Histograms
  3. Cumulative Frequency Distribution
  4. Measures of Central Tendency
  5. Mean, Median, Mode
  6. Measures of Dispersion
  7. Range, IQR, Standard Deviation, Coefficient of Variation
  8. Normal distribution, Chebyshev Rule
  9. Five number summary, boxplots, QQ plots, quantile plot, scatter plot
  10. Visualization: scatter plot matrix
  11. Correlation analysis


Data versus Information

When analysts are bewildered by a plethora of data that makes no sense on the surface, they look for methods to classify the data so that it conveys meaning. The idea is to help them draw the right conclusions: data needs to be arranged into information.


Raw Data
Raw data represent numbers and facts in the original format in which they were collected. We need to convert raw data into information for decision making.
A frequency distribution classifies raw data into information. It is a widely used data reduction technique in descriptive statistics.


Histogram
A histogram (also known as a frequency histogram) is a snapshot of the frequency distribution.
A histogram is a graphical representation of the frequency distribution in which the X-axis represents the classes and the Y-axis represents the frequencies, drawn as bars.

A histogram depicts the pattern of the distribution emerging from the characteristic being measured.

Histogram - Example

The inspection records of a hose assembly operation revealed a high level of rejection. An analysis of the records showed that leaks were a major contributing factor to the problem, so it was decided to investigate the hose clamping operation. The hose clamping force (torque) was measured on twenty-five assemblies (figures in foot-pounds). The data are given below. Draw the frequency histogram and comment.

 8  13  15  10  16
11  14  11  14  20
15  16  12  15  13
12  13  16  17  17
14  14  14  18  15

Histogram Example Solution
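
A minimal sketch of how the frequency distribution behind the histogram could be tabulated. The class width of 2 foot-pounds starting at 8 is an assumption made here for illustration (the post does not state the classes used), and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Hose clamping torque (foot-pounds) for the 25 assemblies in the example.
torque = [8, 13, 15, 10, 16, 11, 14, 11, 14, 20, 15, 16, 12,
          15, 13, 12, 13, 16, 17, 17, 14, 14, 14, 18, 15]


def frequency_table(values, start, width):
    """Count observations in each class [lower, lower + width)."""
    counts = Counter((v - start) // width for v in values)
    n_classes = max(counts) + 1
    return [(start + i * width, start + (i + 1) * width, counts.get(i, 0))
            for i in range(n_classes)]


# Assumed classes: 8 to under 10, 10 to under 12, ..., 20 to under 22.
table = frequency_table(torque, start=8, width=2)

# Print a text histogram: one row per class, one star per observation.
for lower, upper, freq in table:
    print(f"{lower:>2} to under {upper:<2} | {'*' * freq} ({freq})")
```

With these assumed classes, the tallest bar is the 14 to under 16 class, and the frequencies fall away on both sides of it, which suggests a roughly bell-shaped distribution of clamping torque.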



Cumulative Frequency Distribution
A cumulative frequency distribution is a type of frequency distribution that shows how many observations are above or below the lower boundaries of the classes. The hose clamping force (torque) data from the previous example can be summarized in this way.
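
As a rough illustration, the cumulative counts for the torque data can be computed as below. The class boundaries (every 2 foot-pounds from 8 to 22) are an assumption made for this sketch, not values taken from the post.

```python
# Hose clamping torque (foot-pounds) for the 25 assemblies in the example.
torque = [8, 13, 15, 10, 16, 11, 14, 11, 14, 20, 15, 16, 12,
          15, 13, 12, 13, 16, 17, 17, 14, 14, 14, 18, 15]

# Assumed class boundaries: 8, 10, 12, ..., 22.
boundaries = list(range(8, 23, 2))

# "Less than" cumulative frequency: observations below each boundary.
below = [sum(v < b for v in torque) for b in boundaries]

# "Or more" cumulative frequency: observations at or above each boundary.
at_or_above = [len(torque) - count for count in below]

for b, lo, hi in zip(boundaries, below, at_or_above):
    print(f"boundary {b:>2}: {lo:>2} below, {hi:>2} at or above")
```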


What is Central Tendency?



Whenever you measure things of the same kind, a fairly large number of the measurements tend to cluster around a middle value. Such a value is called a measure of "central tendency". Other terms used synonymously are "measures of location" and "statistical averages".



Arithmetic Mean

The arithmetic mean (usually just called the mean) is defined as the sum of all observations in a data set divided by the total number of observations. For example, consider a data set containing the following observations:

In symbolic form, the mean is given by

$\bar{X} = \frac{\sum X}{n}$

where
$\bar{X}$ = arithmetic mean
$\sum X$ = sum of all X values in the data set
$n$ = total number of observations (sample size)


Arithmetic Mean - Example
The inner diameters of a particular grade of tire, based on 5 sample measurements, are as follows (figures in millimeters):
565, 570, 572, 568, 585

Using the formula $\bar{X} = \sum X / n$, we get

mean = (565 + 570 + 572 + 568 + 585)/5 = 2860/5 = 572

Caution: The arithmetic mean is affected by extreme values and by fluctuations in sampling. It is not the best average to use when the data set contains extreme values (very high or very low values).
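
The calculation and the caution can both be sketched in a few lines. The second data set, in which 585 is replaced by 5850, is a made-up extreme value used only to show how far a single observation can pull the mean.

```python
from statistics import mean

# Inner diameters (mm) from the example.
diameters = [565, 570, 572, 568, 585]
mean_diameter = mean(diameters)
print(mean_diameter)  # 572

# Hypothetical data set with one extreme value (not from the example).
with_extreme_value = [565, 570, 572, 568, 5850]
mean_with_extreme = mean(with_extreme_value)
print(mean_with_extreme)  # 1625
```

Four of the five values are unchanged, yet the mean moves from 572 to 1625, which no longer describes any of the observations well.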

Median
The median is the middlemost observation when the data are arranged in ascending order of magnitude. It is such that 50% of the observations are above the median and 50% are below it.

The median is a very useful measure for ranked data, for example in the context of consumer preferences and ratings. It is not affected by extreme values (it has greater resistance to outliers).

Median = $\frac{n+1}{2}$th value of the ranked data

where n = number of observations in the sample

Median - Example

Marks obtained by 7 students in a Computer Science exam are given below. Compute the median.
45, 40, 60, 80, 90, 65, 55
Ranking the data (here in descending order) gives
90, 80, 65, 60, 55, 45, 40
Median = (n+1)/2 th value in this set = (7+1)/2 th observation = 4th observation = 60. Hence the median is 60 for this problem.
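
A small sketch of the ranking rule. `median_by_rank` is a hypothetical helper; its handling of an even number of observations (averaging the two middle values) is the usual convention and an assumption here, since the example only covers an odd sample size.

```python
from statistics import median


def median_by_rank(values):
    """Return the ((n + 1) / 2)th value of the ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: zero-based index n // 2 is the ((n + 1) / 2)th value.
        return ranked[middle]
    # Even n (assumed convention): average the two middle values.
    return (ranked[middle - 1] + ranked[middle]) / 2


marks = [45, 40, 60, 80, 90, 65, 55]
marks_median = median_by_rank(marks)
print(marks_median)   # 60
print(median(marks))  # the standard library gives the same value
```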


Mode

The mode is the value that occurs most often; it has the maximum frequency of occurrence. The mode is also resistant to outliers.

The mode is a very useful measure when, for example, you want to keep the most popular shirt collar size in inventory during the festive season.


Mode - Example
The life, in hours, of 10 flashlight batteries is as follows. Find the mode.
340, 350, 340, 340, 320, 340, 330, 330, 340, 350

340 occurs five times. Hence, mode = 340.
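
The count can be checked with a frequency tally, as in this short sketch using the Python standard library.

```python
from collections import Counter
from statistics import multimode

# Battery life in hours from the example.
battery_life = [340, 350, 340, 340, 320, 340, 330, 330, 340, 350]

# Tally how often each value occurs and pick the most frequent one.
counts = Counter(battery_life)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 340 5

# multimode returns every value tied for the highest frequency,
# which matters when a data set has more than one mode.
modes = multimode(battery_life)
print(modes)  # [340]
```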

Comparison of Mean, Median, Mode

Mean | Median | Mode
Defined as the arithmetic average of all observations in the data set. | Defined as the middle value in the data set arranged in ascending or descending order. | Defined as the most frequently occurring value in the distribution; it has the largest frequency.
Requires measurement on all observations. | Does not require measurement on all observations. | Does not require measurement on all observations.
Uniquely and comprehensively defined. | Cannot be uniquely determined under all conditions. | Not uniquely defined for multi-modal situations.



Comparison of Mean, Median, Mode (Cont.)

Mean | Median | Mode
Affected by extreme values. | Not affected by extreme values. | Not affected by extreme values.
Can be treated algebraically; that is, means of several groups can be combined. | Cannot be treated algebraically; that is, medians of several groups cannot be combined. | Cannot be treated algebraically; that is, modes of several groups cannot be combined.



Measures of Dispersion



In simple terms, measures of dispersion indicate how large the spread of the distribution is around the central tendency. They answer the question "What is the magnitude of departure from the average value for different groups having identical averages?"


Range
Range is the simplest of all measures of dispersion. It is calculated as the difference between the maximum and minimum values in the data set.

$\text{Range} = X_{\text{Maximum}} - X_{\text{Minimum}}$

Range - Example

The following data represent the annual percentage return on investment for 10 mutual funds. Calculate the range.

12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9

$\text{Range} = X_{\text{Maximum}} - X_{\text{Minimum}} = 18 - 9 = 9$

Caution: If either component of the range, the maximum value or the minimum value, is an extreme value, then the range should not be used.
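
The same calculation as a short sketch:

```python
# Annual percentage returns for the 10 mutual funds in the example.
returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]

# Range = maximum value minus minimum value.
value_range = max(returns) - min(returns)
print(value_range)  # 9
```

Because only the two end values enter the calculation, a single unusual fund return would change the range completely, which is the reason for the caution above.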

Inter-Quartile Range (IQR)

The IQR is the range computed on the middle 50% of the observations, after eliminating the highest 25% and the lowest 25% of the observations in a data set arranged in ascending order. The IQR is less affected by outliers than the range.

$\text{IQR} = Q_3 - Q_1$

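Interquartile Range - Example
The following data represent the annual percentage returns of 9 mutual funds.
Data set: 12, 14, 11, 18, 10.5, 12, 14, 11, 9
Arranged in ascending order, the data set becomes 9, 10.5, 11, 11, 12, 12, 14, 14, 18

With n = 9, Q1 is the (n+1)/4 = 2.5th value, (10.5 + 11)/2 = 10.75, and Q3 is the 3(n+1)/4 = 7.5th value, (14 + 14)/2 = 14.

IQR = Q3 - Q1 = 14 - 10.75 = 3.25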
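
A sketch of the same computation with the Python standard library. Quartile conventions differ between textbooks and software; `statistics.quantiles` with `method="exclusive"` uses positions based on n + 1, which reproduces the figures in this example. Other methods can give slightly different quartiles for the same data.

```python
from statistics import quantiles

# Annual percentage returns of the 9 mutual funds in the example.
returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

# n=4 splits the data into quartiles and returns the three cut points.
q1, q2, q3 = quantiles(returns, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q3, iqr)  # 10.75 14.0 3.25
```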

 Standard Deviation

To define standard deviation, you first need to define another term: variance. In simple terms, the standard deviation is the square root of the variance.
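
For reference, the usual textbook definitions are sketched below, using the same symbols as the rest of this post ($\bar{X}$ and $n$ for a sample, $\mu$ and $\sigma$ for a population). The population size $N$ is a symbol introduced here, and the $n - 1$ denominator is the standard convention for the sample variance.

```latex
% Sample variance and sample standard deviation
S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}, \qquad S = \sqrt{S^2}

% Population variance and population standard deviation
\sigma^2 = \frac{\sum (X - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2}
```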





 Example of Standard Deviation

The following data represent the percentage return on investment for 10 mutual funds per annum. Calculate the sample standard deviation.
 12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9


Solution for the Example
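
One way to work through the solution, step by step, assuming the usual sample formula with n - 1 in the denominator. The library call at the end is only a cross-check.

```python
from math import sqrt
from statistics import stdev

# Annual percentage returns for the 10 mutual funds in the example.
returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]

n = len(returns)
mean_return = sum(returns) / n  # 122.8 / 10, about 12.28

# Sum of squared deviations from the mean, divided by n - 1.
squared_deviations = [(x - mean_return) ** 2 for x in returns]
sample_variance = sum(squared_deviations) / (n - 1)

# The standard deviation is the square root of the variance.
sample_sd = sqrt(sample_variance)
print(round(sample_sd, 2))  # 2.52

# Cross-check against the standard library's sample standard deviation.
library_sd = stdev(returns)
```
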

Coefficient of Variation (Relative Dispersion)

The Coefficient of Variation (CV) is defined as the ratio of the standard deviation to the mean. In symbolic form,

$CV = S / \bar{X}$ for the sample data, and $CV = \sigma / \mu$ for the population.


Coefficient of Variation - Example

Consider two salespersons working in the same territory. Their performance in selling PCs is given below. Comment on the results.

Salesperson 1
Mean sales (one-year average): 50 units; standard deviation: 5 units

Salesperson 2
Mean sales (one-year average): 75 units; standard deviation: 25 units


Interpretation for the Example

The CV is 5/50 = 0.10 or 10% for Salesperson 1 and 25/75 = 0.33 or 33% for Salesperson 2. Salesperson 2 sells more on average, but with much greater variability relative to that average.

The moral of the story is: don't get carried away by averages. Consider variation ("risk") as well.
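
The comparison as a small sketch; `coefficient_of_variation` is a hypothetical helper name.

```python
def coefficient_of_variation(std_dev, mean):
    """Ratio of standard deviation to mean (relative dispersion)."""
    return std_dev / mean


# Figures from the example: (standard deviation, mean) in units sold.
cv_person_1 = coefficient_of_variation(5, 50)
cv_person_2 = coefficient_of_variation(25, 75)

print(f"Salesperson 1: {cv_person_1:.0%}")  # 10%
print(f"Salesperson 2: {cv_person_2:.0%}")  # 33%
```

Because the CV is unit-free, it allows a like-for-like comparison of variability even when the two means differ.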

The Empirical Rule

The empirical rule approximates the variation of data in a bell-shaped distribution:

  • Approximately 68% of the data in a bell-shaped distribution lies within one standard deviation of the mean, or μ ± 1σ
  • Approximately 95% of the data in a bell-shaped distribution lies within two standard deviations of the mean, or μ ± 2σ
  • Approximately 99.7% of the data in a bell-shaped distribution lies within three standard deviations of the mean, or μ ± 3σ
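
These percentages can be checked against an exact normal distribution, as in the sketch below. It assumes a perfectly normal shape; real bell-shaped data will only match approximately.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Share of a normal distribution within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```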




Chebyshev Rule
Regardless of how the data are distributed, at least $(1 - 1/k^2) \times 100\%$ of the values will fall within k standard deviations of the mean (for k > 1).

For example, when k = 2, at least 75% of the values of any data set will be within μ ± 2σ.
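
A sketch of the bound, checked against the 10 mutual fund returns used earlier in this post. `chebyshev_lower_bound` is a hypothetical helper name, and the check treats the 10 returns as the whole data set, so it uses the population standard deviation.

```python
from statistics import mean, pstdev


def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("the Chebyshev rule requires k > 1")
    return 1 - 1 / k ** 2


# Annual percentage returns for the 10 mutual funds used earlier.
returns = [12, 14, 11, 18, 10.5, 11.3, 12, 14, 11, 9]
mu = mean(returns)
sigma = pstdev(returns)

# Observed share of values within mu +/- 2 sigma.
k = 2
inside = [x for x in returns if abs(x - mu) <= k * sigma]
observed_share = len(inside) / len(returns)

print(chebyshev_lower_bound(k))  # 0.75
print(observed_share)            # 0.9
```

The observed share is well above the guaranteed minimum, which is typical: the Chebyshev rule is a conservative bound that holds for any distribution shape.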





The Five Number Summary
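
The five-number summary consists of the minimum, Q1, the median, Q3, and the maximum. A minimal sketch for the 9 mutual fund returns used in the IQR example follows; the quartile method (`"exclusive"`, based on n + 1 positions) is chosen to match that example, and other conventions may give slightly different quartiles.

```python
from statistics import quantiles

# Annual percentage returns of the 9 mutual funds from the IQR example.
returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

# Three quartile cut points: Q1, median (Q2), Q3.
q1, med, q3 = quantiles(returns, n=4, method="exclusive")

five_number_summary = {
    "minimum": min(returns),
    "Q1": q1,
    "median": med,
    "Q3": q3,
    "maximum": max(returns),
}

for name, value in five_number_summary.items():
    print(f"{name:>7}: {value}")
```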


Distribution Shape

Graphic Displays of Basic Statistical Descriptions

  • Boxplot: graphic display of the five-number summary
  • Histogram: the x-axis shows values, the y-axis represents frequencies
  • Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
  • Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
  • Scatter plot: each pair of values is a pair of coordinates and is plotted as a point in the plane
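
As an illustration of the quantile plot, the (f_i, x_i) pairs can be computed as below. The formula f_i = (i - 0.5)/n is one common convention and an assumption here, since the post does not say how f_i is obtained. The data are the 9 mutual fund returns from the IQR example.

```python
# Annual percentage returns of the 9 mutual funds from the IQR example.
returns = [12, 14, 11, 18, 10.5, 12, 14, 11, 9]

ranked = sorted(returns)
n = len(ranked)

# Pair each ranked value x_i with f_i = (i - 0.5) / n (assumed convention),
# so that roughly 100 * f_i percent of the data are <= x_i.
quantile_pairs = [((i - 0.5) / n, x) for i, x in enumerate(ranked, start=1)]

for f, x in quantile_pairs:
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x_i against f_i gives the quantile plot; the point at f = 0.5 is the median.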

Histograms Often Tell More than






