Monday, April 20, 2020

machine learning statistics

·        Why Statistics
·        Statistical Methods
·        Types of Statistics - Descriptive and Inferential Statistics
·        Data Sources and Types of Datasets
·        Attributes of Datasets
Why Is Statistics So Important?
Three significant events triggered the current meteoric growth in the use of analytical decision making, and statistics is central to all of them.

Event 1
Technological developments, the rise of the Internet and social networks, and data generated by mobile phones and other electronic devices produce large amounts of data from which insights have to be sifted.

Discovering patterns and trends in these data helps organizations improve profitability, understand customer expectations, and price their products appropriately so that they can gain a competitive advantage in the marketplace.






Event 2
Advances in computing power to process and analyze massive amounts of data effectively
Sophisticated and faster algorithms for solving problems
Data visualization for Business Intelligence and Artificial Intelligence
Event 3
Large data storage capability
Parallel computing and cloud computing, coupled with better computer hardware, have enabled businesses and other organizations to solve large-scale problems faster than ever before without sacrificing

Big data
A set of data that cannot be managed, processed, or analyzed with traditional software/algorithms within a reasonable amount of time.

Big data revolves around five Vs: Volume, Velocity, Variety, Value, and Veracity.

Walmart handles over one million purchase transactions per hour.

Facebook processes more than 250 million picture uploads per day.

Statistics - Methods

Classification

Classification techniques help segment customers into appropriate groups based on key characteristics.

For example, using an appropriate statistical model, an organization could segment its customers into Long Term Customers, Medium Term Customers, and Brand Switchers.

Another application in this context is classifying customers into “Buyers and Non-Buyers.”

Classification helps professionals understand customer behaviour and position their products and brands using appropriate strategies.
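As a rough illustration of the Buyers vs. Non-Buyers idea, here is a minimal sketch of a nearest-centroid classifier in plain Python. The post does not name a specific model, so this technique is only a stand-in; the features (age, store visits per month), the data, and the function names `fit_centroids` and `predict` are all made up for the example.

```python
import math
from statistics import fmean

def fit_centroids(rows, labels):
    """Average the feature vectors of each class to get one centroid per label."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {
        label: tuple(fmean(column) for column in zip(*members))
        for label, members in groups.items()
    }

def predict(centroids, row):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row))

# Hypothetical training data: (age, store visits per month) for known customers.
rows = [(25, 8), (30, 10), (28, 9), (50, 1), (45, 2), (55, 1)]
labels = ["Buyer", "Buyer", "Buyer", "Non-Buyer", "Non-Buyer", "Non-Buyer"]

centroids = fit_centroids(rows, labels)
print(predict(centroids, (27, 7)))  # a young, frequent visitor
print(predict(centroids, (52, 2)))  # an older, infrequent visitor
```

In practice the features would need scaling and the model would be validated on held-out data; the sketch only shows the shape of the task: labeled examples in, a group label out.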




Pattern Recognition
“A picture is worth a thousand words”: it reveals hidden patterns in the data that retail professionals can leverage. Pattern recognition techniques include the histogram, box plot, scatter plot, and other visual analytics.
For example, a histogram of the income of a particular class of customers may reveal a symmetrical bell-curve pattern, or it may be left or right skewed.
The relationship between age and expenditure can be captured using a scatter plot.
A box plot enables identification of outliers (extreme points) in addition to showing the distribution pattern.
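The outlier idea behind a box plot can be sketched without drawing anything. The example below assumes the common 1.5 × IQR whisker rule, which the post does not state explicitly, and uses made-up income figures; `iqr_outliers` is a hypothetical helper name.

```python
import statistics

def iqr_outliers(values):
    """Return the points a box plot would flag under the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical monthly incomes (in thousands); 250 is an obvious extreme point.
incomes = [32, 35, 36, 38, 40, 41, 43, 45, 47, 250]
print(iqr_outliers(incomes))  # [250]
```

Different tools compute quartiles slightly differently, so the exact fences can vary between libraries.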


Association

Association analysis helps determine which items go together. Association rules comprise a set of analytics that focuses on discovering relationships among specific objects.

In this context, market basket analysis refers to an association rule that generates the probability of an outcome.

For example, market basket analysis may find that if customers buy coffee, there is a 40% probability that they also buy bread.

Organizations can apply association rules to store layout and sales promotion decisions.
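The coffee-and-bread figure above corresponds to what is usually called the confidence of a rule: the share of baskets containing the first item that also contain the second. A minimal sketch, with invented baskets chosen so the result matches the 40% in the example:

```python
def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_both) / len(with_antecedent)

# Hypothetical baskets: 5 contain coffee, and 2 of those also contain bread.
baskets = [
    {"coffee", "bread"},
    {"coffee", "milk"},
    {"coffee", "bread", "butter"},
    {"tea", "bread"},
    {"coffee"},
    {"coffee", "sugar"},
]
print(confidence(baskets, "coffee", "bread"))  # 0.4
```

Real market basket analysis also looks at how often the items occur at all (support), so that rules are not built on a handful of baskets.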


Predictive Modeling

Predictive models can facilitate both customer segmentation and the identification and targeting of the most profitable customers.

Regression can be used to predict the amount of expenditure on a particular product based on input variables such as income, age, and gender.

Organizations can also leverage more advanced models, such as Logistic Regression and Neural Networks, both for predicting a target variable and for classifying which group a consumer belongs to.

For example, these models can classify and predict buyers and non-buyers, or defaulters and non-defaulters on credit card loans.
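To make the regression idea concrete, here is a minimal sketch of ordinary least squares with a single predictor. It simplifies the setting above by using income only (no age or gender), and the numbers are invented and deliberately lie on a straight line so the result is easy to verify by hand; `fit_line` is a hypothetical helper.

```python
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: income and expenditure on a product (same currency unit).
incomes = [20, 30, 40, 50, 60]
spend = [12, 17, 22, 27, 32]

slope, intercept = fit_line(incomes, spend)
predicted = intercept + slope * 70  # predicted expenditure at an income of 70
print(slope, intercept, predicted)  # 0.5 2.0 37.0
```

With several input variables the same principle applies, but the fit is normally done with a statistics or machine-learning library rather than by hand.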

Classical Definition of Statistics


“By Statistics, we mean methods specially adapted to the elucidation of quantitative data affected to a marked extent by a multiplicity of causes”.
Yule and Kendall

It is interesting to see what Thomas Davenport means by Business Analytics and note the similarities and dissimilarities between the two.

“Business Analytics (BA) can be defined as the broad use of data and quantitative analysis for decision making within organizations”.



Types of Statistics



Descriptive Statistics is concerned with data summarization, graphs/charts, and tables.

Inferential Statistics is the method used to draw conclusions about a population parameter from a sample.
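The descriptive side can be sketched with Python's built-in `statistics` module. The session times below are invented, and which summaries to report is a choice made for the example, not something the post prescribes.

```python
import statistics

# Hypothetical time on site, in minutes, for six visits.
session_minutes = [4, 8, 15, 16, 23, 42]

summary = {
    "mean": statistics.fmean(session_minutes),
    "median": statistics.median(session_minutes),
    "stdev": statistics.stdev(session_minutes),  # sample standard deviation
    "min": min(session_minutes),
    "max": max(session_minutes),
}
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Here the mean (18) sits above the median (15.5) because the single long visit of 42 minutes pulls it upward, which is the kind of skew a histogram would also show.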


Population, Parameter, Sample, Statistic
A Population is the universe of possible data for a specified object. Example: People who have visited or will visit a website.

A Parameter is a numerical value associated with a population. Example: The average amount of time people spend on a website.


A Sample is a selection of observations from a population. Example: People (or IP addresses) who visited a website on a specific day.

A Statistic is a numerical value associated with an observed sample. Example: The average amount of time people spent on a website on a specific day.
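The four terms can be lined up in a small simulation. Everything here is assumed for illustration: the population is synthetic (10,000 visit durations drawn uniformly between 0 and 600 seconds), and in a real setting the population and its parameter would normally be unknown, which is why the sample statistic is used to estimate it.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Population: every visit duration (seconds) in a made-up universe of visits.
population = [rng.uniform(0, 600) for _ in range(10_000)]

# Parameter: a number describing the whole population.
parameter = statistics.fmean(population)

# Sample: the visits observed on one hypothetical day.
sample = rng.sample(population, 200)

# Statistic: the same quantity computed from the sample only.
statistic = statistics.fmean(sample)

print(f"population mean (parameter): {parameter:.1f}")
print(f"sample mean (statistic):     {statistic:.1f}")
```

The two numbers will be close but not identical; quantifying that gap is the job of inferential statistics.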

Data Sources

Primary Data are collected by the organization itself for a particular purpose. The benefits of primary data are that they fit the needs exactly, are up to date, and are reliable.

Secondary Data are collected by other organizations or for other purposes. Any data that are not collected by the organization for the specified purpose are secondary data. These may be published by other organizations or the government, or be available from research studies, the web, social media, and so on.


Types of Data


Qualitative Data are nonnumeric in nature and cannot be measured numerically. Examples are gender, religion, and place of birth.

Quantitative Data are numerical in nature and can be measured. Examples are the balance in your savings bank account and the number of members in your family.

Quantitative data can be classified as discrete or continuous. Discrete data can take only certain values, with gaps between them, such as the number of rooms in a hotel, which cannot be fractional. Continuous data can take any value within a specific interval, such as the production quantity of a particular type of paper (measured in kilograms).

Types of Data sets

        Record
            Relational records
            Data matrix, e.g., numerical matrix, crosstabs
            Document data: text documents: term-
        Graph and network
            World Wide Web
            Social or information networks
            Molecular structures
        Ordered
            Video data: sequence of images
            Temporal data: time-series
            Sequential data: transaction sequences
            Genetic sequence data
        Spatial, image and multimedia
            Spatial data: maps
            Image data
            Video data



Data Objects

        Data sets are made up of data objects.
        A data object represents an entity.
        Examples:
            sales database: customers, store items, sales
            medical database: patients, treatments
            university database: students, professors, courses
        Also called samples, examples, instances, data points, objects, tuples.
        Data objects are described by attributes.
        Database rows -> data objects; columns -> attributes.
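The rows-and-columns mapping can be sketched with a small dataclass, where each instance is a data object and each field is an attribute. The `Customer` class and its records are hypothetical and only illustrate the terminology.

```python
from dataclasses import dataclass, fields

@dataclass
class Customer:
    """One data object (a row); each field is an attribute (a column)."""
    customer_id: int
    name: str
    address: str

# A tiny hypothetical data set: a collection of data objects.
customers = [
    Customer(1, "Asha", "12 Lake Road"),
    Customer(2, "Ravi", "7 Hill Street"),
]

attribute_names = [f.name for f in fields(Customer)]
print(attribute_names)      # ['customer_id', 'name', 'address']
print(len(customers))       # 2 data objects
```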

Attributes

        Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
            E.g., customer_ID, name, address
        Types:
            Nominal
            Binary
            Ordinal
            Numeric: quantitative
                Interval-scaled
                Ratio-scaled

Attribute Types


        Nominal: categories, states, or “names of things”
            Hair_color = {auburn, black, blond, brown, grey, red, white}
            marital status, occupation, ID numbers, zip codes
        Binary
            Nominal attribute with only 2 states (0 and 1)
            Symmetric binary: both outcomes equally important
                e.g., gender
            Asymmetric binary: outcomes not equally important
                e.g., medical test (positive vs. negative)
                Convention: assign 1 to most important outcome (e.g., HIV positive)
        Ordinal
            Values have a meaningful order (ranking) but magnitude between successive values is not known.
            Size = {small, medium, large}, grades, army rankings
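A short sketch of how these attribute types behave differently in code: nominal values can only be counted or compared for equality, asymmetric binary values follow the 1-for-the-important-outcome convention, and ordinal values can be ranked. The sample values and the names `SIZE_ORDER` and `size_rank` are made up for illustration.

```python
from collections import Counter

# Nominal: no order, so counting categories is about all that makes sense.
hair_colors = ["black", "brown", "black", "red"]
color_counts = Counter(hair_colors)

# Asymmetric binary: encode the more important outcome (positive) as 1.
test_results = ["negative", "positive", "negative"]
encoded = [1 if result == "positive" else 0 for result in test_results]

# Ordinal: an explicit ranking lets us sort, but the gaps between ranks
# carry no numeric meaning.
SIZE_ORDER = ["small", "medium", "large"]

def size_rank(size):
    """Position of a size label in the agreed ordering."""
    return SIZE_ORDER.index(size)

sizes = ["large", "small", "medium", "small"]
ordered = sorted(sizes, key=size_rank)

print(color_counts["black"])  # 2
print(encoded)                # [0, 1, 0]
print(ordered)                # ['small', 'small', 'medium', 'large']
```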
