Friday, May 1, 2020

The 5 Feature Selection Algorithms every Data Scientist should know


Data Science is the study of algorithms.
I grapple with many algorithms on a day-to-day basis, so I thought of listing the most common and most used ones in this new DS Algorithm series.
How many times has it happened that you create a lot of features and then need to find ways to reduce their number?
We sometimes end up using correlation or tree-based methods to find out the important features.
Can we add some structure to it?
This post is about some of the most common feature selection techniques one can use while working with data.

Why Feature Selection?

Before we proceed, we need to answer this question: why don't we just give all the features to the ML algorithm and let it decide which ones are important?
So there are three reasons why we don’t:

1. Curse of dimensionality — Overfitting

If we have more columns in the data than the number of rows, we will be able to fit our training data perfectly, but that won’t generalize to the new samples. And thus we learn absolutely nothing.

2. Occam’s Razor:

We want our models to be simple and explainable. We lose explainability when we have a lot of features.

3. Garbage In Garbage out:

Most of the time, we will have many non-informative features, for example Name or ID variables. Poor-quality input produces poor-quality output.
Also, a large number of features makes a model bulky, slow to train, and harder to implement in production.

So What do we do?

We select only useful features.
Fortunately, Scikit-learn has made it pretty easy for us to do feature selection. There are many ways to think about feature selection, but most methods fall into three major buckets:
  • Filter-based: We specify some metric and filter features based on it. An example of such a metric could be correlation or chi-square.
  • Wrapper-based: Wrapper methods consider the selection of a set of features as a search problem. Example: Recursive Feature Elimination
  • Embedded: Embedded methods use algorithms that have built-in feature selection methods. For instance, Lasso and RF have their own feature selection methods.
So, enough of theory; let us start with our five feature selection methods.
We will try to do this using a dataset to understand it better.
I am going to use a football player dataset to find out what makes a good player great.
Don't worry if you don't understand football terminology. I will try to keep it to a minimum.
Here is the Kaggle Kernel with the code to try out yourself.

Some Simple Data Preprocessing

We have done some basic preprocessing, such as removing nulls and one-hot encoding, and converted the problem to a classification problem using:
y = traindf['Overall']>=87
Here we use High Overall as a proxy for a great player.
Our dataset (X) has 223 columns.
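For reference, here is a rough sketch of that preprocessing. The file name, the dropped identifier columns, and num_feats = 30 (the number of features each method will keep) are placeholder choices, not the exact steps of the original kernel:
import pandas as pd

traindf = pd.read_csv("fifa_data.csv")            # hypothetical file name
traindf = traindf.drop(columns=["Name", "ID"])    # drop identifier-like columns (assumed names)
traindf = traindf.dropna()                        # remove rows with nulls
traindf = pd.get_dummies(traindf)                 # one-hot encode the categorical columns

y = traindf['Overall'] >= 87                      # High Overall as a proxy for a great player
X = traindf.drop(columns=['Overall'])

num_feats = 30                                    # how many features each method selects (assumed)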

1. Pearson Correlation

This is a filter-based method.
We check the absolute value of the Pearson’s correlation between the target and numerical features in our dataset. We keep the top n features based on this criterion.
import numpy as np

def cor_selector(X, y, num_feats):
    cor_list = []
    feature_name = X.columns.tolist()
    # calculate the correlation with y for each feature
    for i in X.columns.tolist():
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
    # replace NaN with 0
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    # keep the num_feats features with the largest absolute correlation
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-num_feats:]].columns.tolist()
    # boolean mask: True if a feature is selected, False otherwise
    cor_support = [True if i in cor_feature else False for i in feature_name]
    return cor_support, cor_feature

cor_support, cor_feature = cor_selector(X, y, num_feats)
print(str(len(cor_feature)), 'selected features')

2. Chi-Squared

This is another filter-based method.
In this method, we calculate the chi-squared statistic between the target and each numerical variable and select only the variables with the largest chi-squared values.
Let us create a small example of how we calculate the chi-squared statistic for a sample.
So let's say we have 75 Right-Forwards in our dataset and 25 Non-Right-Forwards. We observe that 40 of the Right-Forwards are good and 35 are not. Does this signify that being a right-forward affects a player's overall performance?
Observed (Expected) counts:
                    Good      Not Good   Total
Right-Forward       40 (45)   35 (30)     75
Not Right-Forward   20 (15)    5 (10)     25
Total               60        40         100
We calculate the chi-squared value:
To do this, we first find the counts we would expect to fall in each bucket if the two categorical variables were indeed independent.
This is simple: we multiply the row sum and the column sum for each cell and divide by the total number of observations.
So, for the (Good, Not-Right-Forward) bucket, the expected value = 25 (row sum) × 60 (column sum) / 100 (total observations) = 15.
Why is this the expected count? Since 25% of the players in the data are not right-forwards, we would expect 25% of the 60 good players we observed to fall in that cell, i.e. 15 players.
Then we just sum over all 4 cells: chi-squared = Σ (Observed − Expected)² / Expected.
I won't show it here, but the chi-squared statistic can also be applied, in a somewhat hand-wavy way, to non-negative numerical as well as categorical features.
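As a quick sanity check, here is a minimal sketch of that calculation in Python for the Right-Forward example above (the counts are the hypothetical ones from the example):
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = (Right-Forward, Not Right-Forward), columns = (Good, Not Good)
observed = np.array([[40, 35],
                     [20,  5]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_stat, p_value)   # chi-squared statistic and its p-value
print(expected)             # expected counts under independence, e.g. 15 for (Not Right-Forward, Good)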
We can get chi-squared features from our dataset as:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
X_norm = MinMaxScaler().fit_transform(X)
chi_selector = SelectKBest(chi2, k=num_feats)
chi_selector.fit(X_norm, y)
chi_support = chi_selector.get_support()
chi_feature = X.loc[:,chi_support].columns.tolist()
print(str(len(chi_feature)), 'selected features')

3. Recursive Feature Elimination

This is a wrapper-based method. As I said before, wrapper methods consider the selection of a set of features as a search problem.
From sklearn Documentation:
The goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
As you would have guessed, we could use any estimator with this method. In this case, we use LogisticRegression, and RFE ranks features using the coef_ attribute of the LogisticRegression object.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
rfe_selector = RFE(estimator=LogisticRegression(), n_features_to_select=num_feats, step=10, verbose=5)
rfe_selector.fit(X_norm, y)
rfe_support = rfe_selector.get_support()
rfe_feature = X.loc[:,rfe_support].columns.tolist()
print(str(len(rfe_feature)), 'selected features')

4. Lasso: SelectFromModel

This is an Embedded method. As said before, Embedded methods use algorithms that have built-in feature selection methods.
For example, Lasso and RF have their own feature selection methods. Lasso Regularizer forces a lot of feature weights to be zero.
Here we use L1-regularized (Lasso-style) logistic regression to select variables.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# the liblinear solver supports the L1 penalty
embeded_lr_selector = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"), max_features=num_feats)
embeded_lr_selector.fit(X_norm, y)

embeded_lr_support = embeded_lr_selector.get_support()
embeded_lr_feature = X.loc[:,embeded_lr_support].columns.tolist()
print(str(len(embeded_lr_feature)), 'selected features')

5. Tree-based: SelectFromModel

This is an Embedded method. As said before, Embedded methods use algorithms that have built-in feature selection methods.
We can also use a RandomForest to select features based on feature importance.
We calculate feature importance using the node impurities in each decision tree. In a Random Forest, the final feature importance is the average of the feature importances across all the decision trees.
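Before wiring this into SelectFromModel, you can peek at these impurity-based importances directly. A minimal sketch (random_state and the top-10 cut-off are arbitrary choices):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# impurity-based importance averaged across all trees, largest first
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.nlargest(10))
SelectFromModel then automates the thresholding on these importances: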
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

embeded_rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100), max_features=num_feats)
embeded_rf_selector.fit(X, y)

embeded_rf_support = embeded_rf_selector.get_support()
embeded_rf_feature = X.loc[:,embeded_rf_support].columns.tolist()
print(str(len(embeded_rf_feature)), 'selected features')
We could also have used a LightGBM or an XGBoost model, as long as it has a feature_importances_ attribute.
from sklearn.feature_selection import SelectFromModel
from lightgbm import LGBMClassifier

lgbc=LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=32, colsample_bytree=0.2,
            reg_alpha=3, reg_lambda=1, min_split_gain=0.01, min_child_weight=40)

embeded_lgb_selector = SelectFromModel(lgbc, max_features=num_feats)
embeded_lgb_selector.fit(X, y)

embeded_lgb_support = embeded_lgb_selector.get_support()
embeded_lgb_feature = X.loc[:,embeded_lgb_support].columns.tolist()
print(str(len(embeded_lgb_feature)), 'selected features')

Bonus

Why use one, when we can have all?
The answer is sometimes it won’t be possible with a lot of data and time crunch.
But whenever possible, why not do this?
# put all the selections together
feature_name = X.columns.tolist()
feature_selection_df = pd.DataFrame({'Feature':feature_name, 'Pearson':cor_support, 'Chi-2':chi_support, 'RFE':rfe_support, 'Logistics':embeded_lr_support,
                                    'Random Forest':embeded_rf_support, 'LightGBM':embeded_lgb_support})
# count how many methods selected each feature
feature_selection_df['Total'] = feature_selection_df.drop('Feature', axis=1).sum(axis=1)
# display the top 100
feature_selection_df = feature_selection_df.sort_values(['Total','Feature'], ascending=False)
feature_selection_df.index = range(1, len(feature_selection_df)+1)
feature_selection_df.head(100)
We check which features were selected by all the methods. In this case, as we can see, Reactions and LongPassing are excellent attributes to have in a highly rated player. And, as expected, Ballcontrol and Finishing occupy the top spots too.

Thursday, April 30, 2020

Gradient Descent- Understand Completely in a Super Easy Way!


Do you want to know what Gradient Descent is and what its role in a neural network is? Give a few minutes to this blog to understand Gradient Descent completely, in a super-easy way. You will understand Gradient Descent in a few minutes, so read the full article :-).
Hello, and welcome!
In this blog, I am going to tell you:
  1. How does Neural Network Learn?
  2. What is Gradient Descent?
In order to understand Gradient Descent, you first need to know how a neural network learns. So first I will explain the whole process of neural network learning.
So, without wasting your time, let's get started.

How does Neural Network Learn?

As I have discussed in my previous articles, in a neural network you only provide the input; you don't need to feed in the features manually. The neural network generates features automatically. That's one reason Deep Learning is so popular.
Suppose you have to distinguish between dogs and cats. To perform this task with a neural network, you just need to code the architecture and then point the neural network at a folder with images of dogs and cats. These images are already categorized. You tell the neural network, "OK, I have given you the images; now go and learn by yourself what a cat and a dog are." So the neural network learns on its own. Once it is trained, you give it a new image of a dog or a cat, and the neural network identifies which one it is.

So, now let's understand how a neural network learns.

Here we have a very basic neural network with one layer, known as a single-layer feedforward neural network or perceptron. The perceptron was invented in 1957 by Frank Rosenblatt. The whole idea was to create something that can learn and adjust itself.
Here y^ is the predicted value and y is the actual value.
So, here we have some input values that are supplied to the perceptron. Then the activation function is applied, and we get an output. This output is y^. The next step is to compare this predicted output with the actual output y.
Suppose we draw both outputs on a graph and see how much they differ from each other.
Now we calculate the cost function. This cost function is based on the squared difference between the actual and predicted output. There are different cost functions you can use, but the most commonly used one is:

cost function = 1/2 (y − y^)²

By calculating the cost function, you can quantify the error in your prediction. Our goal is to minimize the cost function: the lower the cost function, the closer the predicted output is to the actual output.

After comparing the y and y^, we feed this information back into the neural network. So the weight gets updated.

Basically, in a neural network, the only thing we have control over is the weights. We update the weights, and with these new weights a new output is predicted. The cost function is calculated again, and we backpropagate to update the weights once more. This process continues until the predicted output is the same as, or close to, the actual output.
I want to make one thing clear: all of these steps happen for just one row. Suppose you have to predict a student's exam percentage based on how much they study, how much they sleep, and their quiz percentage. So study hours, sleep hours, and quiz percentage are the input values for the neural network, and the whole process of learning happens for one student record, something like this:
Row ID | Study Hrs | Sleep Hrs | Quiz | Exam
1      | 10        | 7         | 80%  | 90%
So the learning process I have described is for that one record. Here Study Hrs, Sleep Hrs, and Quiz are the independent variables, and based on these input variables we have to predict the exam percentage. The 90% is the actual output, y. We feed these independent variables to the neural network, the activation function is applied, and an output y^ is generated. We compare the predicted output with the actual output with the help of the cost function. After that, we backpropagate and adjust the weights. This process continues until the predicted output y^ gets close to y, which is 90% in this case. Every iteration, the weights and y^ change.
So this is the very simple case, shown with the help of one row.
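To make the single-row case concrete, here is a minimal sketch of this loop in plain Python, using the row above. The linear (no-activation) neuron and the learning rate are simplifying assumptions:
import numpy as np

x = np.array([10.0, 7.0, 0.80])   # study hrs, sleep hrs, quiz score (one row)
y = 0.90                          # actual exam result for that row
w = np.zeros(3)                   # the weights we are allowed to adjust
lr = 0.001                        # learning rate (step size)

for step in range(1000):
    y_hat = np.dot(w, x)              # forward pass: predicted output
    cost = 0.5 * (y - y_hat) ** 2     # cost = 1/2 (y - y^)^2
    grad = (y_hat - y) * x            # gradient of the cost w.r.t. the weights
    w = w - lr * grad                 # backpropagate: nudge the weights downhill

print(np.dot(w, x))                   # prediction is now very close to 0.90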

With Multiple Rows-

Now let's see what happens when there are multiple rows. Suppose you have a dataset with several rows, something like this:
Row ID | Study Hrs | Sleep Hrs | Quiz | Exam
1      | 10        | 7         | 80%  | 90%
2      | 12        | 6         | 85%  | 95%
3      | 7         | 10        | 70%  | 60%
4      | 14        | 7         | 90%  | 97%
So, to make it easier to understand, I have just duplicated the same neural network four times.
They are all the same perceptron, which is important to keep in mind. For each row, a y^ is generated and compared with the actual value; for every single row we have an actual value. Based on the differences between all the y^ and y values, we calculate the cost function. This cost function is for the full neural network.
Based on this cost function, we backpropagate and update the weights. One thing to keep in mind is that this is one neural network, not four; the duplication is only to make the picture clearer. So when we update the weights, we update the weights of that one neural network, and the updated weights are the same for all rows. Don't think that the weights are updated separately for each row.
The same process of learning can be performed with 4 rows as well as 400 rows.
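Here is a minimal sketch of the same idea with the four rows from the table above: one shared weight vector, one cost summed over all rows, and one weight update per pass. The learning rate and the linear neuron are again simplifying assumptions:
import numpy as np

# study hrs, sleep hrs, quiz score for the 4 rows above
X_rows = np.array([[10, 7, 0.80],
                   [12, 6, 0.85],
                   [ 7, 10, 0.70],
                   [14, 7, 0.90]], dtype=float)
y_true = np.array([0.90, 0.95, 0.60, 0.97])

w = np.zeros(3)
lr = 0.0005

for epoch in range(2000):
    y_hat = X_rows.dot(w)                        # predictions for every row at once
    cost = 0.5 * np.sum((y_true - y_hat) ** 2)   # one cost for the full network
    grad = X_rows.T.dot(y_hat - y_true)          # gradient summed over all rows
    w = w - lr * grad                            # one shared weight update

print(y_hat)                                     # predictions move toward the actual exam results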

What is Gradient Descent?

Now that we have discussed how a neural network learns, it's time to know what Gradient Descent is.
The cost function plays an important role in the neural network, so we need to optimize it. Gradient Descent is a method for optimizing the cost function. One alternative is the brute-force approach, where we try all the different possible weights and look at them to find the best one. That is fine if you have only a few weights, but as the number of weights (or synapses) increases, you run into the curse of dimensionality.
That's why Gradient Descent is used to optimize the cost function. To understand gradient descent, look at the image below.
Suppose you start from the red point at the top left. From that point, we look at the angle (slope) of our cost function. We are not going to discuss the mathematical equations here; basically, you just need to differentiate and find out what the slope is at that specific point, and whether it is positive or negative. If the slope is negative, as in the image, that means you are going downhill: the right is downhill and the left is uphill.

Suppose you go downhill, and by rolling, the red point ends up somewhere further down the curve.

Again you calculate the slope. This time the slope is positive, meaning the right is uphill and the left is downhill, so you need to go left.
You calculate the slope once more, find you need to move right, and this time you land at the correct place. That's how you find the best weights, the ones that minimize the cost function.
Of course, it doesn't actually move like a ball rolling; in reality it is a zigzag, step-by-step kind of approach. But it is easier to remember and more fun to picture it as a ball rolling :-).
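Here is a tiny sketch of that ball-rolling idea on a one-dimensional cost curve. The cost function (w − 3)², the starting point, and the learning rate are assumptions chosen just to show the slope deciding the direction:
def cost(w):
    return (w - 3) ** 2      # a simple bowl-shaped cost with its minimum at w = 3

def slope(w):
    return 2 * (w - 3)       # derivative of the cost

w = -4.0                     # start at the "red point" on the left
lr = 0.1
for step in range(50):
    w = w - lr * slope(w)    # negative slope -> move right, positive slope -> move left

print(w)                     # ends up very close to 3, the bottom of the curve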
So that’s all about gradient descent. It’s called descent because you are descending or minimizing the cost function.
I hope you understand What is Gradient Descent and How neural network learns. If you have any questions, feel free to ask me in the comment section. Read Stochastic Gradient Descent from here- Stochastic Gradient Descent- A Super Easy Complete Guide!
Enjoy Learning!
All the Best!

Tuesday, April 28, 2020

Hypothesis Testing

Agenda

  • Sampling distribution
  • Central Limit Theorem
  • Confidence intervals
  • Hypothesis formulation
  • Null and alternative hypothesis
  • Type I and Type II errors
  • Hypothesis testing
      • One-tailed vs. two-tailed test
      • Test of mean
      • Test of proportion
      • Test of variance
  • Examples



Concepts of Sampling Distribution

  • Why do we need sampling?
  • We analyse the sample and make inferences about the population.
  • Sample statistic vs. population parameter.
  • Sampling distribution: the distribution of a particular sample statistic over all possible samples that can be drawn from a population, e.g. the sampling distribution of the mean.




Sampling Distribution: CLT

  • If a sample of size n is drawn from a population with mean µ and standard deviation σ, the sampling distribution of the sample mean is normal with:
      • Mean: µ
      • Standard deviation: σ / √n (also called the Standard Error)
  • The corresponding z-score transformation is: z = (x̄ − µ) / (σ / √n)
  • If the population is normal, this holds true even for smaller sample sizes.
  • However, if the population is not normal, this holds true only for sufficiently large sample sizes.



Central Limit Theorem

  • "The sampling distribution of the mean of any independent random variable will be normal."
  • This applies to both discrete and continuous distributions.
  • The random variable should have a well-defined mean and variance (standard deviation).
  • It is applicable even when the original variable is not normally distributed.

  Assumptions:
  • The data must be randomly sampled.
  • The sample values must be independent of each other.
  • The 10% condition: when the sample is drawn without replacement, the sample size n should be no more than 10% of the population.
  • The sample size must be sufficiently large.
      – In general, a sample size of 30 is considered sufficient.
      – If the population is skewed, a fairly large sample size is needed.
      – For a symmetric population, even small samples are acceptable.

Central Limit Theorem (contd.)

Assume a dice is rolled in sets of 4 trials and the faces in each set are recorded and averaged. This is repeated for a month (30 days).
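A small simulation along those lines, as a sketch with assumed parameters (the number of repetitions is bumped well above 30 so the bell shape of the distribution of the means is easy to see):
import numpy as np

rng = np.random.default_rng(0)

# Each repetition: roll a fair die 4 times and take the mean of the faces.
means_of_4_rolls = rng.integers(1, 7, size=(10000, 4)).mean(axis=1)

print(means_of_4_rolls.mean())   # close to the population mean of a die, 3.5
print(means_of_4_rolls.std())    # close to sigma / sqrt(n) = 1.71 / 2 ≈ 0.85
# A histogram of means_of_4_rolls looks roughly bell-shaped, as the CLT predicts.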








Null and Alternative Hypothesis

  • All statistical conclusions are made in reference to the null hypothesis.
  • We either reject the null hypothesis or fail to reject the null hypothesis; we do not "accept" the null hypothesis.
  • From the start, we assume the null hypothesis to be true; later, the assumption is either rejected or we fail to reject it.
  • When we reject the null hypothesis, we can conclude that the alternative hypothesis is supported.
  • If we fail to reject the null hypothesis, it does not mean that we have proven the null hypothesis is true.
      – Failure to reject the null hypothesis does not equate to proving that it is true.
      – It just holds up our assumption, or the status quo.










Types of Hypothesis Tests

  • Single sample, or two or more samples
  • One-tailed or two-tailed
  • Tests of mean, proportion, or variance















Example Problem: Single-Sample z-Test of the Mean

You are the manager of a fast food restaurant. You want to determine whether the population mean waiting time has changed from the past value of 4.5 minutes. You can assume that the population standard deviation is 1.2 minutes. You select a sample of 25 orders in an hour, and the sample mean is 5.1 minutes. Use the relevant hypothesis test to determine whether the population mean has changed from 4.5 minutes.








Steps to solve the problem:

  • One-tailed or two-tailed?
  • What are H0 and Ha?
  • Determine the critical z value and the z-statistic.
  • Draw the normal curve.
  • Reject or fail to reject H0?
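A minimal sketch of those steps in Python for the restaurant example (α = 0.05 is an assumed significance level, since the problem does not state one):
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, xbar = 4.5, 1.2, 25, 5.1
alpha = 0.05                                  # assumed significance level

# Two-tailed test: H0: mu = 4.5, Ha: mu != 4.5
z_stat = (xbar - mu0) / (sigma / sqrt(n))     # (5.1 - 4.5) / (1.2 / 5) = 2.5
p_value = 2 * norm.sf(abs(z_stat))            # two-tailed p-value ≈ 0.012

if p_value < alpha:
    print("Reject H0: the mean waiting time has changed from 4.5 minutes")
else:
    print("Fail to reject H0")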





















Hypothesis Tests using Python

z-test

statsmodels.stats.weightstats.ztest(x1, x2=None, value=0, alternative='two-sided')

Link to refer -
https://www.statsmodels.org/stable/generated/statsmodels.stats.weightstats.ztest.html

t-test

scipy.stats.ttest_ind(a, b, axis=0, equal_var=True, nan_policy='propagate')

Link to refer -

Chi-square (χ2) test

scipy.stats.chisquare(f_obs, f_exp=None)

Link to refer -

F-test

import scipy.stats

alpha = 0.05  # or whatever you want your alpha to be
p_value = scipy.stats.f.sf(F, df1, df2)  # right-tailed p-value; F, df1, df2 come from the two samples
if p_value < alpha:
    print("Reject the null hypothesis that Var(X) == Var(Y)")







Hypothesis Testing Using Python

Two Sample Testing

Some important functions:

1. t_statistic, p_value = ttest_ind(group1, group2)   # independent two-sample t-test
2. u, p_value = mannwhitneyu(group1, group2)          # Mann-Whitney U test (non-parametric alternative)
3. t_statistic, p_value = ttest_1samp(post - pre, 0)  # paired comparison via a one-sample t-test on the differences
4. z_statistic, p_value = wilcoxon(post - pre)        # Wilcoxon signed-rank test (non-parametric paired test)
5. levene(pre, post)                                  # Levene's test for equality of variances
6. shapiro(post)                                      # Shapiro-Wilk test for normality
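A minimal usage sketch, with made-up pre/post numbers purely for illustration:
import numpy as np
from scipy.stats import ttest_ind, ttest_1samp, levene, shapiro

pre  = np.array([72, 75, 68, 80, 77, 74, 69, 71])   # hypothetical scores before
post = np.array([78, 80, 72, 85, 79, 77, 74, 76])   # hypothetical scores after

t_statistic, p_value = ttest_ind(pre, post)         # treats the two groups as independent
print(t_statistic, p_value)

t_statistic, p_value = ttest_1samp(post - pre, 0)   # paired: is the mean difference zero?
print(t_statistic, p_value)

print(levene(pre, post))                            # equal-variance check
print(shapiro(post))                                # normality check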















ANOVA: One-Way Classification

Assumptions:

  • The samples drawn from the different populations are independent and random.
  • The response variables of all the populations are normally distributed.
  • The variances of all the populations are equal.



























Hypothesis of One-Way ANOVA

H0 : µ1 = µ2 = µ3 = … = µk
  – All population means are equal.

H1 : Not all of the population means are equal.
  – For at least one pair, the population means are unequal.
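A minimal sketch of this test with scipy, using three made-up groups purely for illustration:
from scipy.stats import f_oneway

# Hypothetical response measurements from three populations
group1 = [23, 25, 21, 22, 24]
group2 = [30, 28, 29, 31, 27]
group3 = [24, 26, 25, 23, 27]

f_statistic, p_value = f_oneway(group1, group2, group3)
print(f_statistic, p_value)   # a small p-value means we reject H0 that all the means are equal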











Solutions and examples in the next part...