Description
Read the following article, and, in your own words, write at least 150 words describing various approaches to feature selection as well as 50 words on points of interest (i.e., interesting to you) from the ‘checklist’ provided in the article (200 words total):
https://machinelearningmastery.com/an-introduction-to-feature-selection
Read the adapted article (reproduced below), and perform the feature selection walk-throughs using the code provided. Retype or copy/paste the code into your own Jupyter notebook (Anaconda or a cloud-based tool such as Google Colab).
Submit a single Word document (doc/docx) with:
your responses to #1 (min 200 words)
a screenshot of your code and output from EVERY one of the four (4) walk-throughs in #2 (so at least four (4) screenshots, more if needed)
Feature Selection
Adapted from an article by Jason Brownlee
Feature selection is a process where you automatically select those features in your data that contribute most to the prediction variable or output in which you are interested.
Having irrelevant features in your data can decrease the accuracy of many models, especially linear algorithms like linear and logistic regression.
Three benefits of performing feature selection before modeling your data are:
Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
Improves Accuracy: Less misleading data means modeling accuracy improves.
Reduces Training Time: Less data means that algorithms train faster.
You can learn more about feature selection with scikit-learn in the article Feature selection.
Feature Selection for Machine Learning
We will explore four feature selection methods using the same dataset: the Pima Indians onset of diabetes dataset.
This is a binary classification problem where all of the features are numeric.
Dataset file (no need to download; the code provides the URL)
Dataset details (you should read through this to understand the dataset)
1. UNIVARIATE SELECTION
Statistical tests can be used to select those features that have the strongest relationship with the output/target variable.
The scikit-learn library includes SelectKBest, which provides multiple options using various statistical tests to select a specific number (k) of features.
For example, the ANOVA (ANalysis Of VAriance) F-value method is appropriate for numerical inputs and a categorical target, as in the Pima dataset. It is available via the f_classif() function. We will select the four (k=4) best features with this method.
# Feature Selection with Univariate Statistical Tests
import pandas as pd
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)
dataset = dataframe.values
# input features
X = dataset[:,0:8]
# target
Y = dataset[:,8]
# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, Y)
# summarize scores: show the name of each feature and its associated score
for i in range(len(fit.scores_)):
    print(names[i], round(fit.scores_[i], 1))
# apply the transformation to the input features, i.e., remove the columns that didn't make the cut
features = fit.transform(X)
# comparisons:
# names = [‘preg’, ‘plas’, ‘pres’, ‘skin’, ‘test’, ‘mass’, ‘pedi’, ‘age’, ‘class’]
# [ 39.67 213.162 3.257 4.304 13.281 71.772 23.871 46.141]
# create a new DataFrame with only the columns retained
new_X = pd.DataFrame(features, columns=[‘preg’,’plas’,’mass’,’age’])
# output first few rows of data for selected features
new_X.head()
For help on which statistical measure to use for your data, see the tutorial:
How to Choose a Feature Selection Method For Machine Learning
Note: Your results may vary given the stochastic (“random”) nature of the algorithm or evaluation procedure. Consider running the example a few times and compare the average outcome.
You can see the scores for each attribute and the four (k=4) attributes chosen (those with the highest scores): specifically, the features at indexes 0 (preg), 1 (plas), 5 (mass), and 7 (age).
[ 39.67 213.162 3.257 4.304 13.281 71.772 23.871 46.141]
[[ 6. 148. 33.6 50. ]
[ 1. 85. 26.6 31. ]
[ 8. 183. 23.3 32. ]
[ 1. 89. 28.1 21. ]
[ 0. 137. 43.1 33. ]]
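As an optional follow-up (a minimal sketch, not part of the original walk-through), the selected column names can be recovered programmatically with the fitted selector's get_support() method instead of being hard-coded as above; this assumes the fit, X, and names variables from the code above are still defined.
# recover the selected column names instead of hard-coding them
mask = fit.get_support()  # boolean array, True for each of the k columns that were kept
selected_names = [name for name, keep in zip(names[0:8], mask) if keep]
print(selected_names)
new_X = pd.DataFrame(fit.transform(X), columns=selected_names)
new_X.head()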
2. RECURSIVE FEATURE ELIMINATION
Recursive Feature Elimination (or RFE) works by recursively removing attributes and building a model on the remaining attributes.
It uses model accuracy to identify which attributes (and combination of attributes) contribute most to predicting the target variable.
You can learn more about the RFE class in the scikit-learn documentation.
The example below uses RFE with the logistic regression algorithm to select the top 3 features. The choice of algorithm does not matter too much for our example.
# Feature Extraction with RFE
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)
dataset = dataframe.values
# input features
X = dataset[:,0:8]
# target
Y = dataset[:,8]
# feature extraction
model = LogisticRegression(solver='lbfgs', max_iter=200)
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
# output results
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
# 'preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age'
# Selected Features: [ True False False False False True True False]
# Feature Ranking:   [ 1 2 3 5 6 1 1 4]
# SelectKBest ranks, for comparison: [ 4 1 8 7 6 2 5 3]
# create a new DataFrame
new_X = pd.DataFrame(X, columns=names[0:8])
# go through columns and create a list of the column names to drop
drop_cols = []
for i in range(len(new_X.columns)):
    if not fit.support_[i]:
        drop_cols.append(names[i])
# remove columns/features not selected
new_X.drop(columns=drop_cols, inplace=True)
# output data for selected features
new_X.head()
Note: Your results may vary given the stochastic (“random”) nature of the algorithm or evaluation procedure. Consider running the example a few times and compare the average outcome.
You can see that RFE chooses preg, mass, and pedi as the top three (3) features; their positions are marked "True" in the support_ array and "1" in the ranking_ array.
Num Features: 3
Selected Features: [ True False False False False True True False ]
Feature Ranking: [1 2 3 5 6 1 1 4]
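As an optional follow-up (a minimal sketch, assuming the fitted fit object and the names list from the code above are still defined), pairing each feature name with its RFE rank and support flag makes the two arrays easier to read:
# print each feature name with its RFE rank and whether it was selected
for name, rank, kept in zip(names[0:8], fit.ranking_, fit.support_):
    print(name, rank, 'selected' if kept else 'dropped')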
3. PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (or PCA) uses linear algebra to transform the dataset into a compressed form, projecting the data into a lower-dimensional space (e.g., from 3D to 2D).
PCA is a data reduction technique. A property of PCA is that you can choose the number of dimensions or “principal components” in the transformed result.
The example below uses PCA and selects three (3) principal components.
Learn more about the PCA class in scikit-learn by reviewing the PCA documentation. Dive deeper into the math behind PCA on the Principal Component Analysis Wikipedia article.
# Feature Extraction with PCA
import numpy
import pandas as pd
from sklearn.decomposition import PCA
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)
dataset = dataframe.values
# input features
X = dataset[:,0:8]
# target
Y = dataset[:,8]
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
# show the principal components (loadings): each row maps one component back to the 8 original features
print(fit.components_)
Note: Your results may vary given the stochastic (“random”) nature of the algorithm or evaluation procedure. Consider running the example a few times and compare the average outcome.
You can see that the three (3) principal components bear little resemblance to the source data. This is due to the 'compression' into a lower-dimensional space (from 8d to 3d).
Explained Variance: [ 0.88854663 0.06159078 0.02579012]
[[ -2.02176587e-03 9.78115765e-02 1.60930503e-02 6.07566861e-02
9.93110844e-01 1.40108085e-02 5.37167919e-04 -3.56474430e-03]
[ 2.26488861e-02 9.72210040e-01 1.41909330e-01 -5.78614699e-02
-9.46266913e-02 4.69729766e-02 8.16804621e-04 1.40168181e-01]
[ -2.24649003e-02 1.43428710e-01 -9.22467192e-01 -3.07013055e-01
2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]]
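If you also want the projected data itself (the three derived features) rather than the component loadings printed above, the sketch below wraps fit.transform(X) in a DataFrame, mirroring the new_X.head() step in the other walk-throughs; it assumes the fitted fit object and X from the code above, and the column labels pc1, pc2, and pc3 are just illustrative names.
# project the 8 original features onto the 3 principal components
new_X = pd.DataFrame(fit.transform(X), columns=['pc1', 'pc2', 'pc3'])
# output the first few rows of the derived features
new_X.head()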
4. FEATURE IMPORTANCE
Bagged decision trees like Random Forests and Extra Trees can be used to estimate the importance of features.
The example below uses an ExtraTreesClassifier for the Pima Indians onset of diabetes dataset. You can learn more about the ExtraTreesClassifier class in the scikit-learn API.
# Feature Importance with Extra Trees Classifier
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pd.read_csv(url, names=names)
dataset = dataframe.values
# input features
X = dataset[:,0:8]
# target
Y = dataset[:,8]
# feature extraction
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X, Y)
# output feature importances
print(model.feature_importances_)
# create a new DataFrame with only the columns retained, based on feature importance rank
new_X = pd.DataFrame(X[:,[1,5,6,7]], columns=['plas', 'mass', 'pedi', 'age'])
# output data for selected features
new_X.head()
Note: Your results may vary given the stochastic (“random”) nature of the algorithm or evaluation procedure. Consider running the example a few times and compare the average outcome.
You get an importance score for each feature, where a larger score indicates greater importance. The scores suggest the relative importance of plas, age, and mass, with pedi close behind, which is why the code above keeps those four columns.
[ 0.11070069 0.2213717 0.08824115 0.08068703 0.07281761 0.14548537 0.12654214 0.15415431]
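As an optional follow-up (a minimal sketch, assuming the fitted model and the names list from the code above are still defined), pairing each importance score with its feature name and sorting makes the ranking behind the column choice in new_X easier to see:
# pair each importance score with its feature name and sort from most to least important
ranked = sorted(zip(names[0:8], model.feature_importances_), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(name, round(score, 3))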
Summary
You explored feature selection for preparing machine learning data in Python with the scikit-learn library, using four specific methods:
Univariate Selection
Recursive Feature Elimination
Principal Component Analysis
Feature Importance