[Updated: January 07, 2019]
In the previous article, Machine Learning Basics and Perceptron Learning Algorithm, the assumption was that the Iris Data Set trained by the Perceptron Learning Algorithm is linearly separable, so the number of misclassifications in each training iteration eventually converges to 0. Obviously, this is not always the case in the real world. This raises two questions: what happens if a data set is not linearly separable, and how do we handle that scenario?
The previous article also shows that, at least, versicolor and virginica are not linearly separable. (See picture below)
(Source code can be found here)
If we use the Perceptron Learning Algorithm to train on these two types of iris plant, the weight update loop in the training function never stops on its own. The iteration ends only because we set a limit on the number of iterations, not because the number of misclassifications converges to zero. In addition, the output of the trained model is not stable: the weights held when the iteration limit is reached are not necessarily the best ones seen during training.
If we run the Perceptron Learning Algorithm on the Iris Data Set with both versicolor and virginica, the trend of the number of misclassifications looks like the picture below.
(Source code can be found here)
The picture shows that the number of misclassifications jumps up and down from one iteration to the next.
Note: a minimal in-sample error does not guarantee that the learning model performs equally well out of sample. If the in-sample error is low but the out-of-sample error is high, the model is overfitting, which is not the main topic of this article.
In short, when we use a linear model, like the Perceptron Learning Algorithm, on a non-separable data set, the following things may happen:
- The update function never stops
- The final output is not guaranteed to have the optimal in-sample error
One approach to approximate the optimal in-sample error is to extend the Perceptron Learning Algorithm with a simple modification, the so-called Pocket Learning Algorithm.
Pocket Learning Algorithm
The idea is very simple: the algorithm keeps the best result seen so far in its pocket (that is why it is called the Pocket Learning Algorithm). The best result is the one with the smallest number of misclassifications. If the new weights produce fewer misclassifications than the weights in the pocket, replace the weights in the pocket with the new weights; otherwise, keep the weights in the pocket and discard the new ones. At the end of the training iterations, the algorithm returns the solution in the pocket rather than the last solution.
The Learning Steps
- Initialize the pocket weight vector, $W_{pocket}$, to 0 or small random numbers, and use it as the initial weight vector, $W_0$, of the Perceptron Learning Algorithm.
- For each training iteration, perform the following sub-steps:
  - Run the training step of the Perceptron Learning Algorithm to obtain the updated weight vector, $W_t$, where $t$ indicates the current iteration.
  - Evaluate $W_t$ by comparing its number of misclassifications on the entire sample set with the number of misclassifications made by $W_{pocket}$.
  - If $W_t$ is better than $W_{pocket}$, replace $W_{pocket}$ with $W_t$.
- Return $W_{pocket}$ when the training iteration terminates.
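The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation used in this article: the names `predict`, `count_errors`, and `pocket_train` are made up for the sketch, labels are assumed to be already encoded as -1 and 1, and each iteration performs a single perceptron update on one randomly chosen misclassified sample before comparing against the pocket.

```python
import random


def predict(weights, sample):
    """Return 1 or -1; weights[0] is the bias term."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], sample))
    return 1 if total >= 0.0 else -1


def count_errors(weights, samples, labels):
    """Number of misclassified samples on the entire sample set."""
    return sum(predict(weights, s) != y for s, y in zip(samples, labels))


def pocket_train(samples, labels, max_iterations=100, seed=0):
    """Return (pocket_weights, pocket_errors)."""
    rng = random.Random(seed)
    weights = [0.0] * (len(samples[0]) + 1)   # W_0
    pocket = list(weights)                    # W_pocket
    pocket_errors = count_errors(pocket, samples, labels)

    for _ in range(max_iterations):
        wrong = [i for i, (s, y) in enumerate(zip(samples, labels))
                 if predict(weights, s) != y]
        if not wrong:
            break
        # One perceptron update on a randomly chosen misclassified sample.
        i = rng.choice(wrong)
        weights[0] += labels[i]
        for j, x in enumerate(samples[i]):
            weights[j + 1] += labels[i] * x
        # Keep W_t only if it beats the pocket on the whole sample set.
        errors = count_errors(weights, samples, labels)
        if errors < pocket_errors:
            pocket, pocket_errors = list(weights), errors  # copy, not alias
    return pocket, pocket_errors


# A tiny one-feature data set that no single threshold separates.
samples = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
labels = [-1, -1, 1, -1, 1, 1]
weights, errors = pocket_train(samples, labels)
print(weights, errors)
```

Note that the pocket stores a copy of the weights. Storing a reference to a list or array that is later updated in place would silently overwrite the pocket.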
The Pocket Learning Algorithm can be implemented simply by extending the Perceptron Learning Algorithm.
import numpy as np


class Pocket:
    """The Pocket class keeps the best weights seen so far in the
    learning process.

    Attributes
    ----------
    best_weights: list of float
        The list of the best weights seen so far.

    misclassify_count: int
        The number of misclassifications corresponding to the best
        weights.
    """
    def __init__(self, number_of_attributes: int):
        """Initializer of Pocket.

        Parameters
        ----------
        number_of_attributes: int
            The number of attributes of the data set.
        """
        self.best_weights = np.zeros(number_of_attributes + 1)
        # -1 means the pocket is initialized but does not hold a
        # valid value yet.
        self.misclassify_count = -1


class PocketClassifier:
    """Pocket Binary Classifier uses a modified Perceptron Learning
    Algorithm, called the Pocket Learning Algorithm, to classify
    two-class data.

    Attributes
    ----------
    pocket: Pocket
        The pocket contains the best training result so far and the
        number of misclassified samples according to the result
        in the pocket.

    weights: list of float
        The list of weights corresponding to the input attributes.

    misclassify_record: list of int
        The number of misclassifications for each training iteration.

    Methods
    -------
    train(samples: [[]], labels: [], max_iterator: int = 10)
        Train the pocket learning algorithm with samples.

    classify(new_data: [[]]) -> []
        Classify the input data.

    Examples
    --------
    Two dimensions list and each sample has four attributes
    >>> import pocket_classifier
    >>> samples = [[5.1, 3.5, 1.4, 0.2],
    ...            [4.9, 3.0, 1.4, 0.2],
    ...            [4.7, 3.2, 1.3, 0.2],
    ...            [4.6, 3.1, 1.5, 0.2],
    ...            [5.0, 3.6, 1.4, 0.2],
    ...            [5.4, 3.9, 1.7, 0.4],
    ...            [7.0, 3.2, 4.7, 1.4],
    ...            [6.4, 3.2, 4.5, 1.5],
    ...            [6.9, 3.1, 4.9, 1.5],
    ...            [5.5, 2.3, 4.0, 1.3],
    ...            [6.5, 2.8, 4.6, 1.5],
    ...            [5.7, 2.8, 4.5, 1.3]]

    Binary classes with class -1 or 1.
    >>> labels = [-1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1]
    >>> pocket_classifier = pocket_classifier.PocketClassifier(4, (-1, 1))
    >>> pocket_classifier.train(samples, labels)
    >>> new_data = [[6.3, 3.3, 4.7, 1.6], [4.6, 3.4, 1.4, 0.3]]

    Predict the class for the new_data
    >>> pocket_classifier.classify(new_data)
    [1, -1]
    """
    def __init__(self, number_of_attributes: int, class_labels: ()):
        """Initializer of PocketClassifier.

        Parameters
        ----------
        number_of_attributes: int
            The number of attributes of the data set.

        class_labels: tuple of the class labels
            The class labels can be anything as long as there are
            only two types of labels.
        """
        # Initialize the Pocket class.
        self.pocket = Pocket(number_of_attributes)
        # Initialize the weights to zero.
        # The size is the number of attributes
        # plus the bias, i.e. x_0 * w_0.
        self.weights = np.zeros(number_of_attributes + 1)

        # Record of the number of misclassifications for each
        # training iteration.
        self.misclassify_record = []

        # Build the label maps between the original labels and the
        # numerical labels. For example, ("a", "b") -> {1: "a", -1: "b"}
        self._label_map = {1: class_labels[0], -1: class_labels[1]}
        self._reversed_label_map = {class_labels[0]: 1, class_labels[1]: -1}

    def _linear_combination(self, sample: []) -> float:
        """Linear combination of sample and weights (without the bias).
        """
        return np.inner(sample, self.weights[1:])

    def train(self,
              samples: [[]],
              labels: [],
              max_iterator: int = 10) -> None:
        """Train the model with samples.

        Parameters
        ----------
        samples: two dimensions list
            Training data set.

        labels: list of labels
            The class labels of the training data.

        max_iterator: int, optional
            The maximum number of training iterations.
            The default is 10.
        """
        # Transfer the labels to numerical labels
        transferred_labels = [
            self._reversed_label_map[index] for index in labels
        ]

        for _ in range(max_iterator):
            misclassifies = 0
            for sample, target in zip(samples, transferred_labels):
                # Include the bias so that training uses the same
                # decision function as classify().
                linear_combination = (self._linear_combination(sample)
                                      + self.weights[0])
                update = target - np.where(linear_combination >= 0.0, 1, -1)

                # use numpy.multiply to multiply element-wise
                self.weights[1:] += np.multiply(update, sample)
                self.weights[0] += update

                # record the number of misclassifications
                misclassifies += int(update != 0.0)

            # Update the pocket if the result is better than the one
            # in the pocket.
            if (self.pocket.misclassify_count == -1) \
                    or (self.pocket.misclassify_count > misclassifies) \
                    or (misclassifies == 0):
                # Store a copy: self.weights is updated in place, so
                # storing a reference would overwrite the pocket.
                self.pocket.best_weights = self.weights.copy()
                self.pocket.misclassify_count = misclassifies

            if misclassifies == 0:
                break

            self.misclassify_record.append(self.pocket.misclassify_count)

    def classify(self, new_data: [[]]) -> []:
        """Classify the samples based on the weights in the pocket.

        Parameters
        ----------
        new_data: two dimensions list
            New data to be classified.

        Return
        ------
        List of labels
            The list of predicted class labels.
        """
        # Use the best weights seen so far, not the last ones.
        weights = self.pocket.best_weights
        predicted_result = np.where(
            (np.inner(new_data, weights[1:]) + weights[0]) >= 0.0, 1, -1)
        return [self._label_map[item] for item in predicted_result]
(Complete source code can be found here)
Note that this algorithm is straightforward, but not perfect. The returned result is only the best result seen so far, not necessarily the optimal one, so it depends on the order of the training samples and on the iteration limit.
Apply Pocket Learning Algorithm onto Japanese Credit Screening Data Set
It is always easier to illustrate an idea with an example. This section uses the Japanese Credit Screening Data Set from the UCI Machine Learning Repository, which contains both numerical and categorical features, and in which some samples have missing values.
However, most machine learning algorithms assume that the data set is numerical, and the Pocket Learning Algorithm is no exception. For the Pocket Learning Algorithm to work on the Japanese Credit Screening Data Set, we need to convert the non-numerical features into numerical ones. This type of process is usually called feature engineering. The next section provides a brief introduction to feature engineering.
Basic Feature Engineering
The quality of the data and the amount of useful information it contains are key factors that determine how well a machine learning algorithm can learn. In other words, using raw data sets directly can negatively affect the performance of a learning process. Therefore, feature engineering is normally the first step in a machine learning process.
Training and Test Sets
Overfitting is a common pitfall of machine learning processes. It happens when a learning process achieves a very low (even 0) in-sample error but a large out-of-sample error on yet-unseen data. To detect overfitting, a common practice is to hold out part of the data as a test data set instead of using the entire data set as the training set, and to train the model only on the rest of the data. Once the learning model is trained, we use the test data set to verify the model's performance.
In the example in the article Machine Learning Basics and Perceptron Learning Algorithm, we manually separated the Iris Data Set into a training set and a test set by taking the first 10 samples from each class and aggregating them into a test set; the rest of the samples formed the training set. In fact, this is not how we would (or should) normally split a data set. If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised. This issue is called data snooping. Manually manipulating the data set may result in data snooping bias. (Data snooping bias is outside the scope of this article and will be discussed in future articles.)
The appropriate way to create a training set and a test set is to pick the test data randomly while meeting the following criteria:
- Both the training and the test sets must reflect the original distribution.
- The original data set must be randomly shuffled.
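To make the two criteria concrete, here is a minimal hand-written sketch using only the standard library. The function name `stratified_split` and the toy data are assumptions made for illustration: each class is shuffled separately and the same fraction of every class is held out, so both subsets keep the original class proportions.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_size=0.25, seed=1000):
    """Shuffle each class separately and hold out test_size of it."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    # Shuffle again so the classes are interleaved in each subset.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)

    def pick(idx):
        return [samples[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)


# Toy data: 40 samples for each of two classes.
samples = [[float(i)] for i in range(80)]
labels = ["versicolor"] * 40 + ["virginica"] * 40
(train_x, train_y), (test_x, test_y) = stratified_split(samples, labels)
print(len(train_y), len(test_y))
```

In practice a library routine is usually preferable to hand-written splitting; the sketch only shows what such a routine has to guarantee.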
Scikit-learn, an open source machine learning library, provides a convenient function, train_test_split, to split a data set into a training set and a test set.
import urllib.request

# pandas is an open source library providing high-performance,
# easy-to-use data structures and data analysis tools. http://pandas.pydata.org/
import pandas as pd
# scikit-learn is a Python machine learning library. The train_test_split
# function splits a data set into a training subset and a test subset.
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn import model_selection

import perceptron_classifier

# Download Iris Data Set from
# http://archive.ics.uci.edu/ml/datasets/Iris
URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
urllib.request.urlretrieve(URL, "iris.data")
# Use pandas' read_csv function to read iris.data into a DataFrame.
# Note: the iris.data is headerless, so header is None.
IRIS_DATA = pd.read_csv("iris.data", header=None)

# Try only versicolor and virginica
LABELS = IRIS_DATA.iloc[50:150, 4].values
DATA = IRIS_DATA.iloc[50:150, [0, 2]].values

# Use scikit-learn's train_test_split function to separate the
# Iris Data Set into a training subset (75% of the data) and
# a test subset (25% of the data)
DATA_TRAIN, DATA_TEST, LABELS_TRAIN, LABELS_TEST = \
    model_selection.train_test_split(DATA, LABELS,
                                     test_size=0.25,
                                     random_state=1000)

perceptron_classifier = perceptron_classifier.PerceptronClassifier(
    number_of_attributes=2, class_labels=('Iris-versicolor', 'Iris-virginica'))
perceptron_classifier.train(DATA_TRAIN, LABELS_TRAIN, 100)
result = perceptron_classifier.classify(DATA_TEST)

misclassify = 0
for predict, answer in zip(result, LABELS_TEST):
    if predict != answer:
        misclassify += 1
print("Accuracy rate: %2.2f"
      % (100 * (len(result) - misclassify) / len(result)) + "%")
(The complete example can be found here)
Categorical Data
Once again, most machine learning algorithms only accept numerical data. To train on samples that contain categorical data, we need to encode the categorical data as numerical data. Categorical data can be either ordinal or nominal. Ordinal data can be sorted or ordered; in other words, ordinal values indicate rank and position. For instance, education level could be ordinal: graduate school, college, high school, which can be ranked as graduate school > college > high school. In contrast, nominal data are labels only. The values do not imply any order, quantity, or any other measurement. For example, marital status: single, married, divorced, widowed. It does not make sense to say that single is higher or larger (or any other measurement) than married.
Encoding Ordinal Data
To make sure a machine learning algorithm interprets ordinal data properly, we need to convert the categorical data to numerical data in a way that preserves the order or rank. Since encoding ordinal data with the proper rank requires knowledge of the data, there is no convenient function that automatically converts the categorical data to numerical data with the correct order. Thus, we need to define the mapping manually based on our understanding of the data.
Take the education level as an example; the mapping could be graduate school = college + 1 = high school + 2.
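A manual mapping of this kind is just a dictionary lookup. The sketch below is illustrative only: the concrete rank values and the names `EDUCATION_RANK` and `encode_ordinal` are assumptions; only the relative order matters.

```python
# Hypothetical ranks: high school < college < graduate school.
EDUCATION_RANK = {"high school": 1, "college": 2, "graduate school": 3}


def encode_ordinal(values, rank):
    """Replace each category with its rank."""
    return [rank[value] for value in values]


education = ["college", "high school", "graduate school", "college"]
encoded = encode_ordinal(education, EDUCATION_RANK)
print(encoded)  # [2, 1, 3, 2]
```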
Encoding Nominal Data
One common mistake in dealing with categorical data is to treat nominal data as ordinal data. Take marital status as an example: a naive encoding might map the values to numbers as follows.
- Single -> 0
- Married -> 1
- Divorced -> 2
- Widowed -> 3
Although the marital values do not imply any order or rank, a machine learning algorithm may assume that widowed is higher or larger (or some other measurement) than divorced, and so on. Based on this incorrect assumption, the algorithm may still produce useful output, but the performance may not be optimal. One approach to handle this scenario is one-hot encoding. The basic idea is to create a dummy feature for each unique value in the original categorical data. For example, marital status can be encoded as follows.
- Single -> (1, 0, 0, 0)
- Married -> (0, 1, 0, 0)
- Divorced -> (0, 0, 1, 0)
- Widowed -> (0, 0, 0, 1)
This way, the machine learning algorithm treats the feature as different labels instead of assuming the feature has rank or order.
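The mapping above can be reproduced by hand in a few lines. This is a minimal sketch with a made-up helper name, `one_hot`; it assigns the dummy positions in the order in which the distinct values first appear.

```python
def one_hot(values):
    """Return the encoded values and the category-to-vector table."""
    categories = list(dict.fromkeys(values))  # distinct, first-seen order
    table = {
        category: tuple(int(i == j) for j in range(len(categories)))
        for i, category in enumerate(categories)
    }
    return [table[value] for value in values], table


status = ["single", "married", "divorced", "widowed", "married"]
encoded, table = one_hot(status)
for name, vector in table.items():
    print(name, "->", vector)
```

Every vector contains exactly one 1, so no category appears larger or smaller than another.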
The following example demonstrates how to encode categorical data.
import urllib.request

import numpy as np
import pandas as pd
from sklearn import preprocessing


def one_hot_encoder(data=[]) -> []:
    """Transfer categorical data to numerical data based on one hot
    encoding approach.

    Parameters
    ----------
    data: list of categorical data.
        One dimension list.

    Returns
    -------
    List of int
        The list of the encoded data based on one hot encoding approach.
    """
    # Use LabelEncoder to transfer the categorical data to integers
    # first. LabelEncoder numbers the sorted categories, for example,
    # f -> 0; t -> 1
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
    LABEL_ENCODER = preprocessing.LabelEncoder()
    numerical_data = LABEL_ENCODER.fit_transform(data)
    two_d_array = [[item] for item in numerical_data]

    # Use scikit-learn OneHotEncoder to encode the feature
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
    encoder = preprocessing.OneHotEncoder()
    encoder.fit(two_d_array)
    return encoder.transform(two_d_array).toarray()


if __name__ == "__main__":
    # Download the Japanese Credit Data Set from
    # http://archive.ics.uci.edu/ml/datasets/Japanese+Credit+Screening
    URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
    urllib.request.urlretrieve(URL, "crx.data")

    # Use one-hot encoding to transfer the A9 attribute of the
    # Japanese Credit Data Set:
    # A9: t, f.
    # The encoded output may look like:
    # t -> [0, 1]
    # f -> [1, 0]
    crx_data = pd.read_csv("crx.data", header=None)
    A9 = crx_data.iloc[:, 8].values
    A9_encoded = one_hot_encoder(A9)

    for index in range(len(A9_encoded)):
        print(str(A9[index]) + " -> " + str(A9_encoded[index]))
(The complete example can be found here)
Missing Features
The Japanese Credit Screening Data Set has missing features on some samples, which is not uncommon in real-world machine learning problems. Fortunately, a data set that has missing fields or attributes is not necessarily useless. Several approaches can be used to fill the gaps.
Remove them
As the title states, this approach discards the entire rows or columns that contain missing values. Removing is the simplest option, but it should be considered only if the data set is large, for an obvious reason: the fewer samples there are to train on, the harder it is to avoid overfitting. Sometimes, though, dropping is effective. For example, a feature that has missing values on some samples may be just noise, in which case removing the entire feature column may help. However, whether a feature is noise is usually not known in advance.
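As a minimal sketch of both variants, assuming the data is a list of rows and missing entries are marked with "?" (the marker used in crx.data), dropping rows or columns might look like this. The helper names and the sample rows are made up for illustration.

```python
MISSING = "?"


def drop_incomplete_rows(rows, missing=MISSING):
    """Keep only the rows without any missing value."""
    return [row for row in rows if missing not in row]


def drop_incomplete_columns(rows, missing=MISSING):
    """Keep only the columns without any missing value."""
    keep = [j for j in range(len(rows[0]))
            if all(row[j] != missing for row in rows)]
    return [[row[j] for j in keep] for row in rows]


# Made-up rows for illustration.
rows = [["b", 30.83, "u"],
        ["?", 58.67, "u"],
        ["a", 24.50, "?"],
        ["b", 27.83, "y"]]

print(drop_incomplete_rows(rows))     # two complete rows remain
print(drop_incomplete_columns(rows))  # only the middle column remains
```

With four samples, dropping rows already discards half of the data, which illustrates why this option is better suited to large data sets.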
Predict the missing values
The basic idea is to use the existing data to predict the missing values. This is not an easy approach: we need to define a supervised learning problem, solve it, and use the resulting sub-model to predict the missing data. Not only is the performance of the sub-model hard to evaluate, but the accuracy of its predictions also affects the learning performance of the overall problem.
Fill up the missing values based on the known values
The first two options are either too difficult to achieve or not practical at all. One reasonable and practical approach is to impute the missing data from the other known values. This strategy replaces missing values with something that can be obtained from the existing values, such as the mean, the median, or the most frequent value. The following example demonstrates how to impute missing values by frequency and by mean.
import collections
import urllib.request
from typing import Any  # For type hints

import numpy as np
import pandas as pd
from sklearn import impute


def imputer_by_most_frequent(missing_values: Any, data: []) -> []:
    """Impute missing values by frequency, i.e., the value that
    appears most often.

    Parameters
    ----------
    missing_values: Any
        The missing value can be np.nan, "?", or whatever character
        which indicates missing value.

    data: []
        The list of the data.

    Returns
    -------
    List of data
        The list of the data based on the most frequent approach.
    """
    def is_missing(item: Any) -> bool:
        # NaN never compares equal to itself (and may not be the same
        # object as np.nan), so test for it explicitly.
        if isinstance(item, float) and np.isnan(item):
            return (isinstance(missing_values, float)
                    and np.isnan(missing_values))
        return item == missing_values

    # Find the value that appears most often among the known values.
    known = [item for item in data if not is_missing(item)]
    most = collections.Counter(known).most_common(1)[0][0]
    return [most if is_missing(item) else item for item in data]


if __name__ == "__main__":
    # Download Japanese Credit Data Set from http://archive.ics.uci.edu/ml/datasets/Japanese+Credit+Screening
    URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
    urllib.request.urlretrieve(URL, "crx.data")

    # Use the most frequent approach to impute the missing
    # values in A1:
    # A1: b, a
    #
    # Use the mean approach to impute the missing
    # values in A2:
    # A2: continuous
    crx_data = pd.read_csv("crx.data", header=None)
    # The Japanese Credit Data Set uses "?" to denote missing values.
    # Replace it with np.nan, which is what scikit-learn's imputer
    # expects for A2. For A1, which uses imputer_by_most_frequent(),
    # this transformation is not strictly necessary.
    crx_data.replace("?", np.nan, inplace=True)

    A1_no_missing = imputer_by_most_frequent(np.nan, crx_data.iloc[:, 0].values)
    print(A1_no_missing)

    # Use scikit-learn's SimpleImputer to impute missing values by mean.
    # (preprocessing.Imputer was removed in scikit-learn 0.22;
    # impute.SimpleImputer is its replacement.)
    # http://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
    imputer = impute.SimpleImputer(missing_values=np.nan, strategy="mean")
    # The imputer only accepts two-dimensional input, so select the
    # column as a one-column frame and convert the strings to float.
    A2_two_d = crx_data.iloc[:, [1]].astype(float).values
    A2_no_missing = imputer.fit_transform(A2_two_d)
    print(A2_no_missing)
(The complete example can be found here)
Feature Scaling
Most machine learning algorithms accept any numerical data. However, most of them perform better if the features are on the same scale, because an algorithm has no built-in sense of the different scales of each feature. For example, assume we have a data set with two features: the first ranges from 1 to 10, and the second ranges from 1 to 1000. When we apply the Perceptron Learning Algorithm to this data set, the update function is dominated by the second feature, because each update adds a multiple of the sample to the weights. Therefore, it may result in a biased output.
Normally, there are two approaches to bring features onto the same scale: normalization and standardization.
Normalization
Normalization means adjusting values measured on different scales to a common scale. Most often, normalization refers to rescaling the features to the range [0, 1], although any other bounds can be chosen. Normalization can be achieved by the following formula, min-max scaling:
$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$, where $x$ is a sample, $x_{min}$ is the smallest value in the feature column, and $x_{max}$ is the largest value.
Although normalization is useful when we need values in a bounded interval, it does have a drawback: if a data set has outliers (which is common), the normalization may be heavily affected by them, i.e., $x_{max}$ may be huge or $x_{min}$ may be tiny. As a result, the typical values may be squeezed into a very small interval.
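The effect of a single outlier is easy to see with a hand-written version of the formula. The helper name `min_max_scale` and the numbers are made up for illustration.

```python
def min_max_scale(column):
    """Rescale a feature column to [0, 1] with min-max scaling."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


typical = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = typical + [1000.0]

scaled = min_max_scale(typical)
scaled_outlier = min_max_scale(with_outlier)
print(scaled)          # [0.0, 0.25, 0.5, 0.75, 1.0]
print(scaled_outlier)  # the first five values are all below 0.005
```

Without the outlier the five values spread evenly over [0, 1]; with it they are compressed into less than half a percent of the interval.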
Standardization
In contrast to normalization, which bounds values in an interval, standardization centers each feature at mean 0 with standard deviation 1. It shifts and rescales the values but does not change the shape of their distribution. The process of standardization can be expressed by the following formula:
$x_{std} = \frac{x - \mu}{\sigma}$, where $\mu$ is the sample mean of a particular feature column and $\sigma$ is the corresponding standard deviation.
This way, standardized data keeps useful information about outliers and makes a learning algorithm less sensitive to the outliers. However, standardization has a drawback too. Unlike normalization, standardization is not bounded.
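A minimal sketch of the formula with the standard library follows. The helper name `standardize` is an assumption, and the population standard deviation is used here; a sample standard deviation would work equally well for illustration.

```python
import statistics


def standardize(column):
    """Center a feature column at 0 and scale it to unit deviation."""
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)  # population standard deviation
    if sigma == 0:
        # A constant column has no spread; map it to 0.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


column = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]
z = standardize(column)
print(z)
```

The result has mean 0 and standard deviation 1, but the value that corresponds to the outlier is larger than 1, which shows that standardized values are not confined to a fixed interval.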
Example
Now, we have enough tools to demonstrate the Pocket Learning Algorithm on the Japanese Credit Screening Data Set.
import collections
from typing import Any  # For type hints
from urllib import request

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import impute
from sklearn import model_selection
from sklearn import preprocessing

import pocket_classifier
import perceptron_classifier

sns.set()  # set the default seaborn theme, scaling, and color palette.


def imputer_by_most_frequent(missing_values: Any, data: []) -> []:
    """Impute missing values by frequency, i.e., the value that
    appears most often.

    Parameters
    ----------
    missing_values: Any
        The missing value can be np.nan, "?", or whatever character
        which indicates missing value.

    data: []
        The list of the data.

    Returns
    -------
    List of data
        The list of the data based on the most frequent approach.
    """
    def is_missing(item: Any) -> bool:
        # NaN never compares equal to itself (and may not be the same
        # object as np.nan), so test for it explicitly.
        if isinstance(item, float) and np.isnan(item):
            return (isinstance(missing_values, float)
                    and np.isnan(missing_values))
        return item == missing_values

    known = [item for item in data if not is_missing(item)]
    most = collections.Counter(known).most_common(1)[0][0]
    return [most if is_missing(item) else item for item in data]


def one_hot_encoder(data=[]) -> []:
    """Transfer categorical data to numerical data based on one hot
    encoding approach.

    Parameters
    ----------
    data: list of categorical data.
        One dimension list.

    Returns
    -------
    List of int
        The list of the encoded data based on one hot encoding approach.
    """
    LABEL_ENCODER = preprocessing.LabelEncoder()
    numerical_data = LABEL_ENCODER.fit_transform(data)
    two_d_array = [[item] for item in numerical_data]

    encoder = preprocessing.OneHotEncoder()
    encoder.fit(two_d_array)
    return encoder.transform(two_d_array).toarray()


if __name__ == "__main__":
    # Download Japanese Credit Data Set from
    # http://archive.ics.uci.edu/ml/datasets/Japanese+Credit+Screening
    URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
    request.urlretrieve(URL, "crx.data")

    # Use pandas.read_csv to load the Japanese Credit Data Set
    # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
    crx_data = pd.read_csv("crx.data", header=None)
    crx_data.replace("?", np.nan, inplace=True)

    # Transfer the categorical data to numerical data and impute
    # missing data:
    #  A1: b, a. (missing) frequency
    #  A2: continuous. (missing) mean
    #  A3: continuous.
    #  A4: u, y, l, t. (missing) frequency
    #  A5: g, p, gg. (missing) frequency
    #  A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff. (missing) frequency
    #  A7: v, h, bb, j, n, z, dd, ff, o. (missing) frequency
    #  A8: continuous.
    #  A9: t, f.
    # A10: t, f.
    # A11: continuous.
    # A12: t, f.
    # A13: g, p, s.
    # A14: continuous. (missing) mean
    # A15: continuous.
    # A16: +,- (class label)
    A1_no_missing = imputer_by_most_frequent(np.nan,
                                             crx_data.iloc[:, 0].values)
    A1_encoded = one_hot_encoder(A1_no_missing)

    # preprocessing.Imputer was removed in scikit-learn 0.22;
    # impute.SimpleImputer is its replacement.
    imputer = impute.SimpleImputer(missing_values=np.nan, strategy="mean")
    A2_two_d = crx_data.iloc[:, [1]].astype(float).values
    A2_no_missing = imputer.fit_transform(A2_two_d)

    A3 = crx_data.iloc[:, 2].values

    A4_no_missing = imputer_by_most_frequent(np.nan,
                                             crx_data.iloc[:, 3].values)
    A4_encoded = one_hot_encoder(A4_no_missing)

    A5_no_missing = imputer_by_most_frequent(np.nan,
                                             crx_data.iloc[:, 4].values)
    A5_encoded = one_hot_encoder(A5_no_missing)

    A6_no_missing = imputer_by_most_frequent(np.nan,
                                             crx_data.iloc[:, 5].values)
    A6_encoded = one_hot_encoder(A6_no_missing)

    A7_no_missing = imputer_by_most_frequent(np.nan,
                                             crx_data.iloc[:, 6].values)
    A7_encoded = one_hot_encoder(A7_no_missing)

    A8 = crx_data.iloc[:, 7].values
    A9_encoded = one_hot_encoder(crx_data.iloc[:, 8].values)
    A10_encoded = one_hot_encoder(crx_data.iloc[:, 9].values)
    A11 = crx_data.iloc[:, 10].values
    A12_encoded = one_hot_encoder(crx_data.iloc[:, 11].values)
    A13_encoded = one_hot_encoder(crx_data.iloc[:, 12].values)

    A14_two_d = crx_data.iloc[:, [13]].astype(float).values
    A14_no_missing = imputer.fit_transform(A14_two_d)

    A15 = crx_data.iloc[:, 14].values

    # Aggregate all the encoded data together into a two-dimension set
    data = list()
    label = list()
    for index in range(len(crx_data)):
        temp = np.append(A1_encoded[index], A2_no_missing[index])
        temp = np.append(temp, A3[index])
        temp = np.append(temp, A4_encoded[index])
        temp = np.append(temp, A5_encoded[index])
        temp = np.append(temp, A6_encoded[index])
        temp = np.append(temp, A7_encoded[index])
        temp = np.append(temp, A8[index])
        temp = np.append(temp, A9_encoded[index])
        temp = np.append(temp, A10_encoded[index])
        temp = np.append(temp, A11[index])
        temp = np.append(temp, A12_encoded[index])
        temp = np.append(temp, A13_encoded[index])
        temp = np.append(temp, A14_no_missing[index])
        temp = np.append(temp, A15[index])
        data.append(temp.tolist())
        label.append(crx_data[15][index])

    # Use scikit-learn's MinMaxScaler to scale the data set.
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
    min_max_scaler = preprocessing.MinMaxScaler()
    data_minmax = min_max_scaler.fit_transform(data)
    features = len(data[0])

    # Use scikit-learn's train_test_split function to separate
    # the data set into a training subset (75% of the data)
    # and a test subset (25% of the data).
    DATA_TRAIN, DATA_TEST, LABELS_TRAIN, LABELS_TEST = \
        model_selection.train_test_split(data_minmax, label,
                                         test_size=0.25,
                                         random_state=1000)

    pocket_classifier = pocket_classifier.PocketClassifier(features, ("+", "-"))
    pocket_classifier.train(DATA_TRAIN, LABELS_TRAIN, 100)
    result = pocket_classifier.classify(DATA_TEST)

    misclassify = 0
    for predict, answer in zip(result, LABELS_TEST):
        if predict != answer:
            misclassify += 1
    print("Accuracy rate: %2.2f"
          % (100 * (len(result) - misclassify) / len(result)) + "%")
(Source code can be found here)
Note that the purpose of this example is to demonstrate the Pocket Learning Algorithm and the idea of feature engineering. Therefore, the example may not be the most efficient in terms of feature engineering and Python programming, and the result (about 85% accuracy) can be improved as well.