Machine Learning Basics: Pocket Learning Algorithm and Basic Feature Engineering

[Updated: February 16, 2025]

Table of Contents

In the previous article, Machine Learning Basics and Perceptron Learning Algorithm, the assumption was that the Iris Data Set trained by the Perceptron Learning Algorithm is linearly separable. Hence, the misclassifications on each training iteration eventually converge to 0. This is not always the case in the real world. Therefore, a question must be asked: What will happen if a data set is not linearly separable? How do we handle this scenario?

The previous article also shows that, at least, Versicolor and Virginia are not linearly separable. (See picture below)

(Source code can be found at iris_example.py)

If we use the Perceptron Learning Algorithm to train these two types of iris plants, the weight update function, i.e., the loop in the training function, never stops. The iteration ends eventually because we set a limit of the number of iterations, not because the misclassifications converge to zero. In addition, the trained model’s output is not stable. In other words, the output does not seem optimal when the iteration meets the limit.

If we run the Perceptron Learning Algorithm on the Iris Data Set with both Versicolor and Virginia, the trend of the number of misclassifications looks like the picture below.

(Source code can be found at iris_example.py)

The picture shows that the trend of the number of misclassifications jumps up and down.

Note: The minimum in-sample error does not guarantee that the learning model will perform the same on out-of-sample error. If the in-sample error is low but the out-of-sample error is high, it is called overfitting.

In short, when we use a linear model, like Perceptron Learning Algorithm, on a non-separable data set, the following things may happen:

The update function never stops
The final output does not guarantee that the in-sample error is optimal

One approach to approximate the optimal in-sample error is to extend the Perceptron Learning Algorithm by modifying it, forming the Pocket Learning Algorithm.

Pocket Learning Algorithm

The idea is very simple: this algorithm keeps the best result seen so far in its pocket (that is why it is called Pocket Learning Algorithm). The best result means the number of misclassifications is minimal. If the new weights produce fewer misclassifications than the weights in the pocket, replace the weights in the pocket with the new weights; if the new weights are not better than the ones in the pocket, keep the ones in the pocket and discard the new weights. At the end of the training iteration, the algorithm returns the solution in the pocket rather than the last solution.

The Learning Steps

Initialize the pocket weight vector, $ W_{pocket}$, to 0 or small random numbers, and use this weight vector as the initialized weight vector, $ W_0$ of the Perceptron Learning Algorithm.
For each training iteration, perform the following sub-steps:
1. Run the training step of the Perceptron Learning Algorithm to obtain the updated weight vector, $ W_t$, where $ t$ indicates the current iteration.
2. Evaluate $ W_t$ by comparing the number of misclassifications on the entire sample set with the number of misclassifications performed by $ W_{pocket}$.
3. If $ W_t$ is better than $ W_{pocket}$, replace $ W_{pocket}$ to $ W_t$.
Return $ W_{pocket}$ when the training iteration terminates.

The Pocket Learning Algorithm can be simply implemented by extending the Perceptron Learning Algorithm.

"""Pocket Classifier."""


import numpy as np

from typing import Any, Tuple


class Pocket:
    """The class keeps the best weights seen so far in the learning process.

    Parameters
    ----------
    number_of_attributes: int
        The number of attributes of the data set.

    Attributes
    ----------
    best_weights: list of float
        The list of the best weights seen so far.
    misclassify_count: int
        The number of misclassification corresponding to the best
        weights.
    """

    def __init__(self, number_of_attributes: int):
        self.best_weights = np.zeros(number_of_attributes + 1)

        # -1 means the class is initialized but does not have valid value
        self.misclassify_count = -1


class PocketClassifier:
    """Pocket Binary Classifier.

    Parameters
    ----------
    number_of_attributes: int
        The number of attributes of the data set.
    class_labels: tuple of the class labels
        The class labels can be anything as long as it has
        only two types of labels.

    Attributes
    ----------
    pocket: Pocket
        The pocket contains the best training result so far and the
        number of the misclassified sample according to the result
        in the pocket.
    weights: list of float
        The list of weights corresponding input attributes.
    misclassify_record: list of int
        The number of misclassification for each training sample.

    Methods
    -------
    train(samples: list[list], labels: list, max_iterator: int = 10)
        Train the perceptron learning algorithm with samples.
    classify(new_data: list[list]) -> list[int]
        Classify the input data.

    See Also
    --------
    See details at:
    Machine Learning Basics: Pocket Learning Algorithm and Basic Feature Engineering


    Examples
    --------
    Two dimensions list and each sample has four attributes
    >>> import pocket_classifier
    >>> samples = [[5.1, 3.5, 1.4, 0.2],
                   [4.9, 3.0, 1.4, 0.2],
                   [4.7, 3.2, 1.3, 0.2],
                   [4.6, 3.1, 1.5, 0.2],
                   [5.0, 3.6, 1.4, 0.2],
                   [5.4, 3.9, 1.7, 0.4],
                   [7.0, 3.2, 4.7, 1.4],
                   [6.4, 3.2, 4.5, 1.5],
                   [6.9, 3.1, 4.9, 1.5],
                   [5.5, 2.3, 4.0, 1.3],
                   [6.5, 2.8, 4.6, 1.5],
                   [5.7, 2.8, 4.5, 1.3]]
    Binary classes with class -1 or 1.
    >>> labels = [-1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1]
    >>> pocket_classifier = pocket_classifier.PocketClassifier(4, (-1, 1))
    >>> pocket_classifier.train(samples, labels)
    >>> new_data = [[6.3, 3.3, 4.7, 1.6], [4.6, 3.4, 1.4, 0.3]]
    Predict the class for the new_data
    >>> pocket_classifier.classify(new_data)
    [1, -1]
    """

    def __init__(self, number_of_attributes: int, class_labels: Tuple):
        # Initialize the Pocket class.
        self.pocket = Pocket(number_of_attributes)
        # Initialize the weights to zero.
        # The size is the number of attributes
        # plus the bias, i.e. x_0 * w_0.
        self.weights = np.zeros(number_of_attributes + 1)

        # Record of the number of misclassify for each training sample.
        self.misclassify_record: list[int] = []

        # Build the label map to map the original labels to numerical
        # labels. For example, ["a", "b"] -> {0: "a", 1: "b"}
        self._label_map = {1: class_labels[0], -1: class_labels[1]}
        self._reversed_label_map = {class_labels[0]: 1, class_labels[1]: -1}

    def _linear_combination(self, sample: list) -> Any:
        """Linear combination of sample and weights."""
        return np.inner(sample, self.weights[1:])

    def train(self, samples: list[list], labels: list, max_iterator: int = 10) -> None:
        """Train the model with samples.

        Parameters
        ----------
        samples: two dimensions list
            Training data set.
        labels: list of labels
            The class labels of the training data.
        max_iterator: int, optional
            The max iterator to stop the training process.
            The default is 10.
        """
        # Transfer the labels to numerical labels
        transferred_labels = [self._reversed_label_map[index] for index in labels]

        for _ in range(max_iterator):
            misclassifies = 0
            for sample, target in zip(samples, transferred_labels):
                linear_combination = self._linear_combination(sample)
                update = target - np.where(linear_combination >= 0.0, 1, -1)

                # use numpy.multiply to multiply element-wise
                self.weights[1:] += np.multiply(update, sample)
                self.weights[0] += update

                # record the number of misclassification
                misclassifies += int(update != 0.0)

            # Update the pocket is the result is better than the one
            # in the pocket.
            if (
                (self.pocket.misclassify_count == -1)
                or (self.pocket.misclassify_count > misclassifies)
                or (misclassifies == 0)
            ):

                self.pocket.best_weights = self.weights
                self.pocket.misclassify_count = misclassifies

            if misclassifies == 0:
                break

            self.misclassify_record.append(self.pocket.misclassify_count)

    def classify(self, new_data: list[list]) -> list[int]:
        """Classify the sample based on the trained weights.

        Parameters
        ----------
        new_data: two dimensions list
            New data to be classified.

        Return
        ------
        List of int
            The list of predicted class labels.
        """
        predicted_result = np.where(
            (self._linear_combination(new_data) + self.weights[0]) >= 0.0, 1, -1
        )

        return [self._label_map[item] for item in predicted_result]

(Complete source code can be found here)

Note that this algorithm is straightforward but not perfect. The results seem stochastic because the optimal result is the best result seen so far.

Apply the Pocket Learning Algorithm to the Japanese Credit Screening Data Set

It is always easy to illustrate an idea by an example. This section uses the Japanese Credit Screening Data Set from the University of California, Irvine. The data set has 125 samples and two classes (positive and negative) of people who were or were not granted credit. Each sample has 15 features; some features are missing in some examples. Besides, the types of features include categorical, real, and integer types. This example demonstrates how to train the Pocket Learning Algorithm to determine if an applicant will be approved for unseen data in the future using the Japanese Credit Screening Data Set.

However, most machine learning algorithms strongly assume that the data set is numerical. The Pocket Learning Algorithm is the same. For the Pocket Learning Algorithm to work on the Japanese Credit Screening Data Set, the non-numerical features need to be transformed into numerical type. This type of process is usually called feature engineering. The following section briefly introduces feature engineering.

Basic Feature Engineering

The quality of the data and the amount of useful information it consists of are the key points that determine how well the performance of a machine learning algorithm can learn. In other words, using raw data sets directly can negatively affect the performance of a learning process. Therefore, feature engineering is usually the first step in the machine learning process.

Training and Test Sets

Overfitting is a common pitfall of machine learning processes. It could happen when a learning process achieves zero in-sample error but a huge out-of-sample error on yet-unseen data. To avoid overfitting, a common practice is to hold part of the data as a test data set and only use the rest of the data to train the model. Once the learning model is trained, we use the test data set to verify the model’s performance.

In the previous article’s example, we manually separate the Iris Data Set into a training set and a test set by taking the first 10 samples from each class and aggregating them into a test set; the rest are the training set. This is not what we usually would do (not what we should do either). If a data set has been affected at any step in the learning process, its performance in accessing the outcome has been compromised. This issue is called data snooping. Manually manipulating the data set may result in data snooping bias. (Data snooping bias is not part of the topic of this article and will be discussed in future articles.)

The appropriate way to create a training set and a test set is that the test data set should be picked randomly and meet the following criteria:

Both the training and the test sets must reflect the original distribution.
The original data set must be randomly shuffled.

Scikit-learn, an open-source machine learning library, provides a convenient function for splitting data into training and test sets. The following example demonstrates the sci-kit split function on the Iris Data Set and the Perceptron Learning Algorithm.

"""An example of supervised learning uses the Iris data set.
https://archive.ics.uci.edu/ml/datasets/Iris
Attribute Information:
0. sepal length in cm
1. sepal width in cm
2. petal length in cm
3. petal width in cm
4. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica
"""

import urllib.request

# pandas is an open source library providing high-performance,
# easy-to-use data structures and data analysis tools. http://pandas.pydata.org/
import pandas as pd

# scikit-learn is a python machine learning library. train_test_split
# function splits a data set to a train and a test subsets.
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
from sklearn import model_selection

import perceptron_classifier

# Download Iris Data Set from
# http://archive.ics.uci.edu/ml/datasets/Iris
URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
urllib.request.urlretrieve(URL, "iris.data")
# use pandas" read_csv function to read iris.data into a python array.
# Note: the iris.data is headerless, so header is None.
IRIS_DATA = pd.read_csv("iris.data", header=None)

# Try only versicolor and virginica
LABELS = IRIS_DATA.iloc[50:150, 4].values
DATA = IRIS_DATA.iloc[50:150, [0, 2]].values

# Use scikit-learn's train_test_split function to separate the
# Iris Data Set to a training subset (75% of the data) and
# a test subst (25% of the data)
DATA_TRAIN, DATA_TEST, LABELS_TRAIN, LABELS_TEST = model_selection.train_test_split(
    DATA, LABELS, test_size=0.25, random_state=1000
)

perceptron_classifier = perceptron_classifier.PerceptronClassifier(
    number_of_attributes=2, class_labels=("Iris-versicolor", "Iris-virginica")
)

perceptron_classifier.train(DATA_TRAIN, LABELS_TRAIN, 100)
result = perceptron_classifier.classify(DATA_TEST)

misclassify = 0
for predict, answer in zip(result, LABELS_TEST):
    if predict != answer:
        misclassify += 1

print("Accuracy rate: %2.2f" % (100 * (len(result) - misclassify) / len(result)) + "%")

(The complete example can be found here)

Categorical Data

Once again, most machine learning algorithms only accept numerical data. To train a sample containing categorical data, we need to encode categorical data to numerical data. Encoding categorical data to numerical data can be distinguished between ordinal and nominal. An ordinal number is a number that can be sorted or ordered. In other words, ordinals indicate rank and position. For instance, education level could be ordinal: graduate school, college, and high school can be ranked by graduate school > college > high school. In contrast, a nominal number is only a numeric symbol for labeling. The values of numerals do not imply any order, quantity, or other measurement. For example, marital status: single, married, divorced, widowed. It does not make sense to say that being single is higher or larger (or another measurement) than being married.

Encoding Ordinal Data

To ensure a machine learning algorithm interprets the ordinal data correctly, we need to transfer categorical data to numerical data along with the sense of order or rank. Since encoding ordinal data with proper rank requires understanding the data, there is no easy way to convert the categorical data to numerical data automatically. Thus, we need to define the mapping manually based on our understanding of the data.

Take the education level as an example; the mapping could be graduate school = college + 1 = high school + 2.

Encoding Nominal Data

One common mistake in dealing with categorical data is treating nominal data as ordinal. Take marital status as an example; if we create a 1-to-1 mapping between marital status and its corresponding numerical number, the mapping may look like the following:

Single -> 0
Married -> 1
Divorced -> 2
Widowed -> 3

Although martial values do not apply any order or rank, a machine learning algorithm may assume widowed is higher or larger (or another measurement) than divorced. Therefore, the incorrect assumption could lead to a machine learning algorithm producing poor results. One approach to handle this scenario is one-hot encoding. The basic idea is to create dummy features for each unique value in the original categorical data. For example, marital status can be encoded in the following table.

Single	Married	Divorced	Widowed
1	0	0	0
0	1	0	0
0	0	1	0
0	0	0	1

Single -> (1, 0, 0, 0)
Married -> (0, 1, 0, 0)
Divorced -> (0, 0, 1, 0)
Widowed -> (0, 0, 0, 1)

This way, the machine learning algorithm treats the features as different labels instead of assuming they have rank or order.

The following example demonstrates how to encode categorical data.

import urllib.request

import pandas as pd

from sklearn import preprocessing


def one_hot_encoder(data: list) -> list:
    """Transfer categorical data to numerical data based on one hot encoding approach.

    Parameters
    ----------
    data: list of data. Any numerical type.
        One dimension list.

    Returns
    -------
    List of int
        The list of the encoded data based on one hot encoding approach.
    """
    # Since scikit-learn"s OneHotEncoder only accepts numerical data,
    # use LabelEncoder to transfer the categorical data to numerical
    # by using simple encoding approach. For example, t -> 0; f -> 1
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
    LABEL_ENCODER = preprocessing.LabelEncoder()
    numerical_data = LABEL_ENCODER.fit_transform(data)
    two_d_array = [[item] for item in numerical_data]

    # Use scikit-learn OneHotEncoder to encode the A9 feature
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
    encoder = preprocessing.OneHotEncoder()
    encoder.fit(two_d_array)
    return encoder.transform(two_d_array).toarray()


if __name__ == "__main__":
    # Download the Japanese Credit Data Set from
    # http://archive.ics.uci.edu/ml/datasets/Japanese+Credit+Screening
    URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
    urllib.request.urlretrieve(URL, "crx.data")

    # Use one-hot encoding to transfer the A9 attribute of the
    # Japanese Credit Data Set:
    # A9:   t, f.
    # The encoded output may look like:
    # t -> [0, 1]
    # f -> [1, 0]

    crx_data = pd.read_csv("crx.data", header=None)
    A9 = crx_data.iloc[:, 8].values
    A9_encoded = one_hot_encoder(A9)

    for index in range(len(A9_encoded)):
        print(str(A9[index]) + " -> " + str(A9_encoded[index]))

(The complete example can be found here)

Missing Features

The Japanese Credit Screening Data Set has missing features on some samples, which is common in real-world machine learning problems. Fortunately, a dataset with missing fields or attributes does not mean it is not useful. Several approaches can be used to fill in the empty gaps.

Remove them

This approach discards rows or columns containing missing values, as the title states. Removing is the most straightforward option but should be considered only if the data set is large. The reason is simple: the fewer samples to train, the harder it is to avoid overfitting. Besides, dropping may be a reasonable approach. For example, a missing feature in some examples may be just noise. In this case, removing the entire feature column may be effective.

Predict the missing values

The basic idea is to use existing data to predict the missing ones. The method includes defining a supervised learning problem, solving it, and using this sub-model to predict the missing data. This method makes it difficult to evaluate the performance of the sub-model, and the accuracy of the sub-model’s prediction affects the overall learning performance.

Fill up the missing values based on the known values

The first two options are either too difficult to achieve or impractical. One reasonable and practical approach is to input the missing data according to the other known values. This strategy replaces missing values by something that can be obtained from existing values such as mean, median, and frequency. The following example demonstrates how to input missing value by frequency and mean.

import collections
import urllib.request

import numpy as np
import pandas as pd

from typing import Any
from sklearn import impute


def imputer_by_most_frequent(missing_values: Any, data: list) -> list:
    """Input missing value by frequency, i.e., the value appeared most often.

    Parameters
    ----------
    missing_values: Any
        The missing value can be np.nan, "?", or whatever character
        which indicates missing value.

    data: list
        The list of the data.

    Returns
    -------
    List of numerical data
        The list of the data based on the most frequent approach.
    """
    # Find the value appeared most often by using Counter.
    most = collections.Counter(data).most_common(1)[0][0]
    complete_list = []
    for item in data:
        if item is missing_values:
            item = most
        complete_list.append(item)
    return complete_list


if __name__ == "__main__":
    # Download Japanese Credit Data Set from http://archive.ics.uci.edu/ml/datasets/Japanese+Credit+Screening
    URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
    urllib.request.urlretrieve(URL, "crx.data")

    # Use the input by most frequent approach to input the missing
    # values in A1:
    # A1:   b, a
    #
    # Use the input by mean number approach to input the missing
    # values in A2:
    # A2:   continuous

    crx_data = pd.read_csv("crx.data", header=None)
    # Since the Japanese Credit Data Set uses "?" to denote missing,
    # replace it to np.nan. scikit-learn"s Imputer only accepts np.nan
    # or integer, therefore, convert "?" to np.nan.
    # This transformation is for A2 which uses scikit-learn"s Imputer.
    # For A1 which uses imputer_by_most_frequent(), this transformation
    # is not necessary.
    crx_data.replace("?", np.nan, inplace=True)

    A1_no_missing = imputer_by_most_frequent(np.nan, crx_data.iloc[:, 0].values)
    print(A1_no_missing)

    # Use scikit-learn Imputer to input missing values by mean number.
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
    imputer = impute.SimpleImputer(missing_values=np.nan, strategy="mean")
    # Convert to two-dimension list, since Imputer only accepts
    # two dimensions list.
    A2_two_d = np.array([[item] for item in crx_data.iloc[:, 1].values])
    A2_no_missing = imputer.fit_transform(A2_two_d)
    print(A2_no_missing)

(The complete example can be found here)

Feature Scaling

The majority of machine learning algorithms accept whatever numerical data. However, most machine learning algorithms perform better results if features are on the same scale. A machine learning algorithm does not naturally have the sense of various scales on each feature. For example, assuming we have a data set with two features, the first feature ranges from 1 to 10, and the second feature ranges from 1 to 1000. When we apply the Perceptron Learning Algorithm to this data set, it is intuitive that the update function is heavily affected by the second feature. Therefore, it may result in a biased output. Typically, two approaches to bringing features onto the same scale are normalization and standardization.

Normalization

Normalization means adjusting values measured on different scales to a standard scale. Usually, normalization also refers to rescaling the features to a range of [0, 1] (but it can be bounded on whatever boundary we want). The following equation, min-max scaling, can achieve normalization:

$ x_{norm} = \frac{x – x_{min}}{x_{max} – x_{min}}$, where $ x$ is a sample, $ x_{min}$ is the smallest value in the feature column, and $ x_{max}$ is the largest value.

Although normalization is useful when we need values in a bounded interval, it does have a drawback: if a data set has outliers (which is normal), the normalization equation may be heavily affected by outliers, e.g., the $ x_{max}$ is huge, or $ x_{min}$ is tiny.

Standardization

In contrast to normalization bounds values in an interval, standardization centers the features at mean with standard deviation so that the features become the form of normal distribution. The following equation shows the process of standardization:

$ x_{std} = \frac{x – \mu}{\sigma}$ where $ \mu$ is the sample mean of a particular feature column and $ \sigma$ is the corresponding standard deviation.

This way, standardized data keeps useful information about outliers but makes a learning algorithm less sensitive to the outliers. However, standardization has a drawback too. Unlike normalization, standardization is not bounded.

Example

The following example concludes the article by demonstrating feature engineering methods to clean the source data and apply the Pocket Learning Algorithm in the Japanese Credit Screening Data Set.

import collections

import numpy as np
import pandas as pd
import seaborn as sns

from typing import Any
from urllib import request

from sklearn import impute
from sklearn import model_selection
from sklearn import preprocessing

import pocket_classifier


sns.set_theme()  # set the default seaborn theme, scaling, and color palette.


def imputer_by_most_frequent(missing_values: Any, data: list) -> list:
    """Input missing value by frequency, i.e., the value appeared
    most often.

    Parameters
    ----------
    missing_values: Any
        The missing value can be np.nan, "?", or whatever character
        which indicates missing value.
    data: list
        The list of the data.

    Returns
    -------
    List of numerical data
        The list of the data based on the most frequent approach.
    """
    most = collections.Counter(data).most_common(1)[0][0]
    complete_list = []
    for item in data:
        if item is missing_values:
            item = most
        complete_list.append(item)
    return complete_list


def one_hot_encoder(data: list) -> list:
    """Transfer categorical data to numerical data based on one hot
    encoding approach.

    Parameters
    ----------
    data: list of data. Any numerical type.
        One dimension list.

    Returns
    -------
    List of int
        The list of the encoded data based on one hot encoding approach.
    """
    LABEL_ENCODER = preprocessing.LabelEncoder()
    numerical_data = LABEL_ENCODER.fit_transform(data)
    two_d_array = [[item] for item in numerical_data]

    encoder = preprocessing.OneHotEncoder()
    encoder.fit(two_d_array)
    return encoder.transform(two_d_array).toarray()


if __name__ == "__main__":
    # Download Japanese Credit Data Set from
    # http://archive.ics.uci.edu/ml/datasets/Japanese+Credit+Screening
    URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data"
    request.urlretrieve(URL, "crx.data")
    # Use pandas.read_csv module to load adult data set
    # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
    crx_data = pd.read_csv("crx.data", header=None)
    crx_data.replace("?", np.nan, inplace=True)

    # Transfer the category data to numerical data and input
    # missing data:
    # A1: b, a. (missing)
    # A2: continuous. (missing) mean
    # A3: continuous.
    # A4: u, y, l, t. (missing) frequency
    # A5: g, p, gg. (missing) frequency
    # A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff. (missing) frequency
    # A7: v, h, bb, j, n, z, dd, ff, o. (missing) frequency
    # A8: continuous.
    # A9: t, f.
    # A10: t, f.
    # A11: continuous.
    # A12: t, f.
    # A13: g, p, s.
    # A14: continuous. (missing) mean
    # A15: continuous.
    # A16: +,- (class label)

    A1_no_missing = imputer_by_most_frequent(np.nan, crx_data.iloc[:, 0].values)

    A1_encoded = one_hot_encoder(A1_no_missing)

    imputer = impute.SimpleImputer(missing_values=np.nan, strategy="mean")

    A2_two_d = np.array([[item] for item in crx_data.iloc[:, 1].values])
    A2_no_missing = imputer.fit_transform(A2_two_d)

    A3 = crx_data.iloc[:, 2].values

    A4_no_missing = imputer_by_most_frequent(np.nan, crx_data.iloc[:, 3].values)
    A4_encoded = one_hot_encoder(A4_no_missing)

    A5_no_missing = imputer_by_most_frequent(np.nan, crx_data.iloc[:, 4].values)
    A5_encoded = one_hot_encoder(A5_no_missing)

    A6_no_missing = imputer_by_most_frequent(np.nan, crx_data.iloc[:, 5].values)
    A6_encoded = one_hot_encoder(A6_no_missing)

    A7_no_missing = imputer_by_most_frequent(np.nan, crx_data.iloc[:, 6].values)
    A7_encoded = one_hot_encoder(A7_no_missing)

    A8 = crx_data.iloc[:, 7].values

    A9_encoded = one_hot_encoder(crx_data.iloc[:, 8].values)

    A10_encoded = one_hot_encoder(crx_data.iloc[:, 9].values)

    A11 = crx_data.iloc[:, 10].values

    A12_encoded = one_hot_encoder(crx_data.iloc[:, 11].values)

    A13_encoded = one_hot_encoder(crx_data.iloc[:, 12].values)

    A14_two_d = np.array([[item] for item in crx_data.iloc[:, 13].values])
    A14_no_missing = imputer.fit_transform(A14_two_d)

    A15 = crx_data.iloc[:, 14].values

    # Aggregate all the encoded data together to a two-dimension set
    data = list()
    label = list()
    for index in range(690):
        temp = np.append(A1_encoded[index], A2_no_missing[index])
        temp = np.append(temp, A3[index])
        temp = np.append(temp, A4_encoded[index])
        temp = np.append(temp, A5_encoded[index])
        temp = np.append(temp, A6_encoded[index])
        temp = np.append(temp, A7_encoded[index])
        temp = np.append(temp, A8[index])
        temp = np.append(temp, A9_encoded[index])
        temp = np.append(temp, A10_encoded[index])
        temp = np.append(temp, A11[index])
        temp = np.append(temp, A12_encoded[index])
        temp = np.append(temp, A14_no_missing[index])
        temp = np.append(temp, A15[index])
        data.append(temp.tolist())
        label.append(crx_data[15][index])

    # Use scikit-learn"s MinMaxScaler to scale the training data set.
    # http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
    min_max_scaler = preprocessing.MinMaxScaler()
    data_minmax = min_max_scaler.fit_transform(data)

    features = len(data[0])

    # Use scikit-learn"s train_test_split function to separate
    # the Iris Data Set to a training subset (75% of the data)
    # and a test subst (25% of the data).
    DATA_TRAIN, DATA_TEST, LABELS_TRAIN, LABELS_TEST = model_selection.train_test_split(
        data_minmax, label, test_size=0.25, random_state=1000
    )

    classifier = pocket_classifier.PocketClassifier(features, ("+", "-"))
    classifier.train(DATA_TRAIN, LABELS_TRAIN, 100)

    result = classifier.classify(DATA_TEST)

    misclassify = 0
    for predict, answer in zip(result, LABELS_TEST):
        if predict != answer:
            misclassify += 1
    print(
        "Accuracy rate: %2.2f" % (100 * (len(result) - misclassify) / len(result)) + "%"
    )

(Source code can be found here)

Machine Learning Basics: Pocket Learning Algorithm and Basic Feature Engineering

Pocket Learning Algorithm

The Learning Steps

Apply the Pocket Learning Algorithm to the Japanese Credit Screening Data Set

Basic Feature Engineering

Training and Test Sets

Categorical Data

Encoding Ordinal Data

Encoding Nominal Data

Missing Features

Remove them

Predict the missing values

Fill up the missing values based on the known values

Feature Scaling

Normalization

Standardization

Example

1 thought on “Machine Learning Basics: Pocket Learning Algorithm and Basic Feature Engineering”

Leave a Reply Cancel reply