Machine Learning Basics and Perceptron Learning Algorithm

[Updated: February 21, 2025]

Table of Contents

AI is probably the hottest topic in the last few years. One reason it has become so promising is the success of machine learning and the significant improvement in computing power. But do you know how machine learning works? Can a machine really learn? It is easy to overlook the fundamentals of machine learning when our minds are filled with fancy machine learning ideas and applications. This article walks you through the core concept of machine learning by implementing the most straightforward learning algorithm. Hopefully, experienced readers will benefit from reviewing the fundamental concepts of machine learning, and newbies will learn the foundations of machine learning from this article.

What is Machine Learning? What can it do?

Machine learning is a subfield of Artificial Intelligence. As its name implies, it tries to help machines learn something from data. Recall how we learned to recognize an animal, for instance, a dog, when we were little kids. We probably started by reading a book that contained dog pictures (i.e., the data), and then we knew it was a dog when we encountered one afterward. That is the essence of machine learning: learning from data.

It is always easy to explain a concept with an example. This Kaggle Competition, Predict Grant Applications, which I participated in several years ago, awarded participants who could provide a model to help them simplify and accelerate the review process to reduce cost.

The traditional way to determine if an applicant qualifies for the grant is for academic reviewers to review applicants’ documents and make decisions. This process takes quite a bit of time, especially since this school claims that the majority of applications are rejected. The valuable academic reviewers’ time is wasted.

Here is where machine learning can help. A machine learning approach can preliminary filter the applications it predicts with a high chance of succeeding. Therefore, the review committee can spend their time on those candidates.

Therefore, the main goal of machine learning is to deduce the future or to make decisions by training mathematical models with context-related data.

How does it work?

Machine learning starts with data. Before answering this question, let’s first look at the data. The following table depicts an excerpt of the Predict Grant Applications Data Set. It demonstrates the common notations used in machine learning, such as samples, features, and labels.

So, how does machine learning work? First, everybody would agree that some attributes have a heavier weight than others in the Predict Grant Applications example. During the review, the academic reviewers may sense that some attributes are more important than others. For instance, the number of journal articles is probably more important than the application’s Person ID. This fact implies that a pattern of a successful application exists. Second, even if we know the pattern exists, it is impossible (or very hard) to write down a math formula to decide who would be granted. Otherwise, we would just write a program using the math equation to solve this problem. Third, this school has a historical record of this program, i.e., they have data. These matters conclude the nature of using machine learning to solve a problem if the problem has the following characteristics:

A pattern exists. (In this Predict Grant Applications example, the reviewers feel that some attributes are more important than others.)
No easy way to solve this problem by a math equation. (This is probably why we need a machine learning approach. If we can find one, a simple math solution is always better than a machine learning solution.)
We have data. (in this example, the University of Melbourne has historical records of applications; without data, there is nothing machine learning can do)

Once a problem meets these three points, machine learning is ready to solve this problem. A basic machine learning approach has the following components.

Target function: an unknown ideal function. It exists, but unknown
Hypothesis set, which contains all possible functions. Examples include Perceptron, Neural Network, Support Vector Machine, etc.
Learning algorithms to pick the optimal function from the hypothesis set based on the data. For instance, the Perceptron Learning Algorithm, backpropagation, quadratic programming, etc.
Learning model: A learning model is usually the combination of a hypothesis set and a learning algorithm.

There are many tools (learning models) for working on machine learning problems. Therefore, the first step is to pick one up. Each learning model has its strengths and weaknesses; some may be good at a certain problem, and some may not. The choice of an appropriate learning model is outside the topic of this article, but we can always pick one as a starting point.

As soon as a learning model is picked, we can start training the model by feeding data. Take the Predict Grant Application as an example again; this process starts with random factors, i.e., the weights of each attribute. Then, turn these factors to make them more and more aligned with how the academic reviewers have reviewed the applications before by feeding the historical records until they can ultimately predict how reviewers determine if applicants are granted. This process is called training model.

After the model is trained, it can predict the results by feeding it new data. In the Predict Grant Application example, the new data could be the current year’s applications. The university can use this model to filter out the applications that are unlikely to succeed so the review committee can focus on the applications that have a high chance of succeeding.

A benefit of the machine learning approach is that it can be generalized to much larger datasets in many dimensions. It is trivial when the problem is in a low number of dimensions, i.e., the number of attributes. However, the reality is that the data set provided by the University of Melbourne has 249 features, making this problem much more complicated (and almost impossible to pin down to a simple math formula). Fortunately, the power of machine learning is that it can be straightforwardly applied and evaluated in the case of data with many more features.

The following section uses probably the simplest learning model to demonstrate the basic workflow of the machine learning approach.

A Simple Example: Perceptron Learning Algorithm

This example uses a classic data set, the Iris Data Set, which contains three classes of 50 instances each, each referring to a type of iris plant. The goal of this example is to use a machine learning approach to build a program to classify the type of iris flowers.

Problem Setup

The Iris Data Set contains three classes (generally called labels): Iris Setosa, Iris Versicolour, and Iris Virginica.

Besides classes, each instance has the following attributes:

Sepal length in cm
Sepal width in cm
Petal length in cm
Petal width in cm

Each instance, a.k.a., sample, of the Iris Data Set looks like (5.1, 3.5, 1.4, 0.2, Iris-setosa)

The learning model this example chooses is Perceptron and the Perceptron Learning Algorithm.

Perceptron Learning Algorithm

Perceptron Learning Algorithm is the simplest form of artificial neural network, i.e., single-layer perceptron. A perceptron is an artificial neuron conceived as a model of biological neurons, which are the elementary units in an artificial neural network. An artificial neuron is a linear combination of certain (one or more) inputs and a corresponding weight vector. That is to say that a perceptron has the following definitions:

Given a data set Ɗ which contains training data set X and output labels Y, and can be formed as a matrix.

Each $ x_i$ is one sample of $ X$, $ x_i = \left\{ x_{i1}, x_{i2}, …, x_{in} \right\} \forall i \in N, i = 1, 2,…, n$

Each $ y_i$ is the actual label and has a binary value: $ \left\{ -1, 1 \right\}$.

$ W$ is the weight vector. Each $ x_{ij}$ has a corresponding weight $ w_j$

Since a perceptron is a linear combination of $ X$ and $ W$, it can be denoted as

$ \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & x_{ij} & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix}$

For each neuron $ y_j$, the output is $ y_j = \phi \left( \displaystyle\sum_{j=1}^{n}x_{ij}w_j \right)$, where the $ \phi$ is the transfer function:

$ \phi = \begin{cases} 1 & \quad \text{if } \sum_{j=1}^{n}x_{ij}w_j>\theta \text{ where } \theta \text{ is the predefined threshold} \\ -1 & \quad \text{otherwise} \end{cases} $

For the sake of simplicity, $ \theta$ can be treated as $ x_{i0}w_0$, where $ x_{i0} = 1$

Therefore, the linear combination of $ X$ and $ W$ can be rewritten as

$ \begin{bmatrix} x_{01} & \cdots & x_{0n} \\ \vdots & x_{ij} & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} w_0 \\ \vdots \\ w_n \end{bmatrix}$

The output of jth neuron is $ y_j = \phi \left( \displaystyle\sum_{j=0}^{n}x_{ij}w_j \right)$, where the $ \phi$ is the transfer function: $ \phi = \begin{cases} 1 & \quad \text{if } \sum_{j=0}^{n}x_{ij}w_j>0 \\ -1 & \quad \text{otherwise} \end{cases} $

The Learning Steps

Given the definition of Perception, the Perceptron Learning Algorithms works by the following steps:

Initialize the weights to 0 or small random numbers.
For each training sample, $ x_i$, perform the following sub steps:
1. Compute $ \phi$, the linear combination of $ X$ and $ W$, to get the predicted output, class label $ p_j$.
2. Update the weights $ W$
3. Record the number of misclassifications
If the number of misclassifications is not zero after training the whole set, $ X$, repeat step 2 and start from the beginning of the training set, i.e., start from $ x_{0j}$. Repeat this step until the number of misclassifications is 0.

Note:

The output of 2.1 is the class predicted by the $ \phi$ function.

In the step 2.2, the update of each weight, $ w_j$, of $ W$ follows the rule:

$ w_j \left( t + 1\right) = w_j \left( t \right) + \Delta w_j = w_j \left( t \right) + \left( p_j – y_j \right) x_{ij}$, where $ t$ indicatees the step: $ t$ means current step, and $ t+1$ means next step. Therefore, $ w_j\left( t \right)$ indicates the current weight and $ w_j\left( t+1 \right)$ indicates the weight after updated.

If no misclassify, $ \Delta w_j = \left( -1 – \left( -1 \right) \right) x_{ij} = 0$ or $ \Delta w_j = \left( 1 – 1 \right) x_{ij} = 0$. In this case, $ w_j \left( t + 1 \right) = w_j \left( t \right)$. No update.

If misclassify, $ \Delta w_j = \left( -1 – 1 \right) x_{ij} = -2 x_{ij}$ or $ \Delta w_j = \left( 1 – \left( -1 \right) \right) x_{ij} = 2 x_{ij}$. In this case, $ w_j \left( t + 1 \right) = w_j \left( t \right) + 2 x_{ij}$ or $ w_j \left( t + 1 \right) = w_j \left( t \right) – 2 x_{ij}$. Weight updates.

In step 2.3, the convergence of the Perceptron Learning Algorithm (PLA) is only guaranteed if the two classes are linearly separable. If they are not, the PLA never stops. One simple modification is the Pocket Learning Algorithm, which is discussed in Machine Learning Basics: Pocket Learning Algorithm and Basic Feature Engineering.

The Perceptron Learning Algorithm can be simply implemented as follows:

"""A Perceptron Classifier."""

import numpy as np

from typing import Any, Tuple


class PerceptronClassifier:
    """Perceptron Binary Classifier uses Perceptron Learning Algorithm.

    Parameters
    ----------
    number_of_attributes: int
        The number of attributes of the data set.
    class_labels: tuple of the class labels
        The class labels can be anything as long as it has only two
        types of labels.

    Attributes
    ----------
    weights: list of float
        The list of weights corresponding to the input attributes.
    misclassify_record: list of int
        The number of misclassification for each training.

    Methods
    -------
    train(samples: list[list], labels: list, max_iterator: int = 10)
        Train the perceptron learning algorithm with samples.
    classify(new_data: list[list]) -> list[int]
        Classify the input data.

    See Also
    --------
    See details at:
    Machine Learning Basics and Perceptron Learning Algorithm


    Examples
    --------
    Two dimensions list and each sample has four attributes
    >>> import perceptron_classifier
    >>> samples = [[5.1, 3.5, 1.4, 0.2],
                   [4.9, 3.0, 1.4, 0.2],
                   [4.7, 3.2, 1.3, 0.2],
                   [4.6, 3.1, 1.5, 0.2],
                   [5.0, 3.6, 1.4, 0.2],
                   [5.4, 3.9, 1.7, 0.4],
                   [7.0, 3.2, 4.7, 1.4],
                   [6.4, 3.2, 4.5, 1.5],
                   [6.9, 3.1, 4.9, 1.5],
                   [5.5, 2.3, 4.0, 1.3],
                   [6.5, 2.8, 4.6, 1.5],
                   [5.7, 2.8, 4.5, 1.3]]
    Binary classes with class -1 or 1.
    >>> labels = [-1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1]
    >>> perceptron_classifier = perceptron_classifier.PerceptronClassifier(4, (-1, 1))
    >>> perceptron_classifier.train(samples, labels)
    >>> new_data = [[6.3, 3.3, 4.7, 1.6], [4.6, 3.4, 1.4, 0.3]]
    Predict the class for the new_data
    >>> perceptron_classifier.classify(new_data)
    [1, -1]
    """

    def __init__(self, number_of_attributes: int, class_labels: Tuple):
        # Initialize the weights to all zero.
        # The size is the number of attributes plus the bias,
        # i.e., x_0 * w_0.
        self.weights = np.zeros(number_of_attributes + 1)

        # Record of the number of misclassify for each training sample
        self.misclassify_record: list[int] = []

        # Build the label map to map the original labels to numerical
        # labels. For example, ["a", "b"] -> {0: "a", 1: "b"}
        self._label_map = {1: class_labels[0], -1: class_labels[1]}
        self._reversed_label_map = {class_labels[0]: 1, class_labels[1]: -1}

    def _linear_combination(self, sample: list) -> Any:
        """Linear combination of sample and weights."""
        return np.inner(sample, self.weights[1:])

    def train(self, samples: list[list], labels: list, max_iterator: int = 10) -> None:
        """Train the model with samples.

        Parameters
        ----------
        samples: two dimensions list
            The training data set.
        labels: list of labels
            The class labels of the training data.
        max_iterator: int, optional
            The max iterator to stop the training process in case the
            training data is not converged. The default is 10.
        """
        # Transfer the labels to numerical labels
        transferred_labels = [self._reversed_label_map[index] for index in labels]

        for _ in range(max_iterator):
            misclassifies = 0
            for sample, target in zip(samples, transferred_labels):
                linear_combination = self._linear_combination(sample)
                update = target - np.where(linear_combination >= 0.0, 1, -1)

                # use numpy.multiply to multiply element-wise
                self.weights[1:] += np.multiply(update, sample)
                self.weights[0] += update

                # record the number of misclassification
                misclassifies += int(update != 0.0)

            if misclassifies == 0:
                break
            self.misclassify_record.append(misclassifies)

    def classify(self, new_data: list[list]) -> list[int]:
        """Classify the sample based on the trained weights.

        Parameters
        ----------
        new_data : two dimensions list
            New data to be classified.

        Returns
        -------
        list[int]
            The list of predicted class labels.
        """
        predicted_result = np.where(
            (self._linear_combination(new_data) + self.weights[0]) >= 0.0, 1, -1
        )

        return [self._label_map[item] for item in predicted_result]

(Source code can be found here)

Apply Perceptron Learning Algorithm onto Iris Data Set

Normally, the first step in applying a machine learning algorithm to a data set is transforming it into something or a format the algorithm can recognize. This process may involve normalization, dimension reduction, and feature engineering. For example, most machine learning algorithms only accept numerical data. Therefore, a data set needs to be transferred to a numerical format.

In the Iris Data Set, the attributes, sepal length, sepal width, petal length, and petal width, are numerical values, but the class labels are not. Therefore, in the PerceptronClassifier class, the train function transfers the class labels to a numerical format. A simple way is to use numbers to indicate these labels: $ \{setosa, versicolour, virginica\} \rightarrow \{0,1,2\} $, which means 0 indicates Setosa, 1 implies Versicolour, and 2 means Virginica. Then, the Iris Data Set can be viewed in the form below to feed into the Perceptron Learning Algorithm.

$ X = \left\{ x_0, x_1, x_2, x_3, x_4 \right\} = \left\{ 1, sepal-length, sepal-width, petal-legnth, petal-width \right\}$

$ Y = \left\{0,1,2 \right\}$

Perceptron is a binary classifier. However, the Iris Data Set has three labels. There are two common ways to deal with multiclass problems: one-vs-all and one-vs-one. For this section, we use a simplified one-vs-one strategy to determine the types of iris plants.

The one-vs-one approach trains a model for each pair of classes and determines the correct class by a majority vote. For example, the Iris Data Set has three classes. That means all the combinations of a pair of the three classes. That means $ \{\{setosa, versicolour\}, \{setosa, virginica\}, \{versicolour, virginica\}\}$

Besides, machine learning is not restricted to using all the features in the data set. Instead, only important features are necessary. Here, we only consider the two features: the Sepal width and the Petal width. In fact, choosing the right features is so important that there is a subject called Feature Engineering to deal with this problem.

import urllib.request
import numpy as np
import pandas as pd

import perceptron_classifier

# Download Iris Data Set from http://archive.ics.uci.edu/ml/datasets/Iris
URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
urllib.request.urlretrieve(URL, "iris.data")
# Use pandas.read_csv module to load iris data set
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
IRIS_DATA = pd.read_csv("iris.data", header=None)

# Prepare the training data and test data
# The original Iris Data Set has 150 samples and 50 samples for each class.
# This example takes first 40 samples of each class as training data,
# and the other 10 samples of each class as testing data for verification.
# 0 ~ 39: setosa training set
# 40 ~ 49: setosa testing set
# 50 ~ 89 versicolor training set
# 90 ~ 99: versicolor testing set
# 100 ~ 139: virginica training set
# 140 ~ 149: virginica testing set
# Use pandas iloc to select samples by position and return an one-dimension array
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html#pandas.DataFrame.iloc
SETOSA_LABEL = IRIS_DATA.iloc[0:40, 4].values
VERSICOLOR_LABEL = IRIS_DATA.iloc[50:90, 4].values
VIRGINICA_LABEL = IRIS_DATA.iloc[100:140, 4].values

SETOSA_VERSICOLOR_TRAINING_LABEL = np.append(SETOSA_LABEL, VERSICOLOR_LABEL)
SETOSA_VIRGINICA_TRAINING_LABEL = np.append(SETOSA_LABEL, VIRGINICA_LABEL)
VERSICOLOR_VIRGINICA_TRAINING_LABEL = np.append(VERSICOLOR_LABEL, VIRGINICA_LABEL)

# In this example, it uses only Sepal width and Petal width to train.
SETOSA_DATA = IRIS_DATA.iloc[0:40, [1, 3]].values
VERSICOLOR_DATA = IRIS_DATA.iloc[50:90, [1, 3]].values
VIRGINICA_DATA = IRIS_DATA.iloc[100:140, [1, 3]].values

# Use one-vs-one strategy to train three classes data set, so we need
# three binary classifiers:
# setosa-versicolor, setosa-viginica, and versicolor-viginica
SETOSA_VERSICOLOR_TRAINING_DATA = np.append(SETOSA_DATA, VERSICOLOR_DATA, axis=0)

SETOSA_VIRGINICA_TRAINING_DATA = np.append(SETOSA_DATA, VIRGINICA_DATA, axis=0)

VERSICOLOR_VIRGINICA_TRAINING_DATA = np.append(VERSICOLOR_DATA, VIRGINICA_DATA, axis=0)

# Prepare test data set. Use only Sepal width and Petal width as well.
SETOSA_TEST = IRIS_DATA.iloc[40:50, [1, 3]].values
VERSICOLOR_TEST = IRIS_DATA.iloc[90:100, [1, 3]].values
VIRGINICA_TEST = IRIS_DATA.iloc[140:150, [1, 3]].values
TEST = np.append(SETOSA_TEST, VERSICOLOR_TEST, axis=0)
TEST = np.append(TEST, VIRGINICA_TEST, axis=0)

# Prepare the target of test data to verify the prediction
SETOSA_VERIFY = IRIS_DATA.iloc[40:50, 4].values
VERSICOLOR_VERIFY = IRIS_DATA.iloc[90:100, 4].values
VIRGINICA_VERIFY = IRIS_DATA.iloc[140:150, 4].values
VERIFY = np.append(SETOSA_VERIFY, VERSICOLOR_VERIFY)
VERIFY = np.append(VERIFY, VIRGINICA_VERIFY)

# Define a setosa-versicolor Perceptron() with 2 attributes
perceptron_setosa_versicolor = perceptron_classifier.PerceptronClassifier(
    number_of_attributes=2, class_labels=("Iris-setosa", "Iris-versicolor")
)

# Train the model
perceptron_setosa_versicolor.train(
    SETOSA_VERSICOLOR_TRAINING_DATA, SETOSA_VERSICOLOR_TRAINING_LABEL
)

# Define a setosa-virginica Perceptron() with 2 attributes
perceptron_setosa_virginica = perceptron_classifier.PerceptronClassifier(
    number_of_attributes=2, class_labels=("Iris-setosa", "Iris-virginica")
)

# Train the model
perceptron_setosa_virginica.train(
    SETOSA_VIRGINICA_TRAINING_DATA, SETOSA_VIRGINICA_TRAINING_LABEL
)

# Define a versicolor-virginica Perceptron() with 2 attributes
perceptron_versicolor_virginica = perceptron_classifier.PerceptronClassifier(
    number_of_attributes=2, class_labels=("Iris-versicolor", "Iris-virginica")
)

# Train the model
perceptron_versicolor_virginica.train(
    VERSICOLOR_VIRGINICA_TRAINING_DATA, VERSICOLOR_VIRGINICA_TRAINING_LABEL
)

# Run three binary classifiers
predict_target_1 = perceptron_setosa_versicolor.classify(TEST)
predict_target_2 = perceptron_setosa_virginica.classify(TEST)
predict_target_3 = perceptron_versicolor_virginica.classify(TEST)

overall_predict_result = []
for item in zip(predict_target_1, predict_target_2, predict_target_3):
    unique, counts = np.unique(item, return_counts=True)
    temp_result = zip(unique, counts)
    # Sort by values and return the class that has majority votes
    overall_predict_result.append(
        sorted(temp_result, reverse=True, key=lambda tup: tup[1])[0][0]
    )
    # The result should look like:
    # [("Iris-setosa", 2), ("Iris-versicolor", 1)]
    # [("Iris-setosa", 2), ("Iris-versicolor", 1)]
    # [("Iris-setosa", 2), ("Iris-versicolor", 1)]
    # [("Iris-setosa", 2), ("Iris-versicolor", 1)]
    # [("Iris-setosa", 2), ("Iris-versicolor", 1)]
    # [("Iris-setosa", 2), ("Iris-versicolor", 1)]
    # [("Iris-setosa", 2), ("Iris-versicolor", 1)]
    # [("Iris-setosa", 2), ("Iris-versicolor", 1)]
    # [("Iris-setosa", 2), ("Iris-versicolor", 1)]
    # [("Iris-setosa", 2), ("Iris-versicolor", 1)]
    # [("Iris-versicolor", 2), ("Iris-virginica", 1)]
    # [("Iris-versicolor", 2), ("Iris-virginica", 1)]
    # [("Iris-versicolor", 2), ("Iris-virginica", 1)]
    # [("Iris-versicolor", 2), ("Iris-virginica", 1)]
    # [("Iris-versicolor", 2), ("Iris-virginica", 1)]
    # [("Iris-versicolor", 2), ("Iris-virginica", 1)]
    # [("Iris-versicolor", 2), ("Iris-virginica", 1)]
    # [("Iris-versicolor", 2), ("Iris-virginica", 1)]
    # [("Iris-versicolor", 2), ("Iris-virginica", 1)]
    # [("Iris-versicolor", 2), ("Iris-virginica", 1)]
    # [("Iris-virginica", 2), ("Iris-versicolor", 1)]
    # [("Iris-virginica", 2), ("Iris-versicolor", 1)]
    # [("Iris-virginica", 2), ("Iris-versicolor", 1)]
    # [("Iris-virginica", 2), ("Iris-versicolor", 1)]
    # [("Iris-virginica", 2), ("Iris-versicolor", 1)]
    # [("Iris-virginica", 2), ("Iris-versicolor", 1)]
    # [("Iris-virginica", 2), ("Iris-versicolor", 1)]
    # [("Iris-virginica", 2), ("Iris-versicolor", 1)]
    # [("Iris-virginica", 2), ("Iris-versicolor", 1)]
    # [("Iris-virginica", 2), ("Iris-versicolor", 1)]

# Verify the results
misclassified = 0
for predict, verify in zip(overall_predict_result, VERIFY):
    if predict != verify:
        misclassified += 1
print("The number of misclassified: " + str(misclassified))

(Source code can be found here)

(The complete implementation of the Perceptron Learning Algorithm can be found here)

Is Learning Feasible?

Many people may see machine learning as magic. It is not. Machine learning is all about math, especially probability and statistics.

Learning vs. Memorizing

To show if learning is feasible, we need to define what learning is. In the section, A Simple Example, the Perceptron Learning Algorithm eventually successfully classified all the iris flowers, but is it learning? Or is it just memorizing? To answer these questions, we define in-sample error and out-of-sample error. The in-sample error means the error rate within the sample. For example, in the end, the Perceptron learning model can classify all the iris flowers in the Iris Data Set. Its in-sample error is 0, i.e., no error at all. In contrast to in-sample error, the out-of-sample error means the error rate outside the sample. In other words, it indicates the number of errors when the learning model sees new data. Take the same example: if we feed new data to the Perceptron learning model, the misclassification rate is its out-of-sample error. Therefore, to say a learning model is really learning, the in-sample error has to be small, and the out-of-sample error has to be small.

So, is Learning Feasible?

The simple answer is yes: the learning is feasible in the mathematical sense.

In probability and statistics, there are two important theories that can briefly show that learning is feasible: the Central Limit Theorem and the Law of Large Numbers. The Central limit theorem states that the distribution of the average of a large number of independent and identically distributed variables will be approximately normal distribution, regardless of the underlying distribution. Law of Large Numbers describes that as more samples are collected, the average of the results should be close to the expected value and will tend to become closer as more trials are performed. The mathematical theory guarantees that a sample proportion or a difference in sample proportions will follow something that resembles a normal distribution as long as 1. Observations in the sample are independent. 2. The sample is large enough. In an oversimplified sense, the learning results we see in the sample should be seen in out-of-sample data. Therefore, learning is feasible.

Of course, this statement is oversimplified. The details of theories that support the feasibility of machine learning are not part of the topic of this article. There are many resources to address this topic. One resource I recommended is the book Learning from Data, especially chapter 2.

In short, to make learning feasible, the learning model has to achieve:

In-sample error is small
Out-of-sample error is close to the in-sample error

How do we know if we learn well?

The conclusion in the previous section addresses that if the in-sample error is small and the out-of-sample error is close to the in-sample error, then the learning model learns something. What does it mean the in-sample error is small and the out-sample error is close to the in-sample error? How small should the in-sample error be, and how close could the out-of-sample error be? Of course, the learning model learns perfectly if both in-sample and out-of-sample errors are zero. Unfortunately, most real-world problems are not like this. In fact, the machine learning approach is not about finding the ideal target function; instead, the machine learning approach looks for a hypothesis function $ f$ which approximates the target function $ g$. (the target function is always unknown; otherwise, we would not bother to use a machine learning approach.) In other words, the machine learning approach looks for a good enough solution $ f$, which is close enough to $ g$. What does good enough mean? To quantify how well $ f$ is close to $ g$, we need a way to define the distance between $ f$ and $ g$. Usually, it is called an error measure or error function (the error is also referred to as cost or risk).

Error Measure

In general, we define a non-negative error measure by taking two arguments, expected and predicted output, and computing a total error value over the whole dataset.

$ Error = E \left( h, g \right)$, where $ h$ is a hypothesis and $ g$ is the target function. $ E \left( h, g \right)$ is based on the errors on individual input $ x_i$. Therefore, we can define a pointwise error measure $ e \left( h \left( x_i \right), g \left( x_i \right) \right)$. The overall error, then, is the average value of this pointwise error. $ Error_{overall} = \frac{1}{n} \sum_{i=1}^{n} | h \left( x_i \right) – g \left( x_i \right) |$

The choice of an error measure affects the choice of the learning model. Even if the data set and the target function are the same, the meaning of error measure may vary depending on different problems — for example, an email spam classifier. There are two error outcomes of a learning model for this email spam problem: false positive and false negative. The former means an email is spam, but the learning model determines it is not; the latter indicates that the learning model says an email is spam, but it is not. Imagine two scenarios: our personal email account and a working email account.

If the learning model fails to filter out spam emails in the personal email account case, it is probably fine—just annoying. On the other hand, if the learning model filters out some email that is not spam, for example, an email from friends or a credit card company, it is probably fine, too. The case can be measured in the table below.

	An email is a spam	An email is not spam
Classify an email as spam	0	1
Classify an email as not spam	1	0

If the learning model classifies an email correctly, the cost of this error is 0. Otherwise, it is 1.

In the other case: working email account. Similar to the personal account example, if a learning model fails to filter out spam, it is annoying but fine. However, if the learning model classifies the emails from our boss as spam, and leads us to miss these emails, then it is probably not fine. Therefore, in this case, the cost of false negative has heavier weight than the previous example.

	An email is a spam	An email is not spam
Classify an email as spam	0	10
Classify an email as not spam	1	0

Again, the error function (or cost function, risk function) really depends on different problems and should be defined by customers.

Overfitting and Underfitting

The conclusion of the section, Is Learning Feasible, shows that to make learning feasible, the learning model has to achieve that the in-sample error is small and the out-of-sample error is close to the in-sample error. Normally, a training set represents the global distribution, but it is not possible to contain all possible elements. Therefore, when training a model, it is necessary to try to fit the training data, i.e., keep the in-sample error small, but it is also necessary to try to keep the model able to generalize when the unseen input is presented, i.e., keep the out-of-sample error small. Unfortunately, this ideal condition is not easy to find, and it is important to be careful of these two phenomena: overfitting and underfitting.

The figure shows a normal classification of two classes of data.

Underfitting means that the learning model is not able to capture the pattern shown by the training data set. The picture below demonstrates the case of underfitting.

Underfitting is usually easy to observe because the learning model does not fit the in-sample data well when it occurs. In other words, when underfitting occurs, the in-sample error is high.

Overfitting is the opposite of underfitting. It means that the learning model fits the training data very well but too well to fit the out-of-sample data. That is to say, an overfitting learning model loses its generalization, so when the unknown input is presented, the corresponding prediction error can be very high.

Different from underfitting, overfitting is difficult to notice or prevent because the in-sample error of an overfitting learning model is usually very low. We may think we trained this model very well until new data comes and the out-of-sample error is high. Then, we realize that the model is overfitting.

Overfitting usually happens when the learning model is more complex than is necessary to represent the target function. In short, the overfitting issue can be summarized as follows:

The number of data points increases, and the chance of overfitting decreases.
If the data set has much noise, the chance of overfitting is high.
The more complex the model is, the easier overfitting happens.

Types of Learning

Based on the type of data and problems we want to solve, we can roughly categorize learning problems into three groups: supervised learning, unsupervised learning, and reinforcement learning.

Supervised Learning

Supervised Learning is probably the most well-studied learning problem. Essentially, in supervised learning, each sample contains the input data and explicit output, i.e., correct output. In other words, supervised learning can be formed as (input, correct output).

The Iris Data Set is a typical supervised learning model. It has five input attributes: sepal length, sepal width, petal length, petal width, and correct output, which are types of the iris plant. We can visualize the supervised learning by plotting the Iris Data Set. For the sake of simplicity, we plot a 2-D picture by taking sepal length and petal length as input and using different colors and symbols to indicate each class.

(The sample code can be found here)

The main goal of supervised learning is to learn a model from data with correct labels to make predictions on unseen or future data. Common supervised learning problems include Classification for predicting class labels and regression for predicting continuous outcomes, such as spam detection, pattern detection, and natural language processing.

Unsupervised Learning

In contrast to supervised learning, the unsupervised data set does not contain correct output and can be formed as (input, ?).

If we do not plot the type as different colors, the Iris Data Set becomes an example of unsupervised learning.

(The sample code can be found here)

In many cases, a data set only has input but does not contain a correct output, like the picture above. Then, how can we learn something from unsupervised learning problems? Fortunately, although there is no correct output, learning something from the input is possible. For example, in the picture above, each ellipse represents a cluster, and all the points inside its area can be labeled similarly. (of course, we knew the Iris Data Set has three classes; here, we just pretend we did not know that Iris Data Set has three classes). Using unsupervised learning techniques, we can explore the structure of a data set to extract meaningful information without having any prior knowledge of their group. Unsupervised learning problems do not come with labels; they deal with problems like discovering hidden structures with unsupervised learning, finding subgroups with clustering, and dimensionality reduction for data compression.

Reinforcement Learning

Different from supervised learning, which has an output, and unsupervised learning, which contains no correct output, the reinforcement learning data set contains input and some grade of output and can be formed as (input, some output with grade).

One good example of reinforcement learning is playing games such as chess and Go. The next step is determined based on the feedback of the current state. Once the decision to make the new step is made, a grade for this step is observed. The goal of reinforcement learning is to find the sequence of the most useful actions so the reinforcement learning model can always make the best decision.

Online vs. Offline

Learning problems can also be viewed by the way models are trained.

Online Learning

The data set is given to the algorithm one example at a time. Online learning happens when we have streaming data that the algorithm has to process ‘’on the run’’.

Offline Learning

Instead, give the algorithm one example at a time; we have data to train the algorithm in the beginning. The Iris Data Set example is offline learning.

Both online and offline learning can be applied to supervised, unsupervised, and reinforcement learning.

Challenges in Machine Learning

With the development of machine learning (ML) and artificial intelligence (AI), many AI/ML-powered tools and programs have shown unlimited potential in many fields. However, machine learning is not the silver bullet that can solve every problem. Instead, there are many significant weaknesses that many researchers are eager to solve. The following are some challenges the machine learning industry faces today.

First, there is no easy way to transfer a learned model from one learning problem to another. As humans, when we are good at one thing, say playing Go, chances are we are also good at similar games such as playing chase, bridge, or, at least, easy to learn. Unfortunately, machine learning needs to start over the learning process for every problem, even if the problem is similar to others. For example, the Perceptron learning model we trained to classify iris plants in the previous section can only classify iris plants – nothing more. If we want to classify different flowers, we must train a new learning model from scratch.

The second challenge is that there is no smart way to label data. Machine learning can do many things and learn from all kinds of data. Nevertheless, it does not know how to label data. Yes, we can train a learning model to recognize an animal, say a cat, by feeding many cat pictures. However, in the beginning, a human needs to label the pictures if it is a cat picture. Even if unsupervised learning can help us group data, it still does not know if a group of pictures is a cat. Labeling data still requires human involvement.

Third, even the creator of a machine learning solution does not know why the machine learning solution makes this decision. The essence of machine learning is learning from data. We feed data to a learning model, and it predicts the results. In the Perceptron Learning Algorithm example, the weights of the final hypothesis may look like [ -4.0, -8.6, 14.2], but it is not easy to explain why the learning model gave us these weights. Even with the simplest learning algorithm, Perceptron, we are not able to explain why. Needless to say, it is almost impossible to explain how more sophisticated learning algorithms work.

With deep learning success and the significant improvement of computation power, we can train a lot more data faster than ever before. However, deep learning’s performance heavily depends on the quantity and quality of training data—the more, the better. This leads to another challenge: the generation of high-quality data cannot keep up with the consumption of deep learning. Besides, as the importance of information privacy grows, it’s much harder to collect user-related data. Therefore, collecting data is also getting more and more expensive. For this problem, one may think, how about we feed the AI-generated data to train a model? The paper, AI models collapse when trained on recursively generated data | Nature, shows that training on generated data will cause irreversible defects in the resulting models, and the authors called this phenomenon model collapse.

Although machine learning has shown significant promising results and usability, many challenges are still waiting to be resolved.

Recommendation Reading

As mentioned at the beginning of this article, machine learning is a very hot topic today. There are a lot of books and online resources talking about it. Most focus on using libraries, e.g., OpenCV, TensorFlow, and machine learning algorithms such as SVM and neural networks. All of these are good stuff. However, to master these tools, a comprehensive understanding of machine learning theory is reasonably necessary. Therefore, I highly recommend this book, Learning From Data. The authors, Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, did a great job explaining the fundamentals of machine learning in mathematical and practical ways. Of course, it talks about a lot of mathematics. They are not complicated but definitely take time to understand fully. Once we understand the fundamental concept of machine learning comprehensively, it is easier for us to learn advanced topics in machine learning. This book is highly recommended for people who want to build a solid foundation in machine learning.