
Multiclass Neural Networks

The focus of this article will be on the math behind simple neural networks and implementing the code in python from scratch. Also, the differences between binary and multiclass models will be highlighted.

This article tries to be as detailed as necessary in covering each derivation of the formulas, so that every line of code can be comprehended. This might not be a guide for complete beginners, as some basic knowledge in linear algebra is required.

primer


A neural network is, simply put, a function $\phi$ that maps input to output. The dimensions of the input and output, as well as the network's inner topology, are all variable. The function itself is not explicitly defined beforehand, but inferred from the dataset.

$$
\begin{aligned}
\mathbb{R}^{d_1} \; &\to \quad \mathbb{R}^{d_2} \\
x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} &\overset{\phi}{\longmapsto} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix} = a
\end{aligned}
$$

Example of a neural network $\phi$ mapping input $x$ to output $a$.

The inputs and outputs will be column vectors, as I find that more intuitive when working with matrices. This is an arbitrary convention and can easily be changed by transposition.
We can also run a batch of samples through the neural network.

$$
\begin{aligned}
\mathbb{R}^{d_1 \times N} \quad &\to \qquad \mathbb{R}^{d_2 \times N} \\
X = \begin{bmatrix} | & & | \\ x^{(1)} & \cdots & x^{(N)} \\ | & & | \end{bmatrix} &\overset{\phi}{\longmapsto} \begin{bmatrix} | & & | \\ a^{(1)} & \cdots & a^{(N)} \\ | & & | \end{bmatrix} = A
\end{aligned}
$$

Different data points will be denoted by upper indices in parentheses, their individual entries by lower indices.

The propagation from one layer to the next consists of an affine linear transformation, $x \mapsto Wx + b$, followed by a non-linear activation function $\sigma$, applied element-wise.

$$
\begin{aligned}
\varphi(x) &= \sigma(Wx + b) = \sigma(z) \\
&= \begin{pmatrix} \sigma(z_1) \\ \sigma(z_2) \\ \vdots \\ \sigma(z_{d_2}) \end{pmatrix} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_{d_2} \end{pmatrix} = a \in \mathbb{R}^{d_2}
\end{aligned}
$$

$W$ is a matrix $\in \mathbb{R}^{d_2 \times d_1}$ (here, a linear transformation from $\mathbb{R}^{d_1}$ to $\mathbb{R}^{d_2}$) and the bias $b$ is a vector $\in \mathbb{R}^{d_2}$.
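To make this concrete, here is a minimal sketch (my own, not part of the article's code) of one layer acting on a single sample, with the dimensions $d_1 = 2$ and $d_2 = 3$ chosen arbitrarily for illustration:

import numpy as np

d1, d2 = 2, 3                      # input and output dimensions, chosen arbitrarily
x = np.random.randn(d1, 1)         # input column vector
W = np.random.randn(d2, d1)        # weight matrix
b = np.zeros((d2, 1))              # bias vector
z = W @ x + b                      # affine linear transformation, shape (3, 1)
a = 1 / (1 + np.exp(-z))           # element-wise activation (here a sigmoid)
print(a.shape)                     # (3, 1)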

A neural network may be composed of multiple layers, each acting on the previous layer's output, or ‘neural activation’.

$$\varphi_l(a^{[l-1]}) = \sigma^{[l]}(W^{[l]} a^{[l-1]} + b^{[l]}) = a^{[l]}$$

$$
\begin{aligned}
\phi(x) &= (\varphi_L \circ \cdots \circ \varphi_2 \circ \varphi_1)(x) \\
&= \sigma^{[L]}(W^{[L]}\sigma^{[L-1]}(\cdots\sigma^{[1]}(W^{[1]}x + b^{[1]})\cdots) + b^{[L]}) \\
&= a^{[L]}
\end{aligned}
$$

Upper indices in square brackets denote the layer, from $1$ to $L$.

It would seem more logical to start with the general multiclass model and then to study the binary model as a special case. Nonetheless, I will begin with the binary model, since calculating derivatives and predicting is simpler in this case.

binary classification


A general multiclass model would use two output dimensions here, each representing the probability of the sample belonging to the respective class $0$ or $1$. Yet, as the probability of belonging to ‘class 1’ is $1$ minus the probability of belonging to ‘class 0’, $a_2 = 1 - a_1$, and vice versa, one output variable becomes redundant. Only one output ‘neuron’ $a$ is needed.
The estimated probability that the input belongs to each class, given our model $\phi$, is then

$$
\begin{aligned}
P(x \text{ of class } 1 \;|\; \phi) &= a \\
P(x \text{ of class } 0 \;|\; \phi) &= 1 - a
\end{aligned}
$$

where $a \in (0, 1)$.

imports

NumPy will be used for matrix calculations and matplotlib for plots. Later, scikit-learn will help in creating a toy dataset.

import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets

initialization

The function nn_init() initializes a neural network with random weights and zero bias. It takes a parameter layer_dims, an array specifying the number of neurons for each layer. layer_dims = [2, 8, 1] for example. The first entry is the input dimension and reflects the dimension of the data, the last is the network’s output dimension and is equal to the number of classes in the general case.
The function returns a dictionary param containing the network’s information.
The weights connecting the input to the first layer are stored in param['W1'], the bias in param['b1'], and so on. The number of layers LL is len(layer_dims) - 1 (not counting the input layer).

def nn_init(layer_dims):
    
    L = len(layer_dims) - 1
    
    param = {'L': L}
    
    for l in range(1, L + 1):
        param['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1])
        param['b' + str(l)] = np.zeros((layer_dims[l], 1))
    
    return param
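As a quick usage example (a sanity check of my own, using the layer sizes that appear later in the article):

param = nn_init([2, 8, 1])
print(param['W1'].shape, param['b1'].shape)   # (8, 2) (8, 1)
print(param['W2'].shape, param['b2'].shape)   # (1, 8) (1, 1)
print(param['L'])                             # 2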

forward pass

A nonlinear activation function is applied to each layer after the affine linear transformation. The hidden layers will all be equipped with the rectified linear unit, or relu-activation function.

$$
relu(z) = z \cdot \mathbb{1}_{\{z > 0\}} = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}
$$

[plot of the relu activation function]

def relu(Z):
    return Z * (Z > 0)

The last layer will use the sigmoid activation function. This way, the final output is squashed to values between $0$ and $1$, which can then be used for predicting the classes.

$$sigmoid(z) = \frac{1}{1 + e^{-z}}$$

[plot of the sigmoid activation function]

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

Each layer's activation $A$ is calculated as follows:

$$A = \sigma(W A_{prev} + b)$$

The forward pass begins with the input $X$ (a single sample or a batch) and iterates over the layers.

def nn_forward(X, param):
    
    L = param['L']
    
    A = X
    cache = {'A0': A }
    for l in range(1, L + 1):
        
        W = param['W' + str(l)]
        b = param['b' + str(l)]
        
        Z = W @ A + b
        A = relu(Z) if l != L else sigmoid(Z)
        
        cache['A' + str(l)] = A
    return A, cache

All activations $A^{[l]}$ ($l \in 1, \dots, L$) are stored in the dictionary named cache, and the input $X$ in cache['A0'].

loss function

For our network to be able to learn anything, we need to formulate a loss function. For classification tasks, the cross-entropy loss function is a common choice. Its formulation for one sample is:

$$\mathcal{H}(p, q) = -\sum_{i \in C} p_i \log(q_i)$$

where the sum is over all possible classes $C$.

$$p_i = P(x \text{ of class } i)$$

is the true probability of the sample belonging to class $i$, and

$$q_i = P(x \text{ of class } i \;|\; \phi)$$

is the predicted probability given the neural network model $\phi$.
For the binary case, where the class labels $y$ are either $0$ or $1$, the sum expands to

$$\mathcal{H}(p, q) = -p_1 \log(q_1) - p_0 \log(q_0)$$

The true probability of a sample belonging to ‘class 1’ is reflected by its label.

$$p_1 = y$$

And the true probability of not belonging to ‘class 1’, but ‘class 0’ is

$$p_0 = 1 - y$$

Similarly, the $q_i$ are the network's predictions, the last layer's output $a$.

$$
\begin{aligned}
q_1 &= a \\
q_0 &= 1 - a
\end{aligned}
$$

The binary loss function then simplifies to:

$$\mathcal{L}(a, y) = -y \log a - (1 - y)\log(1 - a)$$

For a batch of samples $X = [x^{(1)}, \dots, x^{(N)}]$, the loss can be averaged.

$$\mathcal{L}(A, Y) = -\frac{1}{N}\sum_{n=1}^N \left( y^{(n)}\log(a^{(n)}) + (1 - y^{(n)})\log(1 - a^{(n)}) \right)$$

As a reminder:
For now, $A = (a^{(1)}, \dots, a^{(N)})$ and $Y = (y^{(1)}, \dots, y^{(N)})$ are row vectors, and $a^{(n)}$ and $y^{(n)}$ are scalars, since the output has only one dimension. Later, in the multiclass case, these will be matrices and vectors respectively.

In python:

def loss(A, Y):
    N = A.shape[1]
    return np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / -N
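A quick sanity check (my own example, not part of the original code): two samples with predictions $0.8$ and $0.3$ and true labels $1$ and $0$.

A = np.array([[0.8, 0.3]])          # network outputs for two samples
Y = np.array([1, 0])                # true labels
print(loss(A, Y))                   # -(log(0.8) + log(0.7)) / 2, approximately 0.29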

backward pass

Gradient descent is like a blind man trying to reach the bottom of a hill. He is only able to feel the hill’s slope with his feet and take small steps in the hill’s downward direction.

The derivative of a function similarly reflects the rate of change (or slope, when graphed) with respect to a parameter. It essentially says how much the output will change for a small tweak in the input. In the multivariable case, the gradient points in the direction of the biggest (positive) change in output. In order to drive the network to a lower loss, all parameters are moved in the opposite direction of their gradients, in the ‘direction of negative slope’, by a small amount. Although, in general, this does not guarantee a reduction in cost.
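As an illustration (a toy example of my own, separate from the network code), here is gradient descent on the one-dimensional function $f(w) = (w - 3)^2$, whose derivative is $2(w - 3)$:

w = 0.0                         # arbitrary starting point
learning_rate = 0.1
for step in range(100):
    dw = 2 * (w - 3)            # slope of f at the current position
    w -= learning_rate * dw     # small step in the direction of negative slope
print(w)                        # close to the minimum at w = 3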

The network's parameters are its weights and biases. Thus, the gradients $dW := \frac{\partial \mathcal{L}}{\partial W}^\top$ and $db := \frac{\partial \mathcal{L}}{\partial b}^\top$ need to be evaluated for each layer. For this, the derivatives of the activation functions, the loss and the affine linear functions need to be calculated.

In the case of $\mathcal{L}(A(Z))$ (the loss function), where $\mathcal{L}$ (a scalar) depends on $A$ (a matrix), which in turn depends on $Z$ (a matrix), evaluating the derivative w.r.t. $Z$ can be done for all indices using the following chain rule:

$$\frac{\partial \mathcal{L}}{\partial z_{ij}} = \sum_{kl} \frac{\partial \mathcal{L}}{\partial a_{kl}} \frac{\partial a_{kl}}{\partial z_{ij}}$$

In matrix form this is often written as

$$\frac{\partial \mathcal{L}}{\partial Z} = \frac{\partial \mathcal{L}}{\partial A} \frac{\partial A}{\partial Z}$$

A word of warning: it is easy to make mistakes using the matrix version of the chain rule without knowing what is going on under the hood. Note, for example, that $\frac{\partial A}{\partial Z}$ is a 4-tensor, and that the derivative $\frac{\partial \mathcal{L}}{\partial A}$ is a linear transformation from $\mathbb{R}^{d \times N}$ to $\mathbb{R}$, which cannot even be represented as a matrix using conventional notation. Even if we nonetheless decide to write the derivative as a matrix, keeping consistency with the numerator layout would make it a matrix of size $N \times d$. That is why I write $\frac{\partial \mathcal{L}}{\partial A}^\top$ when referring to the gradient, which has the same dimensions as $A \in \mathbb{R}^{d \times N}$. I clarify this point further elsewhere.

Rewriting the above equation yields

$$\frac{\partial \mathcal{L}}{\partial Z}^\top = \frac{\partial A}{\partial Z}^\top \frac{\partial \mathcal{L}}{\partial A}^\top$$

or

$$dZ = \frac{\partial A}{\partial Z}^\top dA$$

$$
\begin{array}{ccccc}
\mathbb{R}^{d \times N} & \to & \mathbb{R}^{d \times N} & \to & \mathbb{R} \\
Z & \mapsto & A & \mapsto & \mathcal{L} \\
dZ & \underset{\frac{\partial A}{\partial Z}^\top}{\longleftarrow} & dA & &
\end{array}
$$

And it can be seen how, in order to calculate the desired derivative $dZ$, the gradient $dA$ is ‘backpropagated’.

cross-entropy loss

The cross-entropy loss function

$$\mathcal{L}(A, Y) = -\frac{1}{N}\sum_{n=1}^N \left( y^{(n)}\log(a^{(n)}) + (1 - y^{(n)})\log(1 - a^{(n)}) \right)$$

has the following gradient:

$$dA := \frac{\partial \mathcal{L}}{\partial A}^\top = \frac{1}{N}\left( -Y / A + (1 - Y) / (1 - A) \right)$$

where the division $Y / A$ above is meant to be performed element-wise.
This can be seen by calculating the derivative w.r.t. one sample. Then $a^{(n)}$ and $y^{(n)}$ become scalars:

$$\frac{\partial \mathcal{L}}{\partial a^{(n)}} = \frac{1}{N}\left( -\frac{y^{(n)}}{a^{(n)}} + \frac{1 - y^{(n)}}{1 - a^{(n)}} \right)$$

As the summands are independent of each other, the gradient can be written in the above, vectorized form.

activation functions

relu:

$$relu'(z) = \mathbb{1}_{\{z > 0\}} = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$$

sigmoid:

$$\sigma'(z) = \sigma(z)(1 - \sigma(z)) = a(1 - a)$$

$$
\begin{aligned}
\sigma'(z) &= \left( \frac{1}{1 + e^{-z}} \right)' = -1 \cdot (1 + e^{-z})^{-2} \cdot e^{-z} \cdot (-1) \\
&= \frac{1}{1 + e^{-z}} \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{1 + e^{-z}} \frac{1 + e^{-z} - 1}{1 + e^{-z}} \\
&= \frac{1}{1 + e^{-z}}\left( 1 - \frac{1}{1 + e^{-z}} \right) = \sigma(z)(1 - \sigma(z))
\end{aligned}
$$

where the chain rule was applied in the first step.
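A quick numerical check of this identity (my own, using a centered finite difference and the sigmoid function defined above):

z = 0.7
eps = 1e-6
numeric  = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)        # both approximately 0.2217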

affine linear functions

Given an affine linear function $Z = WX + b$, the (transposed) derivatives act by right multiplication, $\frac{\partial Z}{\partial W}^\top = \cdot\, X^\top$ and $\frac{\partial Z}{\partial b}^\top = \cdot\, \vec{1}$, so that $dW = dZ\, X^\top$ and $db = dZ\, \vec{1}$.
A derivation can be found elsewhere. Many matrix derivatives can also be looked up in The Matrix Cookbook.

piecing together

Before implementing the code in python, I believe it is helpful to run through a small 2-layer example, from which the general algorithm can be derived. The aim is to evaluate $dW$ and $db$ for each layer.

$$
\begin{array}{l|l}
\text{forward} & \text{backward} \\
\hline
Z^{[1]} = W^{[1]} X + b^{[1]} & dW^{[1]} = dZ^{[1]} X^\top \qquad db^{[1]} = dZ^{[1]} \vec{1} \\
A^{[1]} = relu(Z^{[1]}) & dZ^{[1]} = dA^{[1]} * (Z^{[1]} > 0) \\
Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]} & dW^{[2]} = dZ^{[2]} {A^{[1]}}^\top \qquad db^{[2]} = dZ^{[2]} \vec{1} \qquad dA^{[1]} = {W^{[2]}}^\top dZ^{[2]} \\
A^{[2]} = sigmoid(Z^{[2]}) & dZ^{[2]} = dA^{[2]} * A^{[2]} * (1 - A^{[2]}) \\
\mathcal{L} = -\frac{1}{N}\sum\left[ Y\log(A^{[2]}) + (1 - Y)\log(1 - A^{[2]}) \right] & dA^{[2]} = \frac{1}{N}\left( -Y / A^{[2]} + (1 - Y) / (1 - A^{[2]}) \right)
\end{array}
$$

The forward column is computed from top to bottom; the backward column is evaluated from bottom to top, starting from the loss.

‘*’ and ‘/’ on matrices are used to indicate element-wise operations, keeping close to python's syntax.

Making sure that the gradients match the dimensions of their variables can also prove helpful. For example, one should be able to verify that A2.shape == dA2.shape.

The backpropagation through the last layer can be further simplified, as it is only necessary to calculate $dZ^{[2]}$.

$$
\begin{aligned}
dZ^{[2]} &= dA^{[2]} * A^{[2]} * (1 - A^{[2]}) \\
&= -\frac{1}{N}\left( Y * (1 - A^{[2]}) - (1 - Y) * A^{[2]} \right) \\
&= \frac{1}{N}\left( A^{[2]} * (1 - Y) - Y * (1 - A^{[2]}) \right)
\end{aligned}
$$

This step also prevents numerical errors that could arise when $A^{[2]}$ is close to $0$ or $1$.

Implementation:

def relu_backward(Z):
    return Z > 0
 
def last_layer_backward(A, Y):
    N = A.shape[1]
    return (A * (1-Y) - Y * (1-A)) / N
 
def nn_backward(Y, param, cache):
    L          = param['L']
    A          = cache['A' + str(L)]
    N          = A.shape[1]
    
    grad = {}
    
    for l in reversed(range(1, L + 1)):
        
        A_prev = cache['A' + str(l-1)]
        W      = param['W' + str(l)]
        
        dZ = dA * relu_backward(A) if l != L else last_layer_backward(A, Y)
        
        grad['dW' + str(l)] = dZ @ A_prev.T
        grad['db' + str(l)] = dZ.sum(axis=1, keepdims=True)
        
        dA = W.T @ dZ
        
        A = A_prev
    
    return grad
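As an optional sanity check (a sketch of my own, not part of the original post), the analytic gradients can be compared against centered finite differences of the loss. The helper name gradient_check is hypothetical; it only assumes the functions nn_forward, nn_backward and loss defined above.

def gradient_check(X, Y, param, eps=1e-6):
    A, cache = nn_forward(X, param)
    grad = nn_backward(Y, param, cache)
    
    for l in range(1, param['L'] + 1):
        W = param['W' + str(l)]
        i, j = 0, 0                                    # probe a single weight entry
        
        W[i, j] += eps
        loss_plus = loss(nn_forward(X, param)[0], Y)
        W[i, j] -= 2 * eps
        loss_minus = loss(nn_forward(X, param)[0], Y)
        W[i, j] += eps                                 # restore the original value
        
        numeric  = (loss_plus - loss_minus) / (2 * eps)
        analytic = grad['dW' + str(l)][i, j]
        print('layer {}: analytic {:.6f}, numeric {:.6f}'.format(l, analytic, numeric))

The two values should agree to several decimal places (up to the kink of the relu at zero).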

parameter update

The parameters are updated using the following rule:

$$W = W - \alpha \frac{\partial \mathcal{L}}{\partial W}^\top = W - \alpha\, dW$$

given the learning rate $\alpha$.

def update_param(param, grad, learning_rate=0.01):
    
    for l in range(1, param['L'] + 1):
        param['W' + str(l)] -= grad['dW' + str(l)] * learning_rate
        param['b' + str(l)] -= grad['db' + str(l)] * learning_rate

training

Training the neural network is achieved by iterating over the following steps:

  1. forward pass (evaluating $A^{[l]}$ for each layer)
  2. backward pass (computing the gradients $dW^{[l]}$ and $db^{[l]}$)
  3. parameter update (tuning the weights and biases)

A list of the computed costs can be stored for logging/graphing the network’s progress.

def train(X, Y, param, epochs=50000, learning_rate=0.001,
          print_every=10000):
    
    costs = []
    
    for epoch in range(epochs + 1):
        
        A, cache = nn_forward(X, param)             #1
        grad = nn_backward(Y, param, cache)         #2
        update_param(param, grad, learning_rate)    #3
        
        if epoch % print_every == 0:
            cost = loss(A, Y)
            costs.append(cost)
            print('epoch {}, loss: {}'.format(epoch, cost))
    
    return costs

predicting and evaluation

When making a class prediction, the class with the higher probability is chosen. Thus, if the estimated probability for ‘class 1’ is greater than 0.5, the prediction is set to ‘class 1’, otherwise it will be ‘class 0’.

The network's empirical correct classification rate is

$$\frac{\text{\# correct classifications}}{\text{\# samples}}$$

def predict(X, param):
    A, _ = nn_forward(X, param)
    return A > 0.5
    
def evaluate_classifier(X, Y_true, param):
    Y_pred = predict(X, param)
    return np.mean(Y_pred == Y_true)

toy data

Scikit-learn has built-in functions to easily create datasets.

def load_dataset():
    n_samples = 300
    np.random.seed(0)
    
    X, Y = sklearn.datasets.make_moons(n_samples=n_samples, noise=.2) 
    X = X.T
    
    flip    = np.random.choice(range(n_samples), 40)    #add in some outliers
    Y[flip] = 1 - Y[flip]
    
    return X, Y
    
X, Y = load_dataset()
    
plt.scatter(X[0], X[1], c=Y, s=40, cmap='Spectral')
plt.show()

[scatter plot of the two-moons toy dataset]

The dataset is shuffled and then split into two parts, a training and a testing dataset. The test set is used for evaluating the network’s ability to extrapolate on data it has never seen before.

def split_dataset(X, Y, ratio=0.8):
    N = X.shape[1]
    split = int(N * ratio)
    shuffle = np.random.permutation(N)
    X_train = X[:,shuffle[:split]]
    X_test  = X[:,shuffle[split:]]
    Y_train = Y[shuffle[:split]]
    Y_test  = Y[shuffle[split:]]
    
    return (X_train, Y_train), (X_test, Y_test)
(X_train, Y_train), (X_test, Y_test) = split_dataset(X, Y)
print('X_train shape: {}'.format(X_train.shape))
print('Y_train shape: {}'.format(Y_train.shape))
print('X_test shape: {}'.format(X_test.shape))
print('Y_test shape: {}'.format(Y_test.shape))
X_train shape: (2, 240)
Y_train shape: (240,)
X_test shape: (2, 60)
Y_test shape: (60,)

binary model

Now that all necessary building components are finished, it is time to assemble the network. The number of input dimensions can be read directly from the dataset (in this case the data is 2-dimensional) and the output uses one neuron in the binary case. The dimension of the hidden layer is somewhat arbitrarily set to 8. Passing the specified layer dimensions to the nn_init() function, the parameters of a randomly initialized neural network are returned. The network is then trained on the training dataset.

d_in = X_train.shape[0]
d_out = 1
layer_dims = [d_in, 8, d_out]
param = nn_init(layer_dims)
costs = train(X_train, Y_train, param,
              epochs=500000, learning_rate=0.001, print_every=100000)
epoch 0, loss: 0.8943569024225951
epoch 100000, loss: 0.47308653964775604
epoch 200000, loss: 0.46855221922894785
epoch 300000, loss: 0.40372674060427743
epoch 400000, loss: 0.36625403094480086
epoch 500000, loss: 0.35062643425360634

The next part is about visualizing the performance of our network. It is not essential for understanding neural networks. The code of plot_decision_boundary is mostly adapted from scikit-learn.

code for plot_costs and plot_decision_boundary:
def plot_costs(costs, plot_every=1000, learning_rate=.001):
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x{})'.format(plot_every))
    plt.title("learning rate = {}".format(learning_rate))
    plt.show()
def plot_decision_boundary(X, Y, param, contourgrad = False):
    cmap = 'Spectral'
    h = 0.01
    x_min, x_max = X[0,:].min() - 10*h, X[0,:].max() + 10*h
    y_min, y_max = X[1,:].min() - 10*h, X[1,:].max() + 10*h
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    mesh = (np.c_[xx.ravel(), yy.ravel()]).T
    Z = predict(mesh, param)
    Z = Z.T.reshape(xx.shape)
    plt.figure(figsize=(5,5))
    if contourgrad:
        A, _ = nn_forward(mesh, param)
        plt.contourf(xx, yy, A.T.reshape(xx.shape), cmap=cmap, alpha=.3)
    else:
        plt.contourf(xx, yy, Z, cmap=cmap, alpha=.3)
    plt.contour(xx, yy, Z, colors='k', linewidths=0.5)
    plt.scatter(X[0,:], X[1,:], c=Y.squeeze(), cmap=cmap)
    plt.show()
plot_costs(costs)
plot_decision_boundary(X_train, Y_train, param, contourgrad = True)
accuracy = evaluate_classifier(X_train, Y_train, param)
print('accuracy on train set: {}'.format(accuracy))
accuracy = evaluate_classifier(X_test, Y_test, param)
print('accuracy on test set: {}'.format(accuracy))

[plot of the training cost over iterations]

[decision boundary of the binary model on the training data]

accuracy on train set: 0.8666666666666667
accuracy on test set: 0.8

multiclass neural network

The current network architecture only has one output neuron. Given, say, 10 classes, the network's $i$-th output neuron should then reflect the probability of the input belonging to ‘class $i$’.

$$
a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_c \end{pmatrix} = \begin{pmatrix} P(x \text{ of class } 1 \;|\; \phi) \\ P(x \text{ of class } 2 \;|\; \phi) \\ \vdots \\ P(x \text{ of class } c \;|\; \phi) \end{pmatrix}
$$

As an example

$$
a = \begin{pmatrix} 0.2 \\ \vdots \\ 0.6 \\ \vdots \\ 0.1 \end{pmatrix} \quad \text{with } a_i = 0.6 \text{ the probability of class } i
$$

Changing the current code-framework to be able to handle multiple classes involves the following:

  • implementing the softmax layer, to be used as the last layer’s activation function
  • revisiting the original cross-entropy loss formulation
  • calculating the backward pass for the new last layer
  • modifying the prediction function

softmax layer

After propagating the input through an affine linear layer $Z = WX + b$, $Z$ can take on all sorts of values, positive or negative. If the 10 outputs are to reflect a probability distribution (i.e. have values between 0 and 1 that all sum to 1), the following can be done.

Each value is passed through the exponential function, to ensure that it is positive, and then divided by the sum over all outputs. The result can then be seen as a probability distribution.
This is exactly what softmax does:

$$softmax(z) = \frac{e^{z}}{\sum_i e^{z_i}} = a$$

$$\mathbb{R}^d \to \mathbb{R}^d$$

As a side note: in the binary case it was redundant to work with two outputs, since the probability of not belonging to ‘class 1’ is simply

$$a_2 = 1 - a_1$$

In that case

$$
\begin{aligned}
a_1 &= \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-z_1}e^{z_2}} \\
&= \frac{1}{1 + e^{-(z_1 - z_2)}} \\
&= sigmoid(z_1 - z_2)
\end{aligned}
$$

which, substituting $z := z_1 - z_2$, is simply

$$sigmoid(z)$$

Again, we see the binary case emerge as a special case of the general case.
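A quick numerical check (my own) that a two-output softmax indeed reduces to the sigmoid of the difference:

z1, z2 = 1.3, -0.4
a1_softmax = np.exp(z1) / (np.exp(z1) + np.exp(z2))
a1_sigmoid = 1 / (1 + np.exp(-(z1 - z2)))
print(a1_softmax, a1_sigmoid)    # both approximately 0.8455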

A numerical trick that is often employed in order to prevent the exponential term from blowing up (which may result in unstable computations) is the following:

$$softmax(z) = \frac{e^{-c}}{e^{-c}} \cdot \frac{e^{z}}{\sum_i e^{z_i}} = \frac{e^{z - c}}{\sum_i e^{z_i - c}}$$

$c$ is set to $\max_i \{z_i\}$, the biggest value, to ensure numerical stability.

def softmax(Z):
    A = np.exp(Z - np.max(Z, axis=0))
    return A / A.sum(axis=0)
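A quick check (my own) that the outputs form a probability distribution per sample (column) and that the shift by the column maximum does not change the result:

Z = np.array([[1.0, 100.0],
              [2.0, 101.0],
              [3.0, 102.0]])
A = softmax(Z)
print(A.sum(axis=0))                   # [1. 1.], each column sums to one
print(np.allclose(A, softmax(Z + 5)))  # True, softmax is invariant to constant shifts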

cross-entropy loss

As a reminder, the loss for one sample is

$$\mathcal{L}(a, \hat{y}) = -\sum_{i \in C} \hat{y}_i \log(a_i)$$

where $\hat{y}$ is the one hot encoding of $y$.

The ‘one hat’ above the $y$ is supposed to serve as a reminder of the one hot encoding.

Before, a scalar $y$ was used as a label for sample $x$; $y = 3$, as an example.
Now, with one hot encoding, it is assumed that $\hat{y}$ is a vector of, say, length 10, with a ‘1’ at the index corresponding to the sample's class, and ‘0’ everywhere else.

$x$ of ‘class 3’:

$$
y = 3 \quad \Longrightarrow \quad \hat{y} = \begin{pmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{pmatrix} \leftarrow \text{third index}
$$

The one hot encoded label then has the same shape as the neural network's output.
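A small sketch (my own) of how integer labels relate to their one hot encodings in numpy, for four classes:

Y = np.array([3, 0, 1])          # three samples, labels out of four classes
Y_hat = np.eye(4)[:, Y]          # columns are the one hot encodings
print(Y_hat)
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [0. 0. 0.]
#  [1. 0. 0.]]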

To reduce overhead we can avoid explicitly converting our labels to one hot encodings by using numpy’s indexing.

$$\mathcal{L} = -\sum_{i \in C} \hat{y}_i \log(a_i) = -\log(a_y)$$

Using numpy’s notation to access elements in arrays this would look like -log(a[y]).

For a batch of samples:

$$\mathcal{L} = -\frac{1}{N}\sum_{n=1}^N \log\left( a^{(n)}_{y^{(n)}} \right)$$

Using numpy's advanced indexing, accessing these elements is done via A[Y, range(N)]. This expands to A[[Y[0], Y[1], ...], [0, 1, ...]], picking one entry per column, which can then be summed with np.sum.

def loss(A, Y):
    N = A.shape[1]
    return np.sum(np.log(A[Y, range(N)])) / -N
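A quick usage example (my own): two samples, three classes, with true classes $0$ and $2$.

A = np.array([[0.7, 0.1],
              [0.2, 0.2],
              [0.1, 0.7]])
Y = np.array([0, 2])             # true classes of the two samples
print(loss(A, Y))                # -(log(0.7) + log(0.7)) / 2, approximately 0.357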

In deriving the theory, it is useful to stick to the one-hot formulation:

$$
\begin{aligned}
\mathcal{L}(A, \hat{Y}) &= -\frac{1}{N}\sum_{n=1}^N \sum_{i \in C} \hat{y}_i^{(n)} \log(a_i^{(n)}) \\
&= -\frac{1}{N}\sum \hat{Y} \odot \log(A)
\end{aligned}
$$

where the sum $\sum$ is over all free indices and ‘$\odot$’ is the element-wise multiplication.

From now on, to make it clearer, ‘$\odot$’ and ‘$\oslash$’ will be used for element-wise multiplication and division.

backward pass

The goal is to calculate $dZ = \frac{\partial \mathcal{L}}{\partial Z}^\top = \frac{\partial A}{\partial Z}^\top \frac{\partial \mathcal{L}}{\partial A}^\top = \frac{\partial A}{\partial Z}^\top dA$, the derivative of the last layer.

cross-entropy loss

The derivative of the cross-entropy loss

$$\mathcal{L}(A, \hat{Y}) = -\frac{1}{N}\sum \hat{Y} \odot \log(A)$$

remains as before

$$dA = \frac{\partial \mathcal{L}}{\partial A}^\top = -\frac{1}{N}\, \hat{Y} \oslash A$$

since all operations are still performed element-wise.

softmax layer

The derivative of the softmax layer, $\frac{\partial A}{\partial Z}^\top$, is a bit more involved. The derivative of a matrix-valued function w.r.t. another matrix is a rank-4 tensor. I cover its full derivation via the chain rule elsewhere.

Yet, we are really only interested in calculating $dZ = \frac{\partial \mathcal{L}}{\partial Z}^\top$ (a matrix). And since the sum in the loss function involves independent terms and the softmax function acts on each sample individually, it is possible to calculate the derivative $dz = \frac{\partial \mathcal{L}}{\partial z}^\top = \frac{\partial a}{\partial z}^\top \frac{\partial \mathcal{L}}{\partial a}^\top$ w.r.t. one sample and then piece together the full derivative $dZ = \begin{bmatrix} \frac{\partial \mathcal{L}}{\partial z^{(1)}}^\top & \cdots & \frac{\partial \mathcal{L}}{\partial z^{(N)}}^\top \end{bmatrix}$. And now the term $\frac{\partial a}{\partial z}^\top$ is a (more manageable) matrix.

With respect to one sample, the derivative of the loss function is

$$
da = \frac{\partial \mathcal{L}}{\partial a}^\top = -\frac{1}{N}\, \hat{y} \oslash a = -\frac{1}{N} \begin{pmatrix} 0 \\ \vdots \\ \frac{1}{a_y} \\ \vdots \\ 0 \end{pmatrix}
$$

And the derivative $\frac{\partial a}{\partial z}^\top$ of the softmax layer

$$a = softmax(z) = \frac{e^{z}}{\sum_i e^{z_i}}$$

is calculated as follows:

$$
\begin{aligned}
i = j: \quad \frac{\partial a_i}{\partial z_i} &= \frac{\partial}{\partial z_i}\left( \frac{e^{z_i}}{\sum_k e^{z_k}} \right) \\
&= \frac{e^{z_i}}{\sum_k e^{z_k}} - \frac{e^{z_i}}{(\sum_k e^{z_k})^2}\, e^{z_i} \\
&= a_i(1 - a_i) \\
i \neq j: \quad \frac{\partial a_i}{\partial z_j} &= \frac{\partial}{\partial z_j}\left( \frac{e^{z_i}}{\sum_k e^{z_k}} \right) \\
&= -\frac{e^{z_i}}{(\sum_k e^{z_k})^2}\, e^{z_j} \\
&= -a_i a_j
\end{aligned}
$$

The product rule is used after the first equation.

In full:

$$
\begin{aligned}
\frac{\partial a}{\partial z}^\top &= -\begin{bmatrix} a_1(a_1 - 1) & a_1 a_2 & \cdots \\ a_2 a_1 & a_2(a_2 - 1) & \\ \vdots & & \ddots \end{bmatrix} \\
&= -(a a^\top - \operatorname{diag}(a))
\end{aligned}
$$

Combining both derivatives to obtain $dz$ gives:

$$
\begin{aligned}
dz &= \frac{\partial \mathcal{L}}{\partial z}^\top = \frac{\partial a}{\partial z}^\top \frac{\partial \mathcal{L}}{\partial a}^\top \\
&= \frac{1}{N}(a a^\top - \operatorname{diag}(a))(\hat{y} \oslash a) \\
&= \frac{1}{N}\left( a\,(a \oslash a)^\top \hat{y} - \operatorname{diag}(a \oslash a)\,\hat{y} \right) \\
&= \frac{1}{N}\left( a\, \vec{1}^\top \hat{y} - I\hat{y} \right) = \frac{1}{N}(a - \hat{y})
\end{aligned}
$$

This is more easily seen when expanding the matrices.

$$
\begin{aligned}
dz &= \frac{\partial \mathcal{L}}{\partial z}^\top = \frac{\partial a}{\partial z}^\top \frac{\partial \mathcal{L}}{\partial a}^\top \\
&= -\begin{bmatrix} a_1(a_1 - 1) & a_1 a_2 & \cdots \\ a_2 a_1 & a_2(a_2 - 1) & \\ \vdots & & \ddots \end{bmatrix} \left( -\frac{1}{N} \begin{pmatrix} 0 \\ \vdots \\ \frac{1}{a_y} \\ \vdots \\ 0 \end{pmatrix} \right) \\
&= \frac{1}{N}\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_y - 1 \\ \vdots \\ a_d \end{pmatrix} = \frac{1}{N}(a - \hat{y})
\end{aligned}
$$

$a - \hat{y}$ results in subtracting $1$ from the element of $a$ whose index corresponds to the true class.

The full derivative now is

$$dZ = \frac{1}{N}(A - \hat{Y})$$

In python, again, using advanced indexing:

def last_layer_backward(A, Y):
    N = A.shape[1]
    dZ = np.copy(A) #copy to not modify original
    dZ[Y, range(N)] -= 1
    return dZ / N
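A quick sanity check (my own) that the implementation matches $dZ = \frac{1}{N}(A - \hat{Y})$ with an explicit one hot encoding:

A = np.array([[0.7, 0.1],
              [0.2, 0.2],
              [0.1, 0.7]])
Y = np.array([0, 2])
Y_hat = np.eye(3)[:, Y]                                          # one hot encoding
print(np.allclose(last_layer_backward(A, Y), (A - Y_hat) / 2))   # True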

predicting

The output layer contains the predicted class probabilities.

$$
a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} = \begin{pmatrix} P(x \text{ of class } 1 \;|\; \phi) \\ P(x \text{ of class } 2 \;|\; \phi) \\ \vdots \\ P(x \text{ of class } n \;|\; \phi) \end{pmatrix}
$$

The final prediction is done by selecting the class with the highest estimated probability. This corresponds to the index of the highest value.
For a batch of samples this is done per sample, i.e. for each column of the matrix (taking the argmax along axis 0 in numpy).

def predict(X, param):
    A, _ = nn_forward(X, param)
    return np.argmax(A, axis=0)

toy data

A toy dataset can be created by reusing the toy dataset from the binary case and adding multivariate normally distributed samples as a third class.

def load_dataset_multi():
    X, Y = load_dataset()
    Z = np.random.multivariate_normal( [-5, -3], [[5, 4], [4, 5]], size=100) / 5
    X = np.concatenate((X, Z.T), axis=1)
    Y = np.append(Y, (np.ones((1,100)) * 2).astype('int'))
    plt.scatter(X[0], X[1], c=Y, s=40, cmap='Spectral');
    plt.show()
    return X, Y
X, Y = load_dataset_multi()
(X_train, Y_train), (X_test, Y_test) = split_dataset(X, Y)

[scatter plot of the three-class toy dataset]

multiclass model

Finally, the multiclass neural network is initialized by specifying the layer dimensions and then is trained on the dataset, just as in the binary case.

d_in = X_train.shape[0]
d_out = len(np.unique(Y_train))    # number of classes
layer_dims = [d_in, 8, d_out]
param = nn_init(layer_dims)
train(X_train, Y_train, param, learning_rate=.001)
plot_decision_boundary(X_train, Y_train, param)
accuracy = evaluate_classifier(X_train, Y_train, param)
print('accuracy on train set: {}'.format(accuracy))
accuracy = evaluate_classifier(X_test, Y_test, param)
print('accuracy on test set: {}'.format(accuracy))
epoch 0, loss: 0.8492644177201928
epoch 10000, loss: 0.584044408860389
epoch 20000, loss: 0.5478712675155261
epoch 30000, loss: 0.5135834238863085
epoch 40000, loss: 0.500427787036695
epoch 50000, loss: 0.4898009397411907

[decision boundary of the multiclass model on the training data]

accuracy on train set: 0.8125
accuracy on test set: 0.75

The full python code can be found on my GitHub. Alternatively, the code can be run directly in a Google Colab notebook here.