
K-nearest neighbors

K-nearest neighbors is a simple form of machine learning: an algorithm designed to extrapolate and generalize from a given dataset. This post covers implementing the algorithm in Python, with the final goal of applying it to the CIFAR-10 dataset.

The idea of k-nearest neighbors is to predict a parameter for an unseen datapoint by looking at the values of that parameter for the k “closest” known datapoints. It is a quite general algorithm that can be applied on top of many other machine-learning methods, as its performance largely depends on the distance metric used.
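
As a minimal illustration of the idea, here is a toy one-dimensional sketch (the points, labels and the use of plain absolute difference as the distance are made up purely for illustration):

import numpy as np

# toy labeled dataset: five 1-D points and their class labels (made up)
points = np.array([0.5, 1.0, 1.2, 3.0, 3.3])
labels = np.array([0, 0, 0, 1, 1])

def toy_knn_predict(query, k=3):
    distances = np.abs(points - query)            # distance of the query to every labeled point
    nearest = np.argsort(distances)[:k]           # indices of the k closest points
    return np.bincount(labels[nearest]).argmax()  # majority vote among their labels

print(toy_knn_predict(1.1))  # -> 0: the three closest points are all labeled 0
print(toy_knn_predict(2.9))  # -> 1: two of the three closest points are labeled 1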

CIFAR-10

The CIFAR-10 dataset consists of 60000 labeled 32x32 RGB images. It can be downloaded from here. Once downloaded, pickle can be used to load the batch files into a dictionary in Python.

Alternatively, using keras:
from keras.datasets import cifar10
(X_train, Y_train), (X_test, Y_test) = cifar10.load_data()
X_train = X_train.reshape(-1, 3 * 32 * 32)
X_train = X_train.astype('float32') / 255.
X_test = X_test.reshape(-1, 3 * 32 * 32)
X_test = X_test.astype('float32') / 255.
Y_train = Y_train.squeeze()
Y_test = Y_test.squeeze()

import numpy as np
import matplotlib.pyplot as plt
import pickle
def unpickle(file):
    with open(file, 'rb') as f:
        data = dict(pickle.load(f, encoding = 'latin1'))
    return data
def load_cifar(folder):
    
    Y_train = np.array([])
    X_train = np.array([])
    # load batches
    label_names = unpickle(folder + 'batches.meta')['label_names']
    for i in range(1,6):
        batch = unpickle(folder + 'data_batch_' + str(i))
        X_train = np.append(X_train, batch['data'])
        Y_train = np.append(Y_train, batch['labels'])
    batch_test = unpickle(folder + 'test_batch')
    X_test = np.array(batch_test['data'])
    Y_test = np.array(batch_test['labels'])
    
    # reshape and process
    X_train = X_train.reshape(-1, 3 * 32 * 32)
    X_train = X_train.astype('float32') / 255.
    
    X_test = X_test.reshape(-1, 3 * 32 * 32)
    X_test = X_test.astype('float32') / 255.
    
    Y_train = Y_train.astype('uint8')
    Y_test = Y_test.astype('uint8')
    return  (X_train, Y_train), (X_test, Y_test), label_names
cifar_folder = '/Users/k/Documents/ECL/Apprent/cifar-10-batches-py/'
(X_train, Y_train), (X_test, Y_test), label_names = load_cifar(cifar_folder)
print('X_train shape:')
print(X_train.shape)
print('Y_train shape:')
print(Y_train.shape)
print('X_test shape:')
print(X_test.shape)
print('Y_test shape:')
print(Y_test.shape)
print('\ncategories:')
print(label_names)
X_train shape:
(50000, 3072)
Y_train shape:
(50000,)
X_test shape:
(10000, 3072)
Y_test shape:
(10000,)

categories:
['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

The data is loaded as a vector of length $N_{batch} \cdot 3 \cdot 32 \cdot 32 = 30720000$ per batch. After concatenating, the full data vector needs to be reshaped into a matrix $X$ of dimension $N \times D = N \times 3072 = N \times 3 \cdot 32 \cdot 32$ (where $N$ is the number of samples and $D$ is the data dimension), by using .reshape(-1, 3 * 32 * 32). Then, for example, X[0] returns the first image as a vector of length $3 \cdot 32 \cdot 32$.
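
To see what the reshape does, here is a small sketch on dummy data with the same row-major layout (two fake "images" of size 3 · 2 · 2 instead of 3 · 32 · 32):

import numpy as np

# two fake flattened "images", each of length 3 * 2 * 2 = 12
dummy = np.arange(2 * 3 * 2 * 2)
X = dummy.reshape(-1, 3 * 2 * 2)
print(X.shape)  # (2, 12): one row per image
print(X[0])     # [ 0  1 ... 11], the first image as a flat vector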

Pyplot’s imshow can be used to display an image. For this, the data vector $x$ needs to be reshaped to $3 \times 32 \times 32$. Also, since pyplot’s convention is to have the RGB information in the last axis, the channel axis (the first axis after reshaping) has to be moved to the back using .transpose(1,2,0).

def show(x, title):
    x = x.reshape(3, 32, 32).transpose(1,2,0)
    plt.imshow(x)
    plt.title('label: {}'.format(title))
    plt.axis('off')
    plt.show()
show(X_train[6], label_names[Y_train[6]])

[image: X_train[6] displayed with its label]

Et voilà! An ostrich.

The next step splits a dataset (X, Y) into a training and a test set, given a ratio. This is not strictly necessary, as our dataset has already been split for us, but we can use it to work with a smaller subset during development.

def split_dataset(X, Y, ratio):
    
    N = len(Y)
    split = int(N * ratio)
    
    X_train = X[:split]
    Y_train = Y[:split]
    X_test = X[split:]
    Y_test = Y[split:]
    
    return (X_train, Y_train), (X_test, Y_test)
X_small = X_train[:10000]
Y_small = Y_train[:10000]
(X_train, Y_train), (X_test, Y_test) = split_dataset(X_small, Y_small, .8)
print('training set size: {}'.format(len(X_train)))
print('test set size: {}'.format(len(X_test)))

K-nearest neighbor distances

Central to the performance of the algorithm is the distance function $d(x, y)$, which measures the closeness of two datapoints. Since we are working with a batch of datapoints for which we would like to make predictions, we will work with a matrix $Dist = \{d\}_{i,j} = \|x^i - y^j\|^2$ containing all pairwise distances between each image of the training set and, in this case, the test set. These pairwise distances will then be used to predict labels for all images of the test set. A simple distance metric is the $l_2$ metric, the sum of the squared differences over all pixel values.
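
For reference, a naive two-loop version of this pairwise distance matrix could look like the sketch below (knn_distances_naive is not part of the final code and is far too slow for the full 3072-dimensional dataset, but it is handy to check the vectorized version against):

import numpy as np

def knn_distances_naive(X, Y):
    # pairwise squared l2 distances, computed with two explicit Python loops
    dist = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            dist[i, j] = np.sum((x - y) ** 2)
    return dist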

Mathsy explanation

In the following short part, the training set and the test set will be denoted as

$X = \begin{bmatrix}- x^1 -\\ \vdots \\ - x^{N_X} -\end{bmatrix} \in \mathbb{R}^{N_X \times D}$ and $Y = \begin{bmatrix}- y^1 -\\ \vdots \\ - y^{N_Y} -\end{bmatrix} \in \mathbb{R}^{N_Y \times D}$, respectively.

The matrix entries $d_{i,j} = dist(i, j)^2 = \|x^i - y^j\|^2$ contain the squared distance between the $i$-th image of set $X$ and the $j$-th image of set $Y$. An easy way of calculating this is by observing that

$$\|x - y\|^2 = (x - y) \cdot (x - y) = x \cdot x - 2\,x \cdot y + y \cdot y = \|x\|^2 - 2\,x \cdot y + \|y\|^2$$

(where $\cdot$ denotes the dot/scalar product of vectors)

and

$$X Y^\top = \begin{bmatrix}- x^1 -\\ \vdots \\ - x^{N_X} -\end{bmatrix} \begin{bmatrix}| & & | \\ y^1 & \ldots & y^{N_Y} \\ | & & | \end{bmatrix} = \begin{bmatrix} x^1 \cdot y^1 & \ldots & x^1 \cdot y^{N_Y} \\ \vdots & \ddots & \vdots \\ x^{N_X} \cdot y^1 & \ldots & x^{N_X} \cdot y^{N_Y} \end{bmatrix}$$

Now, starting off with the matrix $-2XY^\top$ and adding to its columns the vector $\begin{pmatrix}\|x^1\|^2\\ \vdots \\ \|x^{N_X}\|^2\end{pmatrix}$ and to its rows the vector $\begin{pmatrix}\|y^1\|^2 & \ldots & \|y^{N_Y}\|^2\end{pmatrix}$

results in

$$\begin{bmatrix} \|x^1\|^2 - 2\,x^1 \cdot y^1 + \|y^1\|^2 & \ldots & \|x^1\|^2 - 2\,x^1 \cdot y^{N_Y} + \|y^{N_Y}\|^2 \\ \vdots & \ddots & \vdots \\ \|x^{N_X}\|^2 - 2\,x^{N_X} \cdot y^1 + \|y^1\|^2 & \ldots & \|x^{N_X}\|^2 - 2\,x^{N_X} \cdot y^{N_Y} + \|y^{N_Y}\|^2 \end{bmatrix} = \begin{bmatrix} d_{11} & \ldots & d_{1N_Y} \\ \vdots & \ddots & \vdots \\ d_{N_X 1} & \ldots & d_{N_X N_Y} \end{bmatrix}$$

Thankfully, this is easily done via NumPy’s broadcasting: NumPy automatically adds a row/column vector to the rows/columns of a matrix, as long as the dimensions match up.
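
A quick toy illustration of this broadcasting rule (the shapes and numbers are made up):

import numpy as np

M = np.zeros((2, 3))               # stand-in for -2 * X @ Y.T
col = np.array([[1.], [2.]])       # shape (2, 1): gets added to every column of M
row = np.array([[10., 20., 30.]])  # shape (1, 3): gets added to every row of M
print(M + col + row)
# [[11. 21. 31.]
#  [12. 22. 32.]]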

def knn_distances(X, Y):
    # squared l2 norms of each row: a column vector of shape (N_X, 1) and a row vector of shape (1, N_Y)
    x_col = np.sum(np.square(X), axis=1, keepdims=True)
    y_row = np.sum(np.square(Y), axis=1, keepdims=True).T
    # broadcasting adds x_col to every column and y_row to every row of -2 * X @ Y.T
    return x_col - 2 * X @ Y.T + y_row
    
dist = knn_distances(X_train, X_test)
# sanity check
i, j = 11, 4
x = X_train[i]
y = X_test[j]
print((np.linalg.norm(x - y)**2).astype('float32'))
print(dist[i][j])
697.50433
697.50433

Predicting

Given an image with an unknown label, we could simply calculate its distance to all samples with known labels and then assign the label of the ‘closest’ image. This is called 1-nearest neighbor.
This case is quite simple: after having calculated the pairwise distances for a given set of unlabeled data, the row index of the smallest value in each column of the matrix corresponds, by definition, to the labeled image ‘closest’ to that unlabeled image.

def predict_1nn(dist, Y_train):
    # for each column (unlabeled image), pick the label of the training image with the smallest distance
    return Y_train[np.argmin(dist, axis=0)]

Taking the k closest samples into account leads to k-nearest neighbors. For this, for each image we run vertically down the distance matrix and note the indices of the k smallest values. Then we look at the k corresponding labels, count them and pick the label which occurs most often.
To get the k smallest indices of each column, we run np.argpartition() over the columns, which partitions the indices so that the k smallest distances come first (only these first k are separated out, as indicated by the function’s second argument; they are not fully sorted).
Then we iterate through the columns to count the occurrences of each label using np.bincount() and finally return the bin (in this case also the label) with the highest count, by applying np.argmax().

An example walking through argpartition and bincount
Code:
np.random.seed(1)
example = np.random.randint(10, size=(6,10))
print('distance matrix:')
print(example)
k = 3
k_smallest = np.argpartition(example, k, axis=0)[:k,:]
print('3-smallest indices: \n' + str(k_smallest))
print('corresponding image\'s label: \n' + str(Y_train[k_smallest]))
print('\ncategories:\n' + str(np.arange(10)) + '\n')
for k_small_j in k_smallest.T:
    labels = Y_train[k_small_j]
    print('bincount of ' + str(labels) + ':')
    bin_count = np.bincount(labels, minlength=10)
    print(bin_count)
    print('label -> {}'.format(bin_count.argmax()))
distance matrix:
[[5 8 9 5 0 0 1 7 6 9]
 [2 4 5 2 4 2 4 7 7 9]
 [1 7 0 6 9 9 7 6 9 1]
 [0 1 8 8 3 9 8 7 3 6]
 [5 1 9 3 4 8 1 4 0 3]
 [9 2 0 4 9 2 7 7 9 8]]
3-smallest indices: 
[[3 3 2 4 0 0 4 4 4 3]
 [2 4 5 5 3 5 0 2 3 2]
 [1 5 1 1 1 1 1 0 0 4]]
corresponding image's label: 
[[4 4 9 1 6 6 1 1 1 4]
 [9 1 1 1 4 1 6 9 4 9]
 [9 1 9 9 9 9 9 6 6 1]]

categories:
[0 1 2 3 4 5 6 7 8 9]

bincount of [4 9 9]:
[0 0 0 0 1 0 0 0 0 2]
label -> 9
bincount of [4 1 1]:
[0 2 0 0 1 0 0 0 0 0]
label -> 1
bincount of [9 1 9]:
[0 1 0 0 0 0 0 0 0 2]
label -> 9
bincount of [1 1 9]:
[0 2 0 0 0 0 0 0 0 1]
label -> 1
...

def knn_predict(k, dist, Y_train):
    k_smallest = np.argpartition(dist, k, axis=0)[:k,:] # indices of k-smallest distances
    Y_pred = [np.bincount(Y_train[idx]).argmax() # returns label with highest count
              for idx in k_smallest.T] # iterates through columns
    return np.array(Y_pred)
def evaluate_classifier(k, dist, Y_train, Y_test):
    Y_pred = knn_predict(k, dist, Y_train)
    return (Y_pred == Y_test).mean()
k_values = range(1, 15)
accuracies = []
for k in k_values:
    accuracies.append(evaluate_classifier(k, dist, Y_train, Y_test))
plt.plot(k_values, accuracies)
plt.title('k-nearest-neighbors evaluation on test set')
plt.xlabel('k')
plt.ylabel('accuracy')
plt.show()

[plot: test-set accuracy as a function of k]

Great! We reach an accuracy of almost 30% on the test set with a k value of around 7. However, you’ve probably heard of overfitting before: our variable k is a hyperparameter which controls the behavior of our model, and we are currently tuning it to best fit the test set. Hypothetically, if we had a bunch of other hyperparameters that we could all tune at once, we might reach a very high accuracy by mere numerical happenstance. This is why it is common practice to not look at the test set until all hyperparameter tuning is finished, and to use it only as a final evaluation.

Cross validation

One way to find which hyperparameters work best for our model is cross validation, a technique where you shuffle your training set and split it into multiple folds, then iterate through them, using one fold as the validation set and the rest as a temporary training set. This way we can evaluate our model many times using only the training set, and make sure we don’t accidentally overfit.
For this, we need to write a function that returns the indices of the new training and validation set.

def cross_fold_idx(N, folds):
    # note: if N is not divisible by folds, the last N % folds samples never end up in a validation fold
    fold_size = int(N / folds)
    for i in range(folds):
        # validation fold covers indices [a, b); training indices are everything else
        a = i * fold_size
        b = (i+1) * fold_size
        idx_valid = np.arange(a, b)
        idx_train = np.concatenate((np.arange(a), np.arange(b, N)))
        yield (idx_train, idx_valid)
Example:
N = 11
folds = 3
print(np.arange(N))
fold_size = int(N / folds)
for i in range(folds):
    a = i * fold_size
    b = (i+1) * fold_size
    print('{}-th validation set yield: [{},{})'.format(i, a, b))
[0 1 2 3 4 5 6 7 8 9 10]
0-th validation set yield: [0,3)
1-th validation set yield: [3,6)
2-th validation set yield: [6,9)

Now what’s left to do, is to:

  • shuffle the dataset
  • iterate over the folds
  • evaluate our model

We would like a function that returns the mean score and the standard deviation of our model’s evaluation across the folds.

def cross_validate_knn(X, Y, folds, k):
    
    N = len(X)
    score = []
    shuffle = np.random.permutation(N)
    for (idx_train, idx_valid) in cross_fold_idx(N, folds):
        idx_train = shuffle[idx_train]
        idx_valid = shuffle[idx_valid]
        X_train = X[idx_train]
        Y_train = Y[idx_train]
        X_valid = X[idx_valid]
        Y_valid = Y[idx_valid]
        dist = knn_distances(X_train, X_valid)
        accuracy = evaluate_classifier(k, dist, Y_train, Y_valid)
        
        score.append(accuracy)
    return np.mean(score), np.std(score)

Plots! Plots! Plots!

n_folds = 5
k_values = [1, 2, 3, 5, 8, 10, 15]
score_plot = []
std_plot = []
for k in k_values:
    
    score_mean, std = cross_validate_knn(X_train, Y_train, n_folds, k)
    
    score_plot.append(score_mean)
    std_plot.append(std)
plt.errorbar(k_values, score_plot, yerr=std_plot)
plt.title('Cross-validation k-nearest neighbor')
plt.ylabel('mean accuracy')
plt.xlabel('k')
plt.show()

[plot: cross-validated mean accuracy as a function of k, with standard-deviation error bars]

There is no immediate reason why an algorithm such as k-nearest neighbors should show any conclusive results on the raw input data. Things like camera perspective, object displacement/rotation and color tints should make it very hard to classify images by comparing single pixel values. Yet we come close to 30% accuracy (compared to the 10% a purely random classifier would achieve). The performance of knn depends on the representation of the data: if we had a better feature map, which could transform our data into highly abstracted and meaningful features, the algorithm would perform better. Conversely, when reading off a neural network’s decision, often the class with the maximum probability is simply chosen. If we were to remove the last layer and combine the network’s feature representation with knn instead, we could see a performance boost (see Christopher Olah’s Blog for more details).
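
As a rough sketch of that idea, the same knn code could be reused on extracted features instead of raw pixels; extract_features below is only a placeholder (here it returns the raw pixels, which just reproduces plain knn), standing in for e.g. the penultimate layer of a trained network:

def extract_features(X):
    # placeholder feature map -- swap in e.g. the penultimate layer of a trained
    # network; returning the raw pixels simply reproduces plain knn
    return X

F_train = extract_features(X_train)
F_test = extract_features(X_test)
dist_feat = knn_distances(F_train, F_test)
Y_pred = knn_predict(7, dist_feat, Y_train)
print('accuracy on features: {:.3f}'.format((Y_pred == Y_test).mean()))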