# An Overview of Classification and k-Nearest Neighbors

## The classification problem

Labeling images of WBCs according to their family is a specific instance of an ubiqitous problem in data science, in which we wish to classify each object in a given dataset into one of k classes.

In our ongoing example, the data are images of WBCs, and the classes are the three main families of WBCs (granulocytes, lymphocytes, and monocytes). To take a different example, our data could be tumor genomes sequenced from cancer patients, which we want to classify based on which therapeutic should be prescribed for the patient. Or the data may be the past sales behavior of shoppers, who we want to classify into two classes based on a prediction of whether they will buy a new product.

## The iris flower dataset

A classical dataset commonly used for motivating classification is the iris flower dataset, which was compiled by Edgar Anderson12, and which Ronald Fisher used in a seminal paper on classification in 19363. Anderson took measurements from 150 iris flowers, equally divided among three species (see figure below).

Anderson measured four attributes, or features, of each of the flowers in his dataset: both the width and height of the flower’s petal, and both the width and height of the flower’s sepal (a green offshoot beneath the petals). The features and species labels for twelve of the flowers in the iris flower dataset are shown in the table below (click here for the full dataset). Fisher considered whether it was possible to classify each flower according to its species using only the features Anderson had measured.

Sepal length (cm) Sepal width (cm) Petal length (cm) Petal width (cm) Species
5.1 3.5 1.4 0.2 I. setosa
4.9 3.0 1.4 0.2 I. setosa
4.7 3.2 1.3 0.2 I. setosa
4.6 3.1 1.5 0.2 I. setosa
7.0 3.2 4.7 1.4 I. versicolor
6.4 3.2 4.5 1.5 I. versicolor
6.9 3.1 4.9 1.5 I. versicolor
5.5 2.3 4.0 1.3 I. versicolor
6.3 3.3 6.0 2.5 I. virginica
5.8 2.7 5.1 1.9 I. virginica
7.1 3.0 5.9 2.1 I. virginica
6.3 2.9 5.6 1.8 I. virginica

A table containing values of the four features for twelve members of the iris flower dataset. The complete dataset was accessed from the University of California, Irvine Machine Learning Repository].

STOP: What characteristics do flowers from each species tend to share in terms of the four features in the table above?

## From flowers to vectors

If we were to use only two features in the Iris flower dataset, then a flower’s feature values x and y could be represented as a point in two-dimensional space (x, y). The figure below shows such a plot for the features of petal length (x-axis) and petal width (y-axis).

Petal width (x-axis) plotted against width (y-axis) for each of the flowers in the iris flower dataset, colored by species. There are not fifty points corresponding to every species because some flowers have the same petal length and width.

Note how stark the pattern in the above figure is. Even though we chose only two features from the iris flowers, the points associated with the flowers can be divided into three main clusters by species. In other words, nearby points tend to correspond to flowers from the same species.

If we were to use all four features for the iris dataset, then every flower would be represented by a point in four-dimensional space. For example, the first flower in our initial table of iris features would be represented by the point (5.1, 3.5, 1.4, 0.2). In general, when classifying a collection of data with n features, each element in the dataset can be represented by a feature vector of length n, whose i-th value corresponds to its value for the i-th feature.

## Classifying unknown elements with k-nearest neighbors

For the iris flower dataset, recall our observation that data points were more likely to belong to the same class if they were nearby. Our hope is that this fact is true for other datasets, that elements from the same class will have feature vectors that are close in n-dimensional space. If so, then we can classify a data point whose class is unknown by determining which data points with known classification it is near.

STOP: Consider the point with unknown class (gray) in the figure below. Should it be assigned to the class of the green points or to the class of the blue points?

An unknown point (gray) along with a collection of nearby points belonging to two classes, colored green and blue.

The preceding question indicates that classifying points can be surprisingly open-ended. Because of this freedom, researchers have devised a variety of different approaches for classifying data given data with known classes.

We will discuss a simple but powerful classification algorithm called k-nearest neighbors, or k-NN4. In k-NN, we fix a positive integer k in advance, which will be used for classification of all points. Then, for each point with unknown class, we assign it the class possessed by the largest number of its k closest neighbors.

In the ongoing example, if we were using k equal to 1, then we would assign the unknown point to the green class (see figure below).

When using k-NN with k equal to 1, we classify an unknown point according to the point of known class that is nearest; for this reason, the unknown point above would be assigned to the green class.

However, with the same data and k equal to 4, the figure below shows that a majority of the k nearest neighbors are blue, and so we classify the unknown point as blue. This example reinforces a theme of this course, that the results of an algorithm can be sensitive to our choice of parameters.

When using k-NN with k equal to 4, k-NN will classify the unknown point as blue, since three of its four closest neighbors are blue.

STOP: When k = 2 or k = 6 for the above figure, note that we obtain a tie in the number of points from each known class belonging to the k nearest neighbors of a point with unknown class. How could we break ties in k-NN?

In the more general case in which feature vectors have length n, we can determine which points are nearest to a given point by using the Euclidean distance, which generalizes the distance between two points in two-dimensional space to vectors in n-dimensional space. Say that we have the vectors x = (x1, x2, …, xn) and y = (y1, y2, …, yn). Then the Euclidean distance between them is given by the sum of squares of differences between corresponding vector elements:

$d(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n-y_n)^2}$

We now have learned how to use k-NN to classify feature vectors with unknown classes given vectors with known classes. There is just one problem: how can we convert an image of a WBC into a vector?

1. Anderson E (1935) The irises of the Gaspe Peninsula. Bulletin of the American Iris Society 59: 2-5.

2. Anderson E (1936) The species problem in Iris. Annals of the Missouri Botanical Garden 23(3):457-509. Available online

3. Fisher RA (1936) The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7(2):179-188. Available online

4. Fix E. and Hodges J.L. (1951) Discriminatory Analysis, Nonparametric Discrimination: Consistency Properties. Technical Report 4, USAF School of Aviation Medicine, Randolph Field. Available online