In this tutorial, we will apply the k-NN classifier to the post-PCA shape space of WBC nuclei images that we generated in the previous tutorial. To do so, we will need a statistical software framework that includes classification algorithms. There are a number of popular platforms available, but we will choose Weka, developed at the University of Waikato in New Zealand, since it is relatively light-weight and easy to get running quickly.
To install Weka, follow the instructions provided at the Weka wiki.
Converting a shape space file
To convert our current PCA pipeline coordinates to a format to be used in Weka, we need to convert the
WBC_PCA.csv file that we produced in the previous tutorial and that contains the coordinates of every image in the post-PCA shape space into the
arff format used by Weka. If you have not completed the previous tutorial, or you would like to skip to the next section of this tutorial, we provide the completed file for download here.
Open Weka and navigate to
Tools --> ArffViewer.
Then navigate to
File --> Open.
Files of Type option to
CSV data files.
WBC_PCA.csv file in your
Step4_Visualization folder and click
Once all the data is loaded on screen, navigate to
File --> Save as ….
.csv extension in the
File Name field, and click
As a result, our PCA pipeline coordinates have now been converted to the file format that Weka accepts for further classification. This file should be saved as
WBC_PCA.arff in the
Step4_Visualization subfolder of the
Now that we have the PCA dataset in the correct format, click
Exit to return to the Weka home screen.
Running our first classifier
You should now be at the
Weka GUI Chooser window that shows at the application’s startup. Under
Explorer to bring up the
Weka Explorer window. This is the main window that we will use to run our classifier.
Next, we need to load our
WBC_PCA.arff file that we just created. At the top left of the window, click
Open file... Navigate to the location of your
WBC_PCA.arff file (the default location would be
Desktop/WBC_PCAPipeline/Step4_Visualization). When we do so, we should see the data loaded into the window, as shown in the figure below.
We want to ensure that Weka only considers the variables that are relevant for classifying the images by family. For this analysis, we won’t need the
FILENAME name or the
TYPE variables (if we were to include them, Weka would try to use them as one of the coordinates of our shape space vectors). So, click the checkboxes next to these two vectors, and click
Remove to exclude them from consideration.
Let’s classify! Click the
Classify tab at the top of the explorer window. Near the top of the window you will see a button that says
ZeroR next to it. This button will allow us to select our classifier.
If you’re curious what
ZeroR means, it is the clown classifier from the main text that assigns every object to the class containing the most elements. Let’s not use this classifier! Instead, click
Choose, which will bring up a menu of folders as shown below.
The k-NN classifer is contained under
lazy > IBK. Select
IBK, and you will be taken back to the explorer window, except that next to
Choose you should now see
IBK followed by a collection of parameters. The only parameter that we need for k-NN is the value of k (the number of nearest neighbors to consider when assigning a class to an object), which by default is set to 1 as indicated by
Test Options, we see
Cross-validation is selected, which is what we want. Let us leave the number of folds equal to 10, the default value.
More options, we will see
(Num) Var344. This is the variable that Weka will use to assign classes; rather, we would like Weka to classify objects by family. So, select this field, scroll up to the top, and select
Num indicates a numeric variable, and
Nom indicates a nominal variable (meaning that it corresponds to a name).
Now for the magic moment. Click
Start. The classifier should run very quickly, and the results will show in the main window to the right and are reproduced below.
The results are horrible! Every image in our dataset has been assigned as a lymphocyte. What could have gone wrong?
Reducing the number of dimensions considered
Remember when we said that weird things happen in multi-dimensional space? The above result is one of those things. For some reason, every object in the dataset is closest to a lymphocyte. We could dig into the gritty details of the data to try and determine why this is the case, but instead, we will mutter something about the curse of dimensionality.
When we used CellOrganizer to build a shape space with PCA, it produced a hyperplane with 344 dimensions (one fewer than the total number of images), which is far more than we need. The good news is that one of the features of PCA is that if we would instead like a hyperplane with some smaller number of dimensions d, then we only need to consider the first d coordinates of every point in the space.
In our case, we will simply remove most of the variables under consideration by taking d = 20. To do so, click on the
Preprocess tab. Under
All, and then unselect
FAMILY and the variables
Remove to ignore the other variables.
Removing variables is always counterintuitive to a three-dimensional mind, but let us see what happens when we run the classifier again. Click the
Classify tab, and you will see that
(Num) Var20 is selected as the variable to use for classification. Select
(Nom) FAMILY and click
Start. In our run, this produces the following confusion matrix in the output window.
This is much better! The classifier seems to be performing particularly well on granulocytes. So, if removing some variables was a good thing, let’s remove a few more. Head back to
Var20, and run the classifier again. Our run yields the following updated confusion matrix.
We are getting a little better! If we remove
Var15, you can verify that we obtain the following confusion matrix.
In each step, our confusion matrix appears to be a little better, and the metrics that we introduced in the main text improve as well. In the
Classifier output window, you can see that the accuracy has increased to 84.3%, while the weighted average of precision and recall over all three classes have increased to 0.857 and 0.843, respectively.
All this dimension reduction may make us wonder how far we should take it — should we reduce everything down to a single dimension? Yet if we remove
Var10, we see that our confusion matrix gets a little worse:
And if we take the number of dimensions down to three, it gets a little worse still:
We have therefore replicated an instance of a very deep fact in data science, which is that there is typically a “Goldilocks” value in the number of dimensions we should use for our PCA hyperplane, at which the algorithm is performing optimally. In the case of this WBC image dataset, that sweet spot value is around 10.
Note: If anything is still unclear about using Weka and exploring its output, Jen Golbeck made an excellent Youtube video that you may like to check out.
STOP: There are two other considerations that we should take into account: the value of k in our k-nearest neighbors approach (which has defaulted to 1) and the number of folds f used (which has defaulted to 10). We encourage you to continue running the k-NN classifier for a few different values of k and f (which can range from 2 to 365). What do you find? Does it match your intuition? And what happens if we try a different classifier entirely?
Subclassifying images by cell type
We classified our WBC images by family, but granulocytes further subdivide into three classes (basophils, eosinophils, and neutrophils). This means that we could just as well have classified images into five categories corresponding to cell type.
To do so, click the
Preprocess tab at the top of the Weka explorer window. Click
Open File again, and open your
WBC_PCA.arff file. (It has not been modified by Weka.) This time, under
FAMILY, and the variables
Classify, and again run k-NN with k = 1 and the number of folds equal to 10, making sure to select
(Nom) TYPE as your variable to classify. As we return to the main text, we ask you to reflect on the results.
STOP: What are the accuracy, precision, and recall of this classifier? How does the confusion matrix compare to the one that we produced for three classes? From a biological perspective, why do you think that the algorithm is struggling?