Chapter 6 Classification

This chapter continues our discussion of supervised learning by introducing the classification task. Like regression, we will focus on the conditional distribution of the response.

Specifically, we will discuss:

  • The setup for the classification task.
  • The Bayes classifier and Bayes error.
  • Estimating conditional probabilities.
  • Two simple metrics for the classification task.

This chapter is currently under construction. While it is being developed, the corresponding STAT 432 course notes may serve as a reference.

6.1 R Setup and Source

library(tibble)     # data frame printing
library(dplyr)      # data manipulation

library(knitr)      # creating tables
library(kableExtra) # styling tables

Additionally, objects from ggplot2, GGally, and ISLR are accessed. Recall that the Welcome chapter contains directions for installing all necessary packages for following along with the text. The R Markdown source is provided because some code, mostly for creating plots, has been suppressed from the rendered document that you are currently reading.

6.2 Data Setup

  • TODO: Add data setup example.

6.3 Mathematical Setup

6.4 Example

The joint distribution of \(X\) and \(Y\), \(P[X = x, Y = y]\):

           \(X = 1\)   \(X = 2\)   \(X = 3\)   \(X = 4\)
\(Y = A\)     0.12        0.01        0.04        0.14
\(Y = B\)     0.05        0.03        0.10        0.15
\(Y = C\)     0.09        0.06        0.08        0.13

The marginal distribution of \(X\), found by summing down each column:

  \(X = 1\)   \(X = 2\)   \(X = 3\)   \(X = 4\)
    0.26        0.10        0.22        0.42

The marginal distribution of \(Y\), found by summing across each row:

  \(Y = A\)   \(Y = B\)   \(Y = C\)
    0.31        0.33        0.36
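
These marginal distributions can be verified in R. As a quick sketch, we store the joint distribution in a matrix (simply re-entering the table above) and sum across its dimensions:

# joint distribution P[X = x, Y = y] from the table above
joint = matrix(
  c(0.12, 0.01, 0.04, 0.14,
    0.05, 0.03, 0.10, 0.15,
    0.09, 0.06, 0.08, 0.13),
  nrow = 3, byrow = TRUE,
  dimnames = list(Y = c("A", "B", "C"), X = c("1", "2", "3", "4"))
)

colSums(joint) # marginal distribution of X: 0.26 0.10 0.22 0.42
rowSums(joint) # marginal distribution of Y: 0.31 0.33 0.36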

6.5 Bayes Classifier

For each class \(k\), define the conditional probability of that class given a value of the features \(x\),

\[ p_k(x) = P\left[ Y = k \mid X = x \right] \]

The Bayes classifier assigns to each \(x\) the class with the highest conditional probability:

\[ C^B(x) = \underset{k \in \{1, 2, \ldots, K\}}{\text{argmax}} \ P\left[ Y = k \mid X = x \right] \]

Warning: The Bayes classifier should not be confused with a naive Bayes classifier. The Bayes classifier assumes that \(P\left[ Y = k \mid X = x \right]\) is known, which is almost never the case in practice. A naive Bayes classifier is a method we will see later that learns a classifier from data.77
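
For the example distribution above, the conditional probabilities are known exactly, so the Bayes classifier can be computed directly. A quick sketch, reusing the joint matrix defined earlier:

# conditional probabilities p_k(x): divide each column of the
# joint distribution by the marginal probability of that x
cond = sweep(joint, 2, colSums(joint), "/")

# Bayes classifier: for each x, the class k that maximizes p_k(x)
rownames(cond)[apply(cond, 2, which.max)] # "A" "C" "B" "B"

For example, at \(X = 1\) the conditional probabilities are \(0.12 / 0.26 \approx 0.46\), \(0.05 / 0.26 \approx 0.19\), and \(0.09 / 0.26 \approx 0.35\), so the Bayes classifier predicts class \(A\).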


6.5.1 Bayes Error Rate

The Bayes classifier achieves the lowest possible error rate among all classifiers, known as the Bayes error rate:

\[ 1 - \mathbb{E}_X\left[ \underset{k}{\text{max}} \ P[Y = k \mid X = x] \right] \]
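
Because \(\mathbb{E}_X\left[ \max_k \, p_k(x) \right] = \sum_{x} P[X = x] \cdot \max_k P[Y = k \mid X = x] = \sum_{x} \max_k P[Y = k, X = x]\), the Bayes error rate for the example distribution can be computed directly from the joint matrix:

# largest joint probability in each column, summed over the values of x
1 - sum(apply(joint, 2, max)) # Bayes error rate: 1 - 0.43 = 0.57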

6.6 Building a Classifier

\[ \hat{p}_k(x) = \hat{P}\left[ Y = k \mid X = x \right] \]

\[ \hat{C}(x) = \underset{k \in \{1, 2, \ldots, K\}}{\text{argmax}} \ \hat{p}_k(x) \]

  • TODO: first estimate the conditional distribution, then classify to the label with the highest estimated probability
  • TODO: do it in R with knn, tree, or glm / nnet (a preliminary sketch follows below)
  • TODO: note about estimating probabilities vs training a classifier
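
As a placeholder until the data setup above is complete, the following sketch uses the ISLR::Default data and logistic regression via glm() to illustrate the two-step process: estimate the conditional probabilities, then classify to the most probable label.

library(ISLR)

# estimate P[Y = "Yes" | X = x] with logistic regression
fit = glm(default ~ balance, data = Default, family = binomial)
p_hat = predict(fit, newdata = Default, type = "response")

# classify to the label with the highest estimated probability
# (with two classes, this reduces to a 0.5 cutoff)
c_hat = ifelse(p_hat > 0.5, "Yes", "No")
head(c_hat)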

6.7 Modeling

6.7.1 Linear Models

  • TODO: use nnet::multinom
    • in place of glm()? always?
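
While this TODO is pending, a minimal sketch of nnet::multinom(), using iris as a stand-in dataset: unlike glm() with family = binomial, which handles only two classes, multinom() accommodates two or more.

library(nnet)

# multinomial logistic regression for a three-class response
fit = multinom(Species ~ ., data = iris, trace = FALSE)

head(predict(fit, newdata = iris, type = "probs")) # estimated probabilities
head(predict(fit, newdata = iris, type = "class")) # predicted labels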

6.7.2 k-Nearest Neighbors

  • TODO: use caret::knn3()
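
A similar sketch for caret::knn3(), again with iris as a stand-in; the choice of k = 10 is arbitrary and purely illustrative.

library(caret)

# k-nearest neighbors classifier
fit = knn3(Species ~ ., data = iris, k = 10)

head(predict(fit, newdata = iris, type = "prob"))  # estimated probabilities
head(predict(fit, newdata = iris, type = "class")) # predicted labels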

6.7.3 Decision Trees

  • TODO: use rpart::rpart()
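
And a sketch for rpart::rpart() with its default tuning parameters, again using iris as a stand-in:

library(rpart)

# decision tree; rpart grows a classification tree for a factor response
fit = rpart(Species ~ ., data = iris)

head(predict(fit, newdata = iris, type = "prob"))  # estimated probabilities
head(predict(fit, newdata = iris, type = "class")) # predicted labels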

6.8 Classification Metrics

Like regression, classification metrics will depend on the learned function (in this case, the learned classifier \(\hat{C}\)) as well as an additional dataset that is being used to make classifications with the learned function.78 Like regression, this will make the notation for a “simple” metric like accuracy look significantly more complicated than it truly is, but it will be helpful to make the dependency on the datasets explicit.

6.8.1 Misclassification

\[ \text{misclass}\left(\hat{C}_{\texttt{set_f}}, \mathcal{D}_{\texttt{set_D}} \right) = \frac{1}{n_{\texttt{set_D}}}\displaystyle\sum_{i \in {\texttt{set_D}}}^{} I\left(y_i \neq \hat{C}_{\texttt{set_f}}({x}_i)\right) \]

# proportion of predictions that do not match the actual labels
calc_misclass = function(actual, predicted) {
  mean(actual != predicted)
}

6.8.2 Accuracy

\[ \text{accuracy}\left(\hat{C}_{\texttt{set_f}}, \mathcal{D}_{\texttt{set_D}} \right) = \frac{1}{n_{\texttt{set_D}}}\displaystyle\sum_{i \in {\texttt{set_D}}}^{} I\left(y_i = \hat{C}_{\texttt{set_f}}({x}_i)\right) \]

# proportion of predictions that match the actual labels
calc_accuracy = function(actual, predicted) {
  mean(actual == predicted)
}

Plugging in the appropriate dataset will allow for calculation of train, test, and validation metrics.
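
For example, with some hypothetical actual and predicted labels, purely for illustration:

actual    = c("A", "B", "A", "C")
predicted = c("A", "B", "C", "C")

calc_misclass(actual, predicted) # 0.25
calc_accuracy(actual, predicted) # 0.75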


  77. This author feels that the use of “Bayes” in “Bayes classifier” is confusing because you don’t actually need to apply Bayes’ theorem to state or understand the result. Meanwhile, Bayes’ theorem is needed to understand the naive Bayes classifier. Oh well. Naming things is hard.↩︎

  78. The metrics also implicitly depend on the dataset used to learn the classifier.↩︎