# Chapter 6 Classification

This chapter continues our discussion of **supervised learning** by introducing the **classification** task. As with regression, we will focus on the conditional distribution of the response.

Specifically, we will discuss:

- The setup for the **classification** task.
- The **Bayes classifier** and **Bayes error**.
- Estimating **conditional probabilities**.
- Two simple **metrics** for the classification task.

This chapter is currently under construction. While it is being developed, refer to the STAT 432 course notes.

## 6.1 R Setup and Source

```
library(tibble) # data frame printing
library(dplyr) # data manipulation
library(knitr) # creating tables
library(kableExtra) # styling tables
```

Additionally, objects from `ggplot2`, `GGally`, and `ISLR` are accessed. Recall that the Welcome chapter contains directions for installing all necessary packages for following along with the text. The R Markdown source is provided because some code, mostly for creating plots, has been suppressed from the rendered document that you are currently reading.

**R Markdown Source:** `classification.Rmd`

## 6.2 Data Setup

- TODO: Add data setup example.

## 6.3 Mathematical Setup

## 6.4 Example

| | \(X = 1\) | \(X = 2\) | \(X = 3\) | \(X = 4\) |
|---|---|---|---|---|
| \(Y = A\) | 0.12 | 0.01 | 0.04 | 0.14 |
| \(Y = B\) | 0.05 | 0.03 | 0.10 | 0.15 |
| \(Y = C\) | 0.09 | 0.06 | 0.08 | 0.13 |

| \(X = 1\) | \(X = 2\) | \(X = 3\) | \(X = 4\) |
|---|---|---|---|
| 0.26 | 0.10 | 0.22 | 0.42 |

| \(Y = A\) | \(Y = B\) | \(Y = C\) |
|---|---|---|
| 0.31 | 0.33 | 0.36 |
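
The first table gives the joint probabilities \(P[Y = k, X = x]\). The second and third tables are the marginal distributions of \(X\) and \(Y\), obtained by summing the joint table over the other variable. As a minimal sketch (the object names `joint`, `marg_x`, `marg_y`, and `cond_y_given_x` are chosen here for illustration), the marginals and the conditional distribution of \(Y\) given \(X\) can be computed as:

```r
# joint distribution from the first table: rows are Y, columns are X
joint = matrix(
  c(0.12, 0.01, 0.04, 0.14,
    0.05, 0.03, 0.10, 0.15,
    0.09, 0.06, 0.08, 0.13),
  nrow = 3, byrow = TRUE,
  dimnames = list(Y = c("A", "B", "C"), X = c("1", "2", "3", "4"))
)

# marginal distributions: sum the joint probabilities over the other variable
marg_x = colSums(joint)
marg_y = rowSums(joint)

# conditional distribution of Y given X = x: divide each column by P[X = x]
cond_y_given_x = sweep(joint, MARGIN = 2, STATS = marg_x, FUN = "/")

marg_x
marg_y
round(cond_y_given_x, 3)
```

Each column of `cond_y_given_x` sums to one, since it is a probability distribution over the classes for a fixed value of \(X\).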

## 6.5 Bayes Classifier

\[ p_k(x) = P\left[ Y = k \mid X = x \right] \]

\[ C^B(x) = \underset{k \in \{1, 2, \ldots, K\}}{\text{argmax}} P\left[ Y = k \mid X = x \right] \]

**Warning:** The Bayes classifier should not be confused with a naive Bayes classifier. The Bayes classifier assumes that we know \(P\left[ Y = k \mid X = x \right]\), a quantity that is almost never known in practice. A naive Bayes classifier is a method we will see later that learns a classifier from data.^{77}
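
To make the definition concrete, here is a small sketch that applies it to the joint distribution from the example section, where the true probabilities are known. The object names are illustrative assumptions, not part of the original text.

```r
# joint distribution from the example: rows are Y, columns are X
joint = matrix(
  c(0.12, 0.01, 0.04, 0.14,
    0.05, 0.03, 0.10, 0.15,
    0.09, 0.06, 0.08, 0.13),
  nrow = 3, byrow = TRUE,
  dimnames = list(Y = c("A", "B", "C"), X = c("1", "2", "3", "4"))
)

# p_k(x) = P[Y = k | X = x]: each column divided by the marginal P[X = x]
cond_y_given_x = sweep(joint, MARGIN = 2, STATS = colSums(joint), FUN = "/")

# Bayes classifier: for each x, the label with the largest conditional probability
bayes_class = rownames(cond_y_given_x)[apply(cond_y_given_x, 2, which.max)]
names(bayes_class) = colnames(cond_y_given_x)
bayes_class
```

For this distribution the Bayes classifier predicts \(A\) when \(X = 1\), \(C\) when \(X = 2\), and \(B\) when \(X = 3\) or \(X = 4\).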

### 6.5.1 Bayes Error Rate

\[ 1 - \mathbb{E}_X\left[ \underset{k}{\text{max}} \ P[Y = k \mid X] \right] \]
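
As a worked instance of this formula, the sketch below computes the Bayes error rate for the joint distribution from the example section. The object names are illustrative assumptions.

```r
# joint distribution from the example: rows are Y, columns are X
joint = matrix(
  c(0.12, 0.01, 0.04, 0.14,
    0.05, 0.03, 0.10, 0.15,
    0.09, 0.06, 0.08, 0.13),
  nrow = 3, byrow = TRUE,
  dimnames = list(Y = c("A", "B", "C"), X = c("1", "2", "3", "4"))
)

marg_x = colSums(joint)
cond_y_given_x = sweep(joint, MARGIN = 2, STATS = marg_x, FUN = "/")

# largest conditional probability for each value of x
max_prob = apply(cond_y_given_x, 2, max)

# expectation over X of the largest conditional probability, subtracted from one
bayes_error = 1 - sum(marg_x * max_prob)
bayes_error
```

Because \(P[X = x] \cdot \max_k P[Y = k \mid X = x] = \max_k P[Y = k, X = x]\), the same value can be read off the joint table directly: \(1 - (0.12 + 0.06 + 0.10 + 0.15) = 0.57\).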

## 6.6 Building a Classifier

\[ \hat{p}_k(x) = \hat{P}\left[ Y = k \mid X = x \right] \]

\[ \hat{C}(x) = \underset{k \in \{1, 2, \ldots, K\}}{\text{argmax}} \hat{p}_k(x) \]

- TODO: first estimate the conditional distribution, then classify to the label with the highest probability
- TODO: do it in R with knn, tree, or glm / nnet
- TODO: note about estimating probabilities vs training a classifier
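
Until the items above are filled in, the two-step idea (estimate \(\hat{p}_k(x)\), then take the argmax) can be sketched with base R alone. This is not one of the methods named above. It assumes a discrete feature, simulates hypothetical training data from the joint distribution in the example section, and uses observed class proportions as the estimate. The names `sim`, `p_hat`, and `c_hat` are illustrative.

```r
set.seed(42)

# joint distribution from the example: rows are Y, columns are X
joint = matrix(
  c(0.12, 0.01, 0.04, 0.14,
    0.05, 0.03, 0.10, 0.15,
    0.09, 0.06, 0.08, 0.13),
  nrow = 3, byrow = TRUE,
  dimnames = list(Y = c("A", "B", "C"), X = c("1", "2", "3", "4"))
)

# simulate hypothetical training data by sampling (y, x) cells of the joint table
# expand.grid varies y fastest, matching the column-major order of as.vector(joint)
cells = expand.grid(y = rownames(joint), x = colnames(joint),
                    stringsAsFactors = FALSE)
idx = sample(nrow(cells), size = 1000, replace = TRUE, prob = as.vector(joint))
sim = data.frame(
  y = factor(cells$y[idx], levels = rownames(joint)),
  x = factor(cells$x[idx], levels = colnames(joint))
)

# step 1: estimate p_k(x) by the proportion of each class within each value of x
p_hat = prop.table(table(sim$y, sim$x), margin = 2)

# step 2: classify each x to the label with the highest estimated probability
c_hat = rownames(p_hat)[apply(p_hat, 2, which.max)]
names(c_hat) = colnames(p_hat)

round(p_hat, 3)
c_hat
```

With enough data the estimated classifier should agree with the Bayes classifier, but for values of \(x\) where two classes have similar conditional probabilities it may not.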

## 6.7 Modeling

### 6.7.1 Linear Models

- TODO: use `nnet::multinom` in place of `glm()`? always?
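
As a placeholder, here is a minimal sketch of what fitting a multinomial model with `nnet::multinom()` might look like. The choice of the built-in `iris` data, the train-test split, and the object names are assumptions made for illustration. The sketch does not address whether `multinom()` should always replace `glm()`.

```r
set.seed(42)

# hypothetical train-test split of the built-in iris data
trn_idx = sample(nrow(iris), size = 100)
iris_trn = iris[trn_idx, ]
iris_tst = iris[-trn_idx, ]

# fit a multinomial logistic regression; trace = FALSE silences optimizer output
fit_multinom = nnet::multinom(Species ~ ., data = iris_trn, trace = FALSE)

# estimated conditional probabilities, one column per class
prob_multinom = predict(fit_multinom, newdata = iris_tst, type = "probs")

# classifications: the class with the highest estimated probability
pred_multinom = predict(fit_multinom, newdata = iris_tst, type = "class")

head(prob_multinom)
head(pred_multinom)
```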

### 6.7.2 k-Nearest Neighbors

- TODO: use `caret::knn3()`
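
A minimal sketch of `caret::knn3()`, assuming the `caret` package is installed. The `iris` data, the split, the value `k = 5`, and the object names are illustrative choices, not recommendations from the text.

```r
set.seed(42)

# hypothetical train-test split of the built-in iris data
trn_idx = sample(nrow(iris), size = 100)
iris_trn = iris[trn_idx, ]
iris_tst = iris[-trn_idx, ]

# fit k-nearest neighbors with k = 5 (an arbitrary choice for illustration)
fit_knn = caret::knn3(Species ~ ., data = iris_trn, k = 5)

# estimated conditional probabilities: proportion of neighbors in each class
prob_knn = predict(fit_knn, newdata = iris_tst, type = "prob")

# classifications: the class with the highest estimated probability
pred_knn = predict(fit_knn, newdata = iris_tst, type = "class")

head(prob_knn)
head(pred_knn)
```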

### 6.7.3 Decision Trees

- TODO: use `rpart::rpart()`
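
A minimal sketch of `rpart::rpart()` with default tuning parameters. The `iris` data, the split, and the object names are assumptions made for illustration.

```r
set.seed(42)

# hypothetical train-test split of the built-in iris data
trn_idx = sample(nrow(iris), size = 100)
iris_trn = iris[trn_idx, ]
iris_tst = iris[-trn_idx, ]

# fit a classification tree; a factor response implies method = "class"
fit_tree = rpart::rpart(Species ~ ., data = iris_trn)

# estimated conditional probabilities: class proportions in each terminal node
prob_tree = predict(fit_tree, newdata = iris_tst, type = "prob")

# classifications: the class with the highest estimated probability
pred_tree = predict(fit_tree, newdata = iris_tst, type = "class")

head(prob_tree)
head(pred_tree)
```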

## 6.8 Classification Metrics

As with regression, classification metrics depend on the learned function (in this case, the learned classifier \(\hat{C}\)) as well as an additional dataset that is used to make classifications with the learned function.^{78} This again makes the notation for a “simple” metric like accuracy look significantly more complicated than it truly is, but it is helpful to make the dependency on the datasets explicit.

### 6.8.1 Misclassification

\[ \text{misclass}\left(\hat{C}_{\texttt{set_f}}, \mathcal{D}_{\texttt{set_D}} \right) = \frac{1}{n_{\texttt{set_D}}}\displaystyle\sum_{i \in {\texttt{set_D}}}^{} I\left(y_i \neq \hat{C}_{\texttt{set_f}}({x}_i)\right) \]

### 6.8.2 Accuracy

\[ \text{accuracy}\left(\hat{C}_{\texttt{set_f}}, \mathcal{D}_{\texttt{set_D}} \right) = \frac{1}{n_{\texttt{set_D}}}\displaystyle\sum_{i \in {\texttt{set_D}}}^{} I\left(y_i = \hat{C}_{\texttt{set_f}}({x}_i)\right) \]

Plugging in the appropriate dataset will allow for calculation of train, test, and validation metrics.
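
Despite the notation, both metrics reduce to a one-line calculation. In the sketch below, the function names `calc_misclass()` and `calc_accuracy()` and the small label vectors are hypothetical, made up for illustration.

```r
# proportion of observations where the classification differs from the truth
calc_misclass = function(actual, predicted) {
  mean(actual != predicted)
}

# proportion of observations where the classification matches the truth
calc_accuracy = function(actual, predicted) {
  mean(actual == predicted)
}

# hypothetical labels and classifications for five observations
actual    = c("A", "B", "C", "A", "B")
predicted = c("A", "B", "A", "A", "C")

calc_misclass(actual, predicted) # two of five differ: 0.4
calc_accuracy(actual, predicted) # three of five match: 0.6
```

The two metrics always sum to one. Passing the labels and classifications for the train, test, or validation data gives the corresponding metric.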

77. This author feels that the use of “Bayes” in “Bayes classifier” is confusing because you don’t actually need to apply Bayes’ theorem to state or understand the result. Meanwhile, Bayes’ theorem is needed to understand the naive Bayes classifier. Oh well. Naming things is hard.

78. The metrics also implicitly depend on the dataset used to learn the classifier.