# Chapter 1 Machine Learning Overview

This is a book about machine learning, so let’s try to define machine learning in this chapter.

Specifically, we’ll discuss:

- What is **machine learning**?
- The difference between **supervised learning** and **unsupervised learning**.
- The difference between **classification** and **regression**.

## 1.1 What is Machine Learning?

Machine learning (ML) is about **learning functions from data**.^{24}

That’s it. Really. Pretty boring, right?^{25}

To quickly address some buzzwords that come up when discussing machine learning:

- **Deep learning** is a subset of machine learning.
- **Artificial Intelligence** (AI) overlaps machine learning but has much loftier goals. In general, if someone claims to be using AI, they are not. They’re probably using function learning! For example, we will learn logistic regression in this course. People in marketing might call that AI! Someone who understands ML will simply call it function learning. Don’t buy the hype! We don’t need to call simple methods AI to make them effective.^{26}
- Machine learning is not **data science**. Data science sometimes uses machine learning.
- Does **big data** exist? If it does, I would bet a lot of money that you haven’t seen it, and probably won’t see it that often.
- **Analytics** is just a fancy word for doing data analysis. Machine learning can be used in analyses! When it is, it is often called “Predictive Analytics.”

What makes machine learning interesting are the uses of these learned functions. We could develop functions that have applications in a wide variety of fields.

In **medicine**, we could develop a function that helps detect skin cancer.

- Input: Pixels from an image of a mole
- Output: A probability that the mole is cancerous

In **sports analytics**, we could develop a function that helps determine player salary.

- Input: Lifetime statistics of an NBA player
- Output: An estimate of the player’s salary

In **meteorology**, we could develop a function to predict the weather.

- Input: Historic weather data in Champaign, IL
- Output: A probability of rain tomorrow in Champaign, IL

In **political science**, we could develop a function that predicts the mood of the president.

- Input: The text of a tweet from President Donald Trump
- Output: A prediction of Donald’s mood (happy, sad, angry, etc.)

In **urban planning**, we could develop a function that predicts the rental prices of Airbnbs.

- Input: The attributes of the location for rent
- Output: An estimate of the rent of the property

How do we learn these functions? By looking at many previous examples, that is, data! Again, we will learn functions from data. That’s what we’re here to do.

## 1.2 Machine Learning Tasks

When doing machine learning, we will classify our *tasks* into one of two categories, **supervised** or **unsupervised** learning.^{27} Within these two broad categories of ML tasks, we will define some specifics.

### 1.2.1 Supervised Learning

In supervised learning, we want to “predict” a specific *response variable*. (The response variable might also be called the target or outcome variable.) In the following examples, this is the `y` variable. Supervised learning tasks are called **regression** if the response variable is *numeric*. If a supervised learning task has a *categorical* response, it is called **classification**.

#### 1.2.1.1 Regression

In the regression task, we want to predict **numeric** response variables. The non-response variables, which we will call the feature variables, or simply **features**, can be either *categorical* or *numeric*.^{28}

| x1 | x2 | x3 | y |
|---|---|---|---|
| A | -0.66 | 0.48 | 14.09 |
| A | 1.55 | 0.97 | 2.92 |
| A | -1.19 | -0.81 | 15.00 |
| A | 0.15 | 0.28 | 9.29 |
| B | -1.09 | -0.16 | 17.57 |
| B | 1.61 | 1.94 | 2.12 |
| B | 0.04 | 1.72 | 8.92 |
| A | 1.31 | 0.36 | 4.40 |
| C | 0.98 | 0.30 | 4.40 |
| C | 0.88 | -0.39 | 4.52 |

With the data above, our goal would be to learn a function that takes as input values of the three features (`x1`, `x2`, and `x3`) and returns a prediction (best guess) for the true (but usually unknown) value of the response `y`. For example, we could obtain some “new” data that does not contain the response.

| x1 | x2 | x3 |
|---|---|---|
| B | -0.85 | -2.41 |

We would then pass this data to our function, which would return a prediction of the value of `y`. Stated mathematically, our prediction will often be an estimate of the conditional mean of \(Y\), given values of the \(\boldsymbol{X}\) variables.

\[ \mu(\boldsymbol{x}) = \mathbb{E}[Y \mid \boldsymbol{X} = \boldsymbol{x}] \]

In other words, we want to learn this function, \(\mu(\boldsymbol{x})\). Much more on this later.^{29}
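To make the idea of estimating \(\mu(\boldsymbol{x})\) concrete, here is a minimal sketch, written in Python purely for illustration. It uses a k-nearest-neighbors average, which is only one of many possible estimators and is not necessarily the method developed later in the text. The names `train`, `distance`, and `mu_hat`, the choice `k = 3`, and the way the categorical feature `x1` enters the distance are all assumptions made for this example.

```python
import math

# Training rows copied from the regression table above: (x1, x2, x3, y).
train = [
    ("A", -0.66, 0.48, 14.09),
    ("A", 1.55, 0.97, 2.92),
    ("A", -1.19, -0.81, 15.00),
    ("A", 0.15, 0.28, 9.29),
    ("B", -1.09, -0.16, 17.57),
    ("B", 1.61, 1.94, 2.12),
    ("B", 0.04, 1.72, 8.92),
    ("A", 1.31, 0.36, 4.40),
    ("C", 0.98, 0.30, 4.40),
    ("C", 0.88, -0.39, 4.52),
]


def distance(a, b):
    """Hypothetical distance: Euclidean on x2 and x3, with 1 added
    under the square root when the categorical x1 values differ."""
    mismatch = 0.0 if a[0] == b[0] else 1.0
    return math.sqrt(mismatch + (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2)


def mu_hat(x, k=3):
    """Estimate E[Y | X = x] by averaging y over the k nearest training rows."""
    nearest = sorted(train, key=lambda row: distance(row[:3], x))[:k]
    return sum(row[3] for row in nearest) / k


# The "new" observation from the text, which has no response value.
new_x = ("B", -0.85, -2.41)
prediction = mu_hat(new_x)
print(round(prediction, 2))
```

With `k` equal to the number of training rows, `mu_hat` collapses to the overall mean of `y`; smaller values of `k` make the estimate depend more heavily on the features.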

#### 1.2.1.2 Classification

Classification is similar to regression, except it considers **categorical** response variables.

| x1 | x2 | x3 | y |
|---|---|---|---|
| Q | 0.46 | 5.42 | B |
| Q | 0.72 | 0.83 | C |
| Q | 0.93 | 5.93 | B |
| Q | 0.26 | 5.68 | A |
| P | 0.46 | 0.49 | B |
| P | 0.94 | 3.09 | B |
| P | 0.98 | 2.34 | C |
| P | 0.12 | 5.43 | C |
| Q | 0.47 | 2.68 | B |
| P | 0.56 | 5.02 | B |

As before, we want to learn a function from this data using the same inputs, except this time, we want it to output one of `A`, `B`, or `C` for predictions of the `y` variable. Again, consider some new data:

| x1 | x2 | x3 |
|---|---|---|
| P | 0.96 | 5.33 |

While ultimately we would like our function to return one of `A`, `B`, or `C`, what we actually would like is an intermediate return of probabilities that `y` is `A`, `B`, or `C`. In other words, we are attempting to estimate the conditional probability that \(Y\) is each of the possible categories, given values of the \(\boldsymbol{X}\) variables.

\[ p_k(\boldsymbol{x}) = P\left[ Y = k \mid \boldsymbol{X} = \boldsymbol{x} \right] \]

We want to learn this function, \(p_k(\boldsymbol{x})\). Much more on this later.
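The two-step idea (estimate probabilities first, then pick a category) can be sketched as follows, again in Python for illustration only. The estimator is a k-nearest-neighbors class proportion, chosen here for simplicity rather than because the text prescribes it. The names `train`, `distance`, and `p_hat`, the choice `k = 5`, and the handling of the categorical `x1` are assumptions.

```python
import math
from collections import Counter

# Training rows copied from the classification table above: (x1, x2, x3, y).
train = [
    ("Q", 0.46, 5.42, "B"),
    ("Q", 0.72, 0.83, "C"),
    ("Q", 0.93, 5.93, "B"),
    ("Q", 0.26, 5.68, "A"),
    ("P", 0.46, 0.49, "B"),
    ("P", 0.94, 3.09, "B"),
    ("P", 0.98, 2.34, "C"),
    ("P", 0.12, 5.43, "C"),
    ("Q", 0.47, 2.68, "B"),
    ("P", 0.56, 5.02, "B"),
]
CLASSES = ("A", "B", "C")


def distance(a, b):
    """Hypothetical distance: Euclidean on x2 and x3, with 1 added
    under the square root when the categorical x1 values differ."""
    mismatch = 0.0 if a[0] == b[0] else 1.0
    return math.sqrt(mismatch + (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2)


def p_hat(x, k=5):
    """Estimate P[Y = c | X = x] for each class c as the share of the
    k nearest training rows that carry label c."""
    nearest = sorted(train, key=lambda row: distance(row[:3], x))[:k]
    counts = Counter(row[3] for row in nearest)
    return {c: counts[c] / k for c in CLASSES}


# The new observation from the text.
new_x = ("P", 0.96, 5.33)
probs = p_hat(new_x)                 # intermediate output: one probability per class
label = max(probs, key=probs.get)    # final output: the most probable class
print(probs, label)
```

Keeping the probabilities as a separate step is deliberate: they carry more information than the final label, and they are what \(p_k(\boldsymbol{x})\) describes.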

### 1.2.2 Unsupervised Learning

Unsupervised learning is a very broad task that is rather difficult to define. Essentially, it is learning without a response variable. To get a better idea about what unsupervised learning is, consider some specific tasks.

| x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|
| 2.74 | 0.46 | 5.42 | 4.43 | 2.28 |
| 2.81 | 0.72 | 0.83 | 4.87 | 2.61 |
| 0.86 | 0.93 | 5.93 | 2.33 | 0.22 |
| 2.49 | 0.26 | 5.68 | 4.11 | 5.84 |
| 1.93 | 0.46 | 0.49 | 0.02 | 2.59 |
| -8.44 | -9.06 | -6.91 | -5.00 | -4.25 |
| -7.79 | -9.02 | -7.66 | -9.96 | -4.67 |
| -9.60 | -9.88 | -4.57 | -8.75 | -6.16 |
| -8.03 | -9.53 | -7.32 | -4.56 | -4.17 |
| -7.88 | -9.44 | -4.98 | -6.33 | -6.29 |

#### 1.2.2.1 Clustering

Clustering is essentially the task of **grouping** the observations of a dataset. In the above data, can you see an obvious grouping? (Hint: Compare the first five observations to the second five observations.) In general, we try to group observations that are similar.
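The text does not commit to a particular clustering algorithm here. As one possible illustration, the Python sketch below runs a bare-bones version of k-means with two clusters on the table above. The function names, the fixed number of iterations, and the deterministic choice of starting centers (the first and last rows) are all assumptions made to keep the example short and reproducible.

```python
# Observations copied from the unsupervised learning table above.
data = [
    [2.74, 0.46, 5.42, 4.43, 2.28],
    [2.81, 0.72, 0.83, 4.87, 2.61],
    [0.86, 0.93, 5.93, 2.33, 0.22],
    [2.49, 0.26, 5.68, 4.11, 5.84],
    [1.93, 0.46, 0.49, 0.02, 2.59],
    [-8.44, -9.06, -6.91, -5.00, -4.25],
    [-7.79, -9.02, -7.66, -9.96, -4.67],
    [-9.60, -9.88, -4.57, -8.75, -6.16],
    [-8.03, -9.53, -7.32, -4.56, -4.17],
    [-7.88, -9.44, -4.98, -6.33, -6.29],
]


def sq_dist(a, b):
    """Squared Euclidean distance between two observations."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def column_means(rows):
    """Coordinate-wise mean of a non-empty list of observations."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def two_means(rows, n_iter=10):
    """Alternate between assigning rows to the nearest center and
    recomputing each center as the mean of its assigned rows.
    Assumes neither cluster ever becomes empty, which holds for this data."""
    centers = [rows[0], rows[-1]]  # assumed deterministic starting centers
    labels = []
    for _ in range(n_iter):
        labels = [min((0, 1), key=lambda j: sq_dist(r, centers[j])) for r in rows]
        centers = [
            column_means([r for r, lab in zip(rows, labels) if lab == j])
            for j in (0, 1)
        ]
    return labels


labels = two_means(data)
print(labels)
```

Here “similar” means “close in squared Euclidean distance”, which is itself a modeling choice; a different notion of similarity could produce different groups on less cleanly separated data.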

#### 1.2.2.2 Density Estimation

Density estimation tries to do exactly what the name implies: estimate the density. In this case, that is the joint density of \(X_1, X_2, X_3, X_4, X_5\). In other words, we would like to learn the **function** that generated this data.^{30}
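As a rough illustration of what an estimated joint density could look like, the Python sketch below builds a Gaussian kernel density estimate from the ten observations in the table above. This is only one of many density estimators and is not claimed to be the one the text has in mind. The function name `kde` and the fixed bandwidth `h = 1.0` are assumptions, and with only ten points in five dimensions the estimate is far too crude for real use.

```python
import math

# Observations copied from the unsupervised learning table above.
data = [
    [2.74, 0.46, 5.42, 4.43, 2.28],
    [2.81, 0.72, 0.83, 4.87, 2.61],
    [0.86, 0.93, 5.93, 2.33, 0.22],
    [2.49, 0.26, 5.68, 4.11, 5.84],
    [1.93, 0.46, 0.49, 0.02, 2.59],
    [-8.44, -9.06, -6.91, -5.00, -4.25],
    [-7.79, -9.02, -7.66, -9.96, -4.67],
    [-9.60, -9.88, -4.57, -8.75, -6.16],
    [-8.03, -9.53, -7.32, -4.56, -4.17],
    [-7.88, -9.44, -4.98, -6.33, -6.29],
]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def kde(x, rows, h=1.0):
    """Gaussian kernel density estimate at x: the average of spherical
    normal densities (standard deviation h) centered at each row."""
    d = len(x)
    norm = (2 * math.pi * h * h) ** (-d / 2)
    total = sum(math.exp(-sq_dist(x, r) / (2 * h * h)) for r in rows)
    return norm * total / len(rows)


# Estimated density at an observed point and at a point far from all data.
near = kde(data[0], data)
far = kde([50.0, 50.0, 50.0, 50.0, 50.0], data)
print(near, far)
```

The estimate is a function in the sense used throughout this chapter: it takes a point as input and returns a non-negative number that is larger where the observed data are concentrated.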

#### 1.2.2.3 Outlier Detection

Consider some new data:

| x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|
| 67 | 66.68 | 66.26 | 69.49 | 70 |

Was this data generated by the same process as the data above? If we don’t believe so, we would call it an outlier.
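One simple way to turn this question into a computation is to compare how far the new observation is from its nearest neighbor with how far the original observations are from theirs. The Python sketch below does exactly that. The rule used (flag the point if its nearest-neighbor distance exceeds the largest leave-one-out nearest-neighbor distance in the original data) is an assumption chosen for illustration, not a method taken from the text, and the names `nn_distance`, `baseline`, and `threshold` are hypothetical.

```python
import math

# Observations copied from the unsupervised learning table above.
data = [
    [2.74, 0.46, 5.42, 4.43, 2.28],
    [2.81, 0.72, 0.83, 4.87, 2.61],
    [0.86, 0.93, 5.93, 2.33, 0.22],
    [2.49, 0.26, 5.68, 4.11, 5.84],
    [1.93, 0.46, 0.49, 0.02, 2.59],
    [-8.44, -9.06, -6.91, -5.00, -4.25],
    [-7.79, -9.02, -7.66, -9.96, -4.67],
    [-9.60, -9.88, -4.57, -8.75, -6.16],
    [-8.03, -9.53, -7.32, -4.56, -4.17],
    [-7.88, -9.44, -4.98, -6.33, -6.29],
]


def euclidean(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def nn_distance(x, rows):
    """Distance from x to its nearest neighbor among rows."""
    return min(euclidean(x, r) for r in rows)


# Leave-one-out baseline: each original row's distance to its nearest
# neighbor among the other nine rows.
baseline = [
    nn_distance(row, data[:i] + data[i + 1:]) for i, row in enumerate(data)
]
threshold = max(baseline)  # assumed cutoff: the largest baseline distance

# The new observation from the text.
new_x = [67, 66.68, 66.26, 69.49, 70]
score = nn_distance(new_x, data)
is_outlier = score > threshold
print(round(score, 2), round(threshold, 2), is_outlier)
```

The cutoff is the judgment call here. Any rule of this kind encodes a belief about how unusual an observation must be before we stop attributing it to the same process.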

## 1.3 Open Questions

The two previous sections were probably more confusing than helpful. Of course they were: we haven’t started learning yet! Hopefully, you are currently pondering one very specific question:

*How* do we **learn** functions from data?

That’s what this text will be about! We will spend a lot of time on this question. It is what we statisticians call fitting a model. On some level, the answer is: look at a bunch of old data before predicting on new data.

While we will dedicate a large amount of time to answering this question, sometimes, some of the details might go unanswered. Since this is an introductory text, we can only go so far. However, as long as we answer another question, this will be OK.

*How* do we **evaluate** how well learned functions work?

This text places a high priority on being able to **do** machine learning, specifically do machine learning in R. You can actually do a lot of machine learning without fully understanding how the learning is taking place. That makes the *evaluation* of ML models extremely important.

^{24} This is a purposefully narrow view of machine learning. Obviously there’s a lot more to ML, but the author believes this statement will help readers understand that the methods learned in this text are simply tools that must be evaluated as a part of a larger analysis.

^{25} An alternative title of this book could be: *ML is boring but useful*.

^{26} There are certainly people who do legitimately work on AI, but the strong statement is made here to try to downplay the hype.

^{27} There are technically other tasks such as reinforcement learning and semi-supervised learning, but they are outside the scope of this text. To understand these advanced tasks, you should first learn the basics!

^{28} These variables are often called “predictor” variables, but we find this nomenclature to be needlessly confusing.

^{29} For now, just understand that we are able to make a “best guess” for a new observation.

^{30} You could take the position that this is the **only** machine learning task, and all other tasks are subsets of this task. We’ll hold off on explaining this for a while.