Basics of Statistical Learning
Welcome
Welcome to Basics of Statistical Learning! What a boring title! The title was chosen to mirror that of the University of Illinois at Urbana-Champaign course STAT 432 - Basics of Statistical Learning. That title was chosen to meet certain University course naming conventions, hence the boring title. A more appropriate title would be “a broad introduction to machine learning from the perspective of a statistician who uses R[1] and emphasizes practice over theory.” This is more descriptive, still boring, and way too many words. Anyway, this “book” will be referred to as BSL for short.
This chapter will outline the who, what, when, where, why, and how[2] of this book, but not necessarily in that order.
0.1 Who?
0.1.1 Readers
This book is targeted at advanced undergraduate or first-year MS students in Statistics who have no prior machine learning experience. While both will be discussed in great detail, previous experience with both statistical modeling and R is assumed. In other words, this book is for students in STAT 432[3].
If you are reading this book but are not involved in STAT 432, we assume:
- a semester of calculus-based probability and statistics
- familiarity with linear algebra
- enough understanding of linear models and R to be able to use R’s formula syntax to specify models
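As a quick self-check on the last prerequisite, here is a minimal sketch of R's formula syntax. It is not taken from the course materials: the `mtcars` dataset (which ships with base R) and the chosen variables are assumptions made purely for illustration.

```r
# Illustrative only: mtcars and these variables are assumed, not from the text.

# Additive model: mpg as a linear function of weight and horsepower
fit_add <- lm(mpg ~ wt + hp, data = mtcars)

# `wt * hp` expands to wt + hp + wt:hp (main effects plus their interaction)
fit_int <- lm(mpg ~ wt * hp, data = mtcars)

# `.` stands for every column in the data other than the response
fit_all <- lm(mpg ~ ., data = mtcars)

# Inspect the estimated coefficients of the additive model
coef(fit_add)
```

If each of these three formulas reads naturally to you, you likely have the R background this text assumes.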
0.1.3 Acknowledgements
The following is a (likely incomplete) list of helpful contributors. This book was also influenced by the helpful contributors to R4SL.
- Jae-Ho Lee - STAT 432, Fall 2019
- W. Jonas Reger - STAT 432, Spring 2020
Please see the CONTRIBUTING document on GitHub for details on interacting with this project. Pull requests encouraged!
0.2 What?
Well, this is a book. But you already knew that. More specifically, this is a book for use in STAT 432. But if you are reading this chapter, you’re either not in STAT 432, or new to STAT 432, so that isn’t really helpful. This is a book about machine learning. But this is too vague a description. It is probably most useful to describe the desired outcome as a result of reading this book. In a single sentence:
After engaging with BSL, readers should feel comfortable training predictive models and evaluating their use as part of larger systems or data analyses.
This sentence is both too specific and too general, so some additional comments about what will and will not be discussed in this text:
- An ability to train models will be emphasized over the ability to understand models at a deep theoretical level. This is not to say that theory will be completely ignored, but some theory will be sacrificed for practicality. Theory[5] will be explored especially when it motivates use in practice.
- Evaluation of models is emphasized as the author takes the position[6] that in practice it is more important to know if your model works than how your model works.
- Rather than making an attempt to illustrate all possible modeling techniques[7], a small set of techniques is emphasized: linear models, nearest neighbors, and decision trees. These will initially serve as examples for theory discussions, but will then become the building blocks for more complex techniques such as lasso, ridge, and random forests.
- While the set of models discussed will be limited[8], the emphasis on an ability to train and evaluate these models should allow a reader to train and evaluate any model in a predictive context, provided it is implemented in a statistical computing environment[9].
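To make “train and evaluate” concrete, the following is a rough sketch of that workflow in base R. It is not the book's own code: the `mtcars` data, the 80/20 split, the seed, and the helper name `rmse` are all assumptions chosen for illustration.

```r
# Illustrative only: data, split proportion, seed, and helper name are assumed.
set.seed(42)

# Hold out part of the data so the model is evaluated on unseen rows
n <- nrow(mtcars)
trn_idx <- sample(n, size = floor(0.8 * n))
trn <- mtcars[trn_idx, ]
tst <- mtcars[-trn_idx, ]

# Train: fit a linear model using only the training rows
fit <- lm(mpg ~ wt + hp, data = trn)

# Evaluate: root mean squared error of predictions on the held-out rows
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}
tst_rmse <- rmse(actual = tst$mpg, predicted = predict(fit, newdata = tst))
tst_rmse
```

The same pattern (fit on one set of rows, predict on another, summarize the errors) carries over when `lm()` is swapped for another modeling function that provides a `predict()` method.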
For a better understanding of the specific topics covered, please see the next chapter, which serves as an overview of the text.
To be clear: This book is not an exhaustive treatment of machine learning. If this is your first brush with machine learning, hopefully it is not your last!
0.3 Why?
Why does this book exist? That is a very good question, especially given the existence of An Introduction to Statistical Learning[10], the immensely popular book[11] by James, Witten, Hastie, and Tibshirani. The author of this text believes ISL is a great text[12], so much so that he would suggest that any readers of BSL also read all of ISL[13]. Despite this, the author conceived a book more in line with the content and goals of STAT 432[14], so here we are.
Why does STAT 432 exist? Short answer: to add a course on machine learning to the undergraduate Statistics curriculum at the University of Illinois. The long story is long, but two individuals deserve credit for their work in the background:
- Ehsan Bokhari for introducing the author to ISL and suggesting that it would make a good foundation for an undergraduate course.
- Jeff Douglas for actually getting the pilot version of STAT 432 off the ground[15].
0.4 Where?
Currently, this text is used exclusively[16] for STAT 432[17] at the University of Illinois at Urbana-Champaign.
The text can be accessed from https://statisticallearning.org/.
0.5 When?
This book was last updated on: 2021-04-19.[18]
0.6 How?
Knowing a bit about how this book was built will help readers better interact with the text.
0.6.1 Build Tools
This book is authored using Bookdown[19], built using Travis-CI, and hosted via GitHub Pages. Details of this setup can be found by browsing the relevant GitHub repository.[20]
Users that are familiar with these tools, most importantly GitHub, are encouraged to contribute. As noted above, please see the CONTRIBUTING document on GitHub for details on interacting with this project.
0.6.2 Active Development
This “book” is under active development. Literally every element of the book is subject to change, at any moment. This text, BSL, is the successor to R4SL, a work that began as a supplement to An Introduction to Statistical Learning but was never finished. (In some sense, this book is just a fresh start due to the author wanting to change the presentation of the material. The author is seriously worried that he will encounter the second-system effect.[21])
Because this book is written with an actively taught course in mind, the text will often, out of convenience, speak directly to the students of that course. Thus, be aware that any references to a “course” are references to STAT 432 @ UIUC.
Since this book is under active development, you may encounter errors ranging from typos, to broken code, to poorly explained topics. If you do, please let us know! Better yet, fix the issue yourself![22] If you are familiar with R Markdown and GitHub, pull requests are highly encouraged! This process is partially automated by the edit button in the top-left corner of the HTML version. If your suggestion or fix becomes part of the book, you will be added to the acknowledgments in this chapter. We’ll also link to your GitHub account or personal website upon request. If you’re not familiar with version control systems, feel free to email the author, dalpiaz2 AT illinois DOT edu.[23] See additional details in the Acknowledgments section above.
While development is taking place, you may see “TODO” items scattered throughout the text. These are mostly notes for internal use, but give the reader some idea of what development is still to come.
0.6.3 Packages
The following will install all R packages needed to follow along with the text.
0.6.4 License
[1] R Core Team, R: A Language and Environment for Statistical Computing (Vienna, Austria: R Foundation for Statistical Computing, 2016), https://www.R-project.org/
[3] STAT 432 is also cross-listed as ASRM 451, but we will exclusively refer to STAT 432 for simplicity.
[4] He does not enjoy writing about himself.
[5] Theory here is ill defined. Loosely, “theory” is activities that are closer to writing theorem-proof mathematics while “practice” is more akin to using built-in statistical computing tools in a language like R.
[7] This is impossible.
[8] Students often ask if we will cover support vector machines or deep learning or [insert latest buzzword model here]. The author believes this is because students consider these to be “cool” methods. One of the goals of this text is to make machine learning seem as uncool as possible. The hope would be for readers to understand something like an SVM to be “just another method” which also needs to be evaluated. The author believes deep learning is useful, but would clutter the presentation because of the additional background and computing that would need to be introduced. Follow-up learning of deep learning is encouraged after reading BSL. Hopefully, by reading BSL, getting up to speed using deep learning will be made easier.
[9] Also provided the user reads the documentation.
[10] Gareth James et al., An Introduction to Statistical Learning, vol. 112 (Springer, 2013), http://faculty.marshall.usc.edu/gareth-james/ISL/
[11] This book is generally referred to as ISL.
[12] He has spent so much time referencing ISL that he found and suggested a typo fix.
[13] One of the biggest strengths of ISL is its readability.
[14] The biggest differences are: assumed reader background, overall structure, and R code usage and style.
[15] Jeff taught the first proto-version of STAT 432 as a topics course, but then allowed the author to take over teaching and development while he worked to get the course fully approved.
[16] If you are using this text elsewhere, that’s great! Please let the author know!
[18] The author has no idea what else to write in this section, but the last updated date seems like useful information.
[19] Yihui Xie, Bookdown: Authoring Books and Technical Documents with R Markdown, 2020, https://github.com/rstudio/bookdown
[21] Wikipedia: Second-System Effect
[22] Yihui Xie: You Do Not Need to Tell Me I Have A Typo in My Documentation
[23] But also consider using this opportunity to learn a bit about version control!