Chapter 1 Introduction

  • TODO: Show package startup messages? Check for conflicts!
  • TODO: Should this be split into three analyses with different packages?

1.1 Regression: Powerlifting

1.1.2 Data

  • TODO: Why readr::col_factor() and not just col_factor()?
  • TODO: Free-text columns should be character and categorical columns should be factors.

  • TODO: Is na.omit() actually a good idea?
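
One way to approach the na.omit() question is to compare how many rows it drops against a check restricted to the columns a model would actually use. The sketch below uses a small made-up data frame (the names, values, and the choice of modeling columns are assumptions for illustration, not the powerlifting data) and base R only.

```r
# Hypothetical toy data; values are invented for illustration only.
lifts <- data.frame(
  Name     = c("A", "B", "C", "D", "E"),          # free text stays character
  Sex      = factor(c("F", "M", "F", "M", "F")),  # category stored as a factor
  Age      = c(32, NA, 28, 41, NA),
  Squat    = c(128, 200, NA, 180, 110),
  Deadlift = c(150, 240, 138, 220, 135),
  stringsAsFactors = FALSE
)

# Count the missing values per column before deciding what to drop.
colSums(is.na(lifts))

# na.omit() drops every row with a missing value in ANY column.
all_complete <- na.omit(lifts)
nrow(all_complete)

# Restricting the check to the columns a model would use keeps more rows.
# (Which columns those are is an assumption made here.)
model_cols <- c("Sex", "Squat", "Deadlift")
model_complete <- lifts[complete.cases(lifts[, model_cols]), ]
nrow(model_complete)
```

In this toy case na.omit() keeps 2 of 5 rows, while the column-restricted check keeps 4, because the rows missing only Age are not discarded.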

## # A tibble: 3,604 x 8
##    Name             Sex   Bodyweight   Age Squat Bench Deadlift Total
##    <chr>            <fct>      <dbl> <dbl> <dbl> <dbl>    <dbl> <dbl>
##  1 Ariel Stier      F           60      32 128.   72.5     150   350 
##  2 Nicole Bueno     F           60      26 110    60       135   305 
##  3 Lisa Peterson    F           67.5    28 118.   67.5     138.  322.
##  4 Shelby Bandula   F           67.5    26  92.5  67.5     140   300 
##  5 Lisa Lindhorst   F           67.5    28  92.5  62.5     132.  288.
##  6 Laura Burnett    F           67.5    30  90    45       108.  242.
##  7 Suzette Bradley  F           75      38 125    75       158.  358.
##  8 Norma Romero     F           75      20  92.5  42.5     125   260 
##  9 Georgia Andrews  F           82.5    29 108.   52.5     120   280 
## 10 Christal Bundang F           90      30 100    55       125   280 
## # … with 3,594 more rows

1.1.4 Modeling

  • TODO: Note: we are not using Name. Why? We are not using Total. Why?
  • TODO: Look at what happens when Total is used as a predictor! You’ll see the problem with lm(); with randomForest() you’ll just get optimistic results.
  • TODO: What variables are allowed? (With respect to real world problem.)
  • TODO: What variables lead to the best predictions?
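
The concern about Total can be illustrated with simulated data. In the table above, Total is the sum of Squat, Bench, and Deadlift, so if one of the three lifts is the response, Total leaks it. The sketch below assumes Deadlift is the response (the source does not say which lift is modeled) and uses invented coefficients to simulate the lifts; only base R is needed.

```r
set.seed(1)
n <- 200

# Simulated lifts; the means, slopes, and noise level are assumptions.
squat    <- rnorm(n, mean = 150, sd = 30)
bench    <- rnorm(n, mean = 100, sd = 20)
deadlift <- 40 + 0.9 * squat + 0.3 * bench + rnorm(n, sd = 15)

lifts <- data.frame(Squat = squat, Bench = bench, Deadlift = deadlift)
lifts$Total <- lifts$Squat + lifts$Bench + lifts$Deadlift

# Without Total: an ordinary regression with real residual error.
fit_honest <- lm(Deadlift ~ Squat + Bench, data = lifts)

# With Total: Deadlift = Total - Squat - Bench exactly, so the fit is
# (numerically) perfect and tells us nothing about predictive ability.
fit_leaky <- lm(Deadlift ~ Squat + Bench + Total, data = lifts)

# Root mean squared error of the in-sample residuals.
rmse <- function(fit) sqrt(mean(residuals(fit)^2))

rmse(fit_honest)
rmse(fit_leaky)
round(coef(fit_leaky), 3)
```

The leaky model recovers coefficients of about -1, -1, and 1 for Squat, Bench, and Total, which is just the arithmetic identity rather than anything learned about lifters.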

1.2 Classification: Handwritten Digits

1.2.5 Model Evaluation

## [1] 0.8839
##          actual
## predicted    0    1    2    3    4    5    6    7    8    9
##         0  959    0   14    6    1   15   22    1   10   10
##         1    0 1112    5    5    1   16    5    9    5    6
##         2    1    2  928   31    3    5   19   24   17    8
##         3    0    2   11  820    1   24    0    1   13   13
##         4    4    0   13    1  839   21   39   11   18   40
##         5    3    1    1   88    3  720   18    1   25    9
##         6    7    2   15    3   25   15  848    0   18    2
##         7    2    1   29   24    1   14    2  928   15   30
##         8    4   14   13   22    5   19    5    4  797    3
##         9    0    1    3   10  103   43    0   49   56  888
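
The reported accuracy can be recovered directly from the confusion matrix: it is the sum of the diagonal divided by the total count. The sketch below re-enters the table above by hand in base R (the object names are chosen here for illustration) and also computes per-digit recall and the largest single confusion.

```r
digits <- as.character(0:9)

# Confusion matrix copied from the output above: rows are predictions,
# columns are actual digits.
conf_mat <- matrix(
  c(959,    0,  14,   6,   1,  15,  22,   1,  10,  10,
      0, 1112,   5,   5,   1,  16,   5,   9,   5,   6,
      1,    2, 928,  31,   3,   5,  19,  24,  17,   8,
      0,    2,  11, 820,   1,  24,   0,   1,  13,  13,
      4,    0,  13,   1, 839,  21,  39,  11,  18,  40,
      3,    1,   1,  88,   3, 720,  18,   1,  25,   9,
      7,    2,  15,   3,  25,  15, 848,   0,  18,   2,
      2,    1,  29,  24,   1,  14,   2, 928,  15,  30,
      4,   14,  13,  22,   5,  19,   5,   4, 797,   3,
      0,    1,   3,  10, 103,  43,   0,  49,  56, 888),
  nrow = 10, byrow = TRUE,
  dimnames = list(predicted = digits, actual = digits)
)

# Overall accuracy: correct predictions (diagonal) over all predictions.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy  # 0.8839

# Per-digit recall: correct predictions over the actual count of each digit.
recall <- diag(conf_mat) / colSums(conf_mat)
round(recall, 3)

# Largest off-diagonal cell, i.e. the most frequent single mistake.
off <- conf_mat
diag(off) <- 0
worst <- which(off == max(off), arr.ind = TRUE)
worst_predicted <- rownames(conf_mat)[worst[1, "row"]]
worst_actual    <- colnames(conf_mat)[worst[1, "col"]]
```

Here the most frequent mistake is predicting 9 when the actual digit is 4, which happens 103 times.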

1.3 Clustering: NBA Players

1.3.2 Data

## # A tibble: 100 x 93
##    player_team pos     age tm        g    gs    mp    fg   fga fg_percent
##    <chr>       <fct> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
##  1 Álex Abrin… SG       25 OKC      31     2   588    56   157      0.357
##  2 Quincy Acy… PF       28 PHO      10     0   123     4    18      0.222
##  3 Jaylen Ada… PG       22 ATL      34     1   428    38   110      0.345
##  4 Steven Ada… C        25 OKC      80    80  2669   481   809      0.595
##  5 Bam Adebay… C        21 MIA      82    28  1913   280   486      0.576
##  6 Deng Adel … SF       21 CLE      19     3   194    11    36      0.306
##  7 DeVaughn A… SG       25 DEN       7     0    22     3    10      0.3  
##  8 LaMarcus A… C        33 SAS      81    81  2687   684  1319      0.519
##  9 Rawle Alki… SG       21 CHI      10     1   120    13    39      0.333
## 10 Grayson Al… SG       23 UTA      38     2   416    67   178      0.376
## # … with 90 more rows, and 83 more variables: x3p <dbl>, x3pa <dbl>,
## #   x3p_percent <dbl>, x2p <dbl>, x2pa <dbl>, x2p_percent <dbl>,
## #   e_fg_percent <dbl>, ft <dbl>, fta <dbl>, ft_percent <dbl>, orb <dbl>,
## #   drb <dbl>, trb <dbl>, ast <dbl>, stl <dbl>, blk <dbl>, tov <dbl>,
## #   pf <dbl>, pts <dbl>, fg_pm <dbl>, fga_pm <dbl>, fg_percent_pm <dbl>,
## #   x3p_pm <dbl>, x3pa_pm <dbl>, x3p_percent_pm <dbl>, x2p_pm <dbl>,
## #   x2pa_pm <dbl>, x2p_percent_pm <dbl>, ft_pm <dbl>, fta_pm <dbl>,
## #   ft_percent_pm <dbl>, orb_pm <dbl>, drb_pm <dbl>, trb_pm <dbl>,
## #   ast_pm <dbl>, stl_pm <dbl>, blk_pm <dbl>, tov_pm <dbl>, pf_pm <dbl>,
## #   pts_pm <dbl>, fg_pp <dbl>, fga_pp <dbl>, fg_percent_pp <dbl>,
## #   x3p_pp <dbl>, x3pa_pp <dbl>, x3p_percent_pp <dbl>, x2p_pp <dbl>,
## #   x2pa_pp <dbl>, x2p_percent_pp <dbl>, ft_pp <dbl>, fta_pp <dbl>,
## #   ft_percent_pp <dbl>, orb_pp <dbl>, drb_pp <dbl>, trb_pp <dbl>,
## #   ast_pp <dbl>, stl_pp <dbl>, blk_pp <dbl>, tov_pp <dbl>, pf_pp <dbl>,
## #   pts_pp <dbl>, o_rtg_pp <dbl>, d_rtg_pp <dbl>, per <dbl>,
## #   ts_percent <dbl>, x3p_ar <dbl>, f_tr <dbl>, orb_percent <dbl>,
## #   drb_percent <dbl>, trb_percent <dbl>, ast_percent <dbl>,
## #   stl_percent <dbl>, blk_percent <dbl>, tov_percent <dbl>,
## #   usg_percent <dbl>, ows <dbl>, dws <dbl>, ws <dbl>, ws_48 <dbl>,
## #   obpm <dbl>, dbpm <dbl>, bpm <dbl>, vorp <dbl>

1.3.5 Model Evaluation

1.3.6 Discussion