# Chapter 12 Regularization

Chapter Status: Currently this chapter is very sparse. It essentially only expands upon an example discussed in ISL, thus only illustrates usage of the methods. Mathematical and conceptual details of the methods will be added later.

We will use the Hitters dataset from the ISLR package to explore two shrinkage methods: ridge regression and lasso. These are otherwise known as penalized regression methods.

data(Hitters, package = "ISLR")

This dataset has some missing data in the response Salaray. We use the na.omit() function to clean the dataset for ease-of-use.

sum(is.na(Hitters))
## [1] 59
sum(is.na(Hitters$Salary)) ## [1] 59 Hitters = na.omit(Hitters) sum(is.na(Hitters)) ## [1] 0 The feature variables are offensive and defensive statistics for a number of baseball players. names(Hitters) ## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks" ## [7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI" ## [13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors" ## [19] "Salary" "NewLeague" We use the glmnet() and cv.glmnet() functions from the glmnet package to fit penalized regressions. library(glmnet) Unfortunately, the glmnet function does not allow the use of model formulas, so we setup the data for ease of use with glmnet. Eventually we will use train() from caret which does allow for fitting penalized regression with the formula syntax, but to explore some of the details, we first work with the functions from glmnet directly. X = model.matrix(Salary ~ ., Hitters)[, -1] y = Hitters$Salary

Note, we’re being lazy and just using the full dataset as the training dataset.

First, we fit an ordinary linear regression, and note the size of the features’ coefficients, and features’ coefficients squared. (The two penalties we will use.)

fit = lm(Salary ~ ., Hitters)
coef(fit)
##  (Intercept)        AtBat         Hits        HmRun         Runs          RBI
##  163.1035878   -1.9798729    7.5007675    4.3308829   -2.3762100   -1.0449620
##        Walks        Years       CAtBat        CHits       CHmRun        CRuns
##    6.2312863   -3.4890543   -0.1713405    0.1339910   -0.1728611    1.4543049
##         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists
##    0.8077088   -0.8115709   62.5994230 -116.8492456    0.2818925    0.3710692
##       Errors   NewLeagueN
##   -3.3607605  -24.7623251
sum(abs(coef(fit)[-1]))
## [1] 238.7295
sum(coef(fit)[-1] ^ 2)
## [1] 18337.3

## 12.1 Ridge Regression

We first illustrate ridge regression, which can be fit using glmnet() with alpha = 0 and seeks to minimize

$\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} \beta_j^2 .$

Notice that the intercept is not penalized. Also, note that that ridge regression is not scale invariant like the usual unpenalized regression. Thankfully, glmnet() takes care of this internally. It automatically standardizes predictors for fitting, then reports fitted coefficient using the original scale.

The two plots illustrate how much the coefficients are penalized for different values of $$\lambda$$. Notice none of the coefficients are forced to be zero.

par(mfrow = c(1, 2))
fit_ridge = glmnet(X, y, alpha = 0)
plot(fit_ridge)
plot(fit_ridge, xvar = "lambda", label = TRUE)

We use cross-validation to select a good $$\lambda$$ value. The cv.glmnet()function uses 10 folds by default. The plot illustrates the MSE for the $$\lambda$$s considered. Two lines are drawn. The first is the $$\lambda$$ that gives the smallest MSE. The second is the $$\lambda$$ that gives an MSE within one standard error of the smallest.

fit_ridge_cv = cv.glmnet(X, y, alpha = 0)
plot(fit_ridge_cv)

The cv.glmnet() function returns several details of the fit for both $$\lambda$$ values in the plot. Notice the penalty terms are smaller than the full linear regression. (As we would expect.)

# estimated coefficients, using 1-SE rule lambda, default behavior
coef(fit_ridge_cv)
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) 199.418113624
## AtBat         0.093426871
## Hits          0.389767263
## HmRun         1.212875007
## Runs          0.623229048
## RBI           0.618547529
## Walks         0.810467707
## Years         2.544170910
## CAtBat        0.007897059
## CHits         0.030554662
## CHmRun        0.226545984
## CRuns         0.061265846
## CRBI          0.063384832
## CWalks        0.060720300
## LeagueN       3.743295031
## DivisionW   -23.545192292
## PutOuts       0.056202373
## Assists       0.007879196
## Errors       -0.164203267
## NewLeagueN    3.313773161
# estimated coefficients, using minimum lambda
coef(fit_ridge_cv, s = "lambda.min")
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  8.112693e+01
## AtBat       -6.815959e-01
## Hits         2.772312e+00
## HmRun       -1.365680e+00
## Runs         1.014826e+00
## RBI          7.130225e-01
## Walks        3.378558e+00
## Years       -9.066800e+00
## CAtBat      -1.199478e-03
## CHits        1.361029e-01
## CHmRun       6.979958e-01
## CRuns        2.958896e-01
## CRBI         2.570711e-01
## CWalks      -2.789666e-01
## LeagueN      5.321272e+01
## DivisionW   -1.228345e+02
## PutOuts      2.638876e-01
## Assists      1.698796e-01
## Errors      -3.685645e+00
## NewLeagueN  -1.810510e+01
# penalty term using minimum lambda
sum(coef(fit_ridge_cv, s = "lambda.min")[-1] ^ 2)
## [1] 18367.29
# estimated coefficients, using 1-SE rule lambda
coef(fit_ridge_cv, s = "lambda.1se")
## 20 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) 199.418113624
## AtBat         0.093426871
## Hits          0.389767263
## HmRun         1.212875007
## Runs          0.623229048
## RBI           0.618547529
## Walks         0.810467707
## Years         2.544170910
## CAtBat        0.007897059
## CHits         0.030554662
## CHmRun        0.226545984
## CRuns         0.061265846
## CRBI          0.063384832
## CWalks        0.060720300
## LeagueN       3.743295031
## DivisionW   -23.545192292
## PutOuts       0.056202373
## Assists       0.007879196
## Errors       -0.164203267
## NewLeagueN    3.313773161
# penalty term using 1-SE rule lambda
sum(coef(fit_ridge_cv, s = "lambda.1se")[-1] ^ 2)
## [1] 588.9958
# predict using minimum lambda
predict(fit_ridge_cv, X, s = "lambda.min")
# predict using 1-SE rule lambda, default behavior
predict(fit_ridge_cv, X)
# calculate "train error"
mean((y - predict(fit_ridge_cv, X)) ^ 2)
## [1] 130404.9
# CV-MSEs
fit_ridge_cv$cvm ## [1] 205333.0 203746.6 203042.6 202800.1 202535.1 202245.8 201930.0 201585.5 ## [9] 201210.0 200800.7 200355.2 199870.4 199343.3 198770.8 198149.6 197476.1 ## [17] 196746.9 195958.5 195106.9 194188.7 193200.1 192137.6 190997.7 189777.3 ## [25] 188473.6 187084.0 185606.8 184040.5 182384.5 180639.3 178805.9 176886.6 ## [33] 174885.0 172805.6 170654.4 168438.7 166167.0 163848.8 161494.9 159117.2 ## [41] 156728.1 154340.9 151968.9 149625.5 147323.7 145076.2 142894.4 140789.0 ## [49] 138769.1 136842.4 135014.7 133290.7 131673.5 130164.1 128762.4 127466.9 ## [57] 126275.0 125183.1 124186.7 123280.9 122460.4 121719.7 121053.1 120454.6 ## [65] 119919.2 119444.3 119024.2 118649.9 118318.3 118028.6 117776.4 117554.6 ## [73] 117360.3 117194.2 117049.5 116921.7 116812.9 116716.9 116632.4 116553.6 ## [81] 116485.1 116419.6 116353.8 116293.9 116229.4 116166.5 116098.3 116029.8 ## [89] 115952.0 115877.2 115786.6 115706.7 115603.7 115516.4 115403.0 115306.9 ## [97] 115188.2 115085.7 114964.5 114865.5 # CV-MSE using minimum lambda fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.min]
## [1] 114865.5
# CV-MSE using 1-SE rule lambda
fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.1se] ## [1] 135014.7 ## 12.2 Lasso We now illustrate lasso, which can be fit using glmnet() with alpha = 1 and seeks to minimize $\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} |\beta_j| .$ Like ridge, lasso is not scale invariant. The two plots illustrate how much the coefficients are penalized for different values of $$\lambda$$. Notice some of the coefficients are forced to be zero. par(mfrow = c(1, 2)) fit_lasso = glmnet(X, y, alpha = 1) plot(fit_lasso) plot(fit_lasso, xvar = "lambda", label = TRUE) Again, to actually pick a $$\lambda$$, we will use cross-validation. The plot is similar to the ridge plot. Notice along the top is the number of features in the model. (Which changed in this plot.) fit_lasso_cv = cv.glmnet(X, y, alpha = 1) plot(fit_lasso_cv) cv.glmnet() returns several details of the fit for both $$\lambda$$ values in the plot. Notice the penalty terms are again smaller than the full linear regression. (As we would expect.) Some coefficients are 0. # estimated coefficients, using 1-SE rule lambda, default behavior coef(fit_lasso_cv) ## 20 x 1 sparse Matrix of class "dgCMatrix" ## 1 ## (Intercept) 251.4491565 ## AtBat . ## Hits 1.0036044 ## HmRun . ## Runs . ## RBI . ## Walks 0.9857987 ## Years . ## CAtBat . ## CHits . ## CHmRun . ## CRuns 0.1094218 ## CRBI 0.2911570 ## CWalks . ## LeagueN . ## DivisionW . ## PutOuts . ## Assists . ## Errors . ## NewLeagueN . # estimated coefficients, using minimum lambda coef(fit_lasso_cv, s = "lambda.min") ## 20 x 1 sparse Matrix of class "dgCMatrix" ## 1 ## (Intercept) 20.9671791 ## AtBat . ## Hits 1.8647356 ## HmRun . ## Runs . ## RBI . ## Walks 2.2144706 ## Years . ## CAtBat . ## CHits . ## CHmRun . ## CRuns 0.2067687 ## CRBI 0.4123098 ## CWalks . ## LeagueN 0.8359508 ## DivisionW -102.7759368 ## PutOuts 0.2197627 ## Assists . ## Errors . ## NewLeagueN . # penalty term using minimum lambda sum(coef(fit_lasso_cv, s = "lambda.min")[-1] ^ 2) ## [1] 10572.23 # estimated coefficients, using 1-SE rule lambda coef(fit_lasso_cv, s = "lambda.1se") ## 20 x 1 sparse Matrix of class "dgCMatrix" ## 1 ## (Intercept) 251.4491565 ## AtBat . ## Hits 1.0036044 ## HmRun . ## Runs . ## RBI . ## Walks 0.9857987 ## Years . ## CAtBat . ## CHits . ## CHmRun . ## CRuns 0.1094218 ## CRBI 0.2911570 ## CWalks . ## LeagueN . ## DivisionW . ## PutOuts . ## Assists . ## Errors . ## NewLeagueN . # penalty term using 1-SE rule lambda sum(coef(fit_lasso_cv, s = "lambda.1se")[-1] ^ 2) ## [1] 2.075766 # predict using minimum lambda predict(fit_lasso_cv, X, s = "lambda.min") # predict using 1-SE rule lambda, default behavior predict(fit_lasso_cv, X) # calcualte "train error" mean((y - predict(fit_lasso_cv, X)) ^ 2) ## [1] 134647.7 # CV-MSEs fit_lasso_cv$cvm
##  [1] 204170.8 195843.9 186963.3 179622.0 172688.0 165552.2 159227.8 153508.8
##  [9] 148571.9 144422.7 140935.6 137857.2 134976.5 132308.5 129776.3 127472.4
## [17] 125422.6 123615.4 122061.6 120771.5 119696.7 118804.2 118063.8 117451.0
## [25] 116946.3 116563.7 116300.4 116170.3 116132.8 116127.2 116149.0 116183.0
## [33] 116349.2 116600.5 116841.5 117174.1 117598.5 118064.5 118494.5 118796.4
## [41] 118929.0 118976.5 118618.7 118138.5 117636.1 117229.2 116903.9 116659.7
## [49] 116511.3 116429.7 116433.4 116496.7 116635.7 116759.7 116884.4 116892.4
## [57] 116844.4 116781.5 116684.3 116606.2 116552.0 116510.0 116482.6 116475.8
## [65] 116499.1 116537.0 116558.3 116586.5 116598.2 116638.3 116645.8 116686.1
## [73] 116722.1 116751.7 116760.4 116771.7 116792.0 116806.0 116816.9 116825.5
# CV-MSE using minimum lambda
fit_lasso_cv$cvm[fit_lasso_cv$lambda == fit_lasso_cv$lambda.min] ## [1] 116127.2 # CV-MSE using 1-SE rule lambda ## 12.3broom Sometimes, the output from glmnet() can be overwhelming. The broom package can help with that. library(broom) # the output from the commented line would be immense # fit_lasso_cv tidy(fit_lasso_cv) ## # A tibble: 80 x 6 ## lambda estimate std.error conf.low conf.high nzero ## <dbl> <dbl> <dbl> <dbl> <dbl> <int> ## 1 255. 204171. 39048. 165123. 243219. 0 ## 2 233. 195844. 38973. 156870. 234817. 1 ## 3 212. 186963. 37751. 149212. 224714. 2 ## 4 193. 179622. 36640. 142982. 216262. 2 ## 5 176. 172688. 35736. 136952. 208424. 3 ## 6 160. 165552. 35177. 130376. 200729. 4 ## 7 146. 159228. 34767. 124461. 193995. 4 ## 8 133. 153509. 34383. 119126. 187892. 4 ## 9 121. 148572. 34027. 114545. 182599. 4 ## 10 111. 144423. 33723. 110699. 178146. 4 ## # … with 70 more rows # the two lambda values of interest glance(fit_lasso_cv)  ## # A tibble: 1 x 3 ## lambda.min lambda.1se nobs ## <dbl> <dbl> <int> ## 1 17.2 111. 263 ## 12.4 Simulated Data, $$p > n$$ Aside from simply shrinking coefficients (ridge and lasso) and setting some coefficients to 0 (lasso), penalized regression also has the advantage of being able to handle the $$p > n$$ case. set.seed(1234) n = 1000 p = 5500 X = replicate(p, rnorm(n = n)) beta = c(1, 1, 1, rep(0, 5497)) z = X %*% beta prob = exp(z) / (1 + exp(z)) y = as.factor(rbinom(length(z), size = 1, prob = prob)) We first simulate a classification example where $$p > n$$. # glm(y ~ X, family = "binomial") # will not converge We then use a lasso penalty to fit penalized logistic regression. This minimizes $\sum_{i=1}^{n} L\left(y_i, \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right) + \lambda \sum_{j=1}^{p} |\beta_j|$ where $$L$$ is the appropriate negative log-likelihood. library(glmnet) fit_cv = cv.glmnet(X, y, family = "binomial", alpha = 1) plot(fit_cv) We’re being lazy again and using the entire dataset as the training data. head(coef(fit_cv), n = 10) ## 10 x 1 sparse Matrix of class "dgCMatrix" ## 1 ## (Intercept) 0.02397452 ## V1 0.59674958 ## V2 0.56251761 ## V3 0.60065105 ## V4 . ## V5 . ## V6 . ## V7 . ## V8 . ## V9 . fit_cv$nzero
##  s0  s1  s2  s3  s4  s5  s6  s7  s8  s9 s10 s11 s12 s13 s14 s15 s16 s17 s18 s19
##   0   2   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3
## s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 s36 s37 s38 s39
##   3   3   3   3   3   3   3   3   3   3   4   6   7  10  18  24  35  54  65  75
## s40 s41 s42 s43 s44 s45 s46 s47 s48 s49 s50 s51 s52 s53 s54 s55 s56 s57 s58 s59
##  86 100 110 129 147 168 187 202 221 241 254 269 283 298 310 324 333 350 364 375
## s60 s61 s62 s63 s64 s65 s66 s67 s68 s69 s70 s71 s72 s73 s74 s75 s76 s77 s78 s79
## 387 400 411 429 435 445 453 455 462 466 475 481 487 491 496 498 502 504 512 518
## s80 s81 s82 s83 s84 s85 s86 s87 s88 s89 s90 s91 s92 s93 s94 s95 s96 s97 s98 s99
## 523 526 528 536 543 550 559 561 563 566 570 571 576 582 586 590 596 596 600 599

Notice, only the first three predictors generated are truly significant, and that is exactly what the suggested model finds.

fit_1se = glmnet(X, y, family = "binomial", lambda = fit_cv$lambda.1se) which(as.vector(as.matrix(fit_1se$beta)) != 0)
## [1] 1 2 3

We can also see in the following plots, the three features entering the model well ahead of the irrelevant features.

par(mfrow = c(1, 2))
plot(glmnet(X, y, family = "binomial"))
plot(glmnet(X, y, family = "binomial"), xvar = "lambda")

We can extract the two relevant $$\lambda$$ values.

fit_cv$lambda.min ## [1] 0.03718493 fit_cv$lambda.1se
## [1] 0.0514969

TODO: use default of type.measure="deviance" but note that type.measure="class" exists.