Tidy Data Science with the Tidyverse and Tidymodels is licensed under a Creative Commons Attribution 4.0 International License.
What is multicollinearity?

When multiple predictors are strongly correlated with each other. Multicollinearity can make the coefficients of linear models unstable.
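A minimal sketch of the problem, using simulated data (not from the slides): two nearly identical predictors produce enormous standard errors on their coefficients.

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)  # x2 is nearly a copy of x1
y  <- x1 + rnorm(100)

cor(x1, x2)               # ~0.999: strong multicollinearity
summary(lm(y ~ x1 + x2))  # note the inflated standard errors on x1 and x2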
Principal Component Analysis (PCA) transforms variables into the orthogonal "components" that most concisely capture all of the variation.
Our goal: fit a linear model to the main principal components of the ames data.
1. Start the recipe()
2. Define the variables involved
3. Describe preprocessing step-by-step
recipe()
Creates a recipe for a set of variables.
recipe(Sale_Price ~ ., data = ames)
step_*()
Adds a single transformation to a recipe. Transformations are replayed in order when the recipe is run on data.
rec %>% step_novel(all_nominal()) %>% step_zv(all_predictors())
step_*()
Complete list at: https://recipes.tidymodels.org/reference/index.html
Helper functions for selecting sets of variables.
rec %>%
step_novel(all_nominal()) %>%
step_zv(all_predictors())
| selector | description |
|---|---|
| `all_predictors()` | Each x variable (right side of ~) |
| `all_outcomes()` | Each y variable (left side of ~) |
| `all_numeric()` | Each numeric variable |
| `all_nominal()` | Each categorical variable (e.g. factor, string) |

Use commas to separate multiple selectors:
rec %>% step_novel(all_nominal(), -all_outcomes()) %>% step_zv(all_predictors())
How does recipes know what is a predictor and what is an outcome?

rec <- recipe(Sale_Price ~ ., data = ames)

The formula ➡️ indicates outcomes vs predictors
How does recipes know what is numeric and what is nominal?

rec <- recipe(Sale_Price ~ ., data = ames)

The data ➡️ is only used to catalog the names and types of each variable
PCA requires variables to be centered and scaled. What does that mean?

Standardize or z-score: subtract the mean, then divide by the standard deviation.
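A quick base-R sketch of the same idea (the vector x is made up for illustration):

x <- c(2, 4, 6, 8)
(x - mean(x)) / sd(x)  # subtract the mean, divide by the sd
# same values as scale(x)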
step_center()
Centers numeric variables by subtracting the mean
rec <- recipe(Sale_Price ~ ., data = ames) %>% step_center(all_numeric())
step_scale()
Scales numeric variables by dividing by the standard deviation
rec <- recipe(Sale_Price ~ ., data = ames) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())
Why do you need to "train" a recipe?

Imagine "scaling" a new data point. What do you subtract from it? What do you divide it by?
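A sketch of the answer: the statistics must come from the training set (new_home is a made-up value for illustration):

train_mean <- mean(ames_train$Gr_Liv_Area)
train_sd   <- sd(ames_train$Gr_Liv_Area)

new_home <- 2000                    # hypothetical new data point
(new_home - train_mean) / train_sd  # scaled with *training* statistics

prep() computes and stores these training-set statistics so that bake() can reuse them on new data.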
prep() and bake()

prep() "trains" a recipe; bake() then transforms data with the prepped recipe.

rec %>%
  prep(training = ames_train) %>%
  bake(new_data = ames_test)  # or ames_train

You don't need to do this! The fit functions do it for you.
rec %>% prep(ames_train) %>% bake(ames_test)
#> # A tibble: 732 x 81
#>    MS_SubClass      MS_Zoning     Lot_Frontage Lot_Area Street Alley  Lot_Shape
#>    <fct>            <fct>                <dbl>    <dbl> <fct>  <fct>  <fct>
#>  1 One_Story_1946_… Residential_…      2.48     3.00    Pave   No_Al… Slightly_…
#>  2 One_Story_1946_… Residential_…      0.663    0.217   Pave   No_Al… Regular
#>  3 One_Story_1946_… Residential_…      1.05     0.153   Pave   No_Al… Regular
#>  4 Two_Story_1946_… Residential_…      0.514   -0.00732 Pave   No_Al… Slightly_…
#>  5 Two_Story_1946_… Residential_…     -0.320    6.00    Pave   No_Al… Moderatel…
#>  6 One_Story_1946_… Residential_…      0.901    0.185   Pave   No_Al… Regular
#>  7 One_Story_1946_… Residential_…      0.216   -0.221   Pave   No_Al… Regular
#>  8 Two_Story_PUD_1… Residential_…     -1.09    -1.16    Pave   No_Al… Regular
#>  9 Two_Story_1946_… Residential_…     -1.72    -0.304   Pave   No_Al… Regular
#> 10 Two_Story_1946_… Residential_…      0.00782  0.927   Pave   No_Al… Moderatel…
#> # … with 722 more rows, and 74 more variables: Land_Contour <fct>,
#> #   Utilities <fct>, Lot_Config <fct>, Land_Slope <fct>, Neighborhood <fct>,
#> #   Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
#> #   Overall_Qual <fct>, Overall_Cond <fct>, Year_Built <dbl>,
#> #   Year_Remod_Add <dbl>, Roof_Style <fct>, Roof_Matl <fct>,
#> #   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#> #   Mas_Vnr_Area <dbl>, Exter_Qual <fct>, Exter_Cond <fct>, Foundation <fct>,
#> #   Bsmt_Qual <fct>, Bsmt_Cond <fct>, Bsmt_Exposure <fct>,
#> #   BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>,
#> #   BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, Heating <fct>,
#> #   Heating_QC <fct>, Central_Air <fct>, Electrical <fct>, First_Flr_SF <dbl>,
#> #   Second_Flr_SF <dbl>, Low_Qual_Fin_SF <dbl>, Gr_Liv_Area <dbl>,
#> #   Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>, Full_Bath <dbl>,
#> #   Half_Bath <dbl>, Bedroom_AbvGr <dbl>, Kitchen_AbvGr <dbl>,
#> #   Kitchen_Qual <fct>, TotRms_AbvGrd <dbl>, Functional <fct>,
#> #   Fireplaces <dbl>, Fireplace_Qu <fct>, Garage_Type <fct>,
#> #   Garage_Finish <fct>, Garage_Cars <dbl>, Garage_Area <dbl>,
#> #   Garage_Qual <fct>, Garage_Cond <fct>, Paved_Drive <fct>,
#> #   Wood_Deck_SF <dbl>, Open_Porch_SF <dbl>, Enclosed_Porch <dbl>,
#> #   Three_season_porch <dbl>, Screen_Porch <dbl>, Pool_Area <dbl>,
#> #   Pool_QC <fct>, Fence <fct>, Misc_Feature <fct>, Misc_Val <dbl>,
#> #   Mo_Sold <dbl>, Year_Sold <dbl>, Sale_Type <fct>, Sale_Condition <fct>,
#> #   Longitude <dbl>, Latitude <dbl>, Sale_Price <dbl>
# A tibble: 6 x 1
  Roof_Style
  <fct>
1 Hip
2 Gable
3 Mansard
4 Gambrel
5 Shed
6 Flat
# A tibble: 2,930 x 6
     Hip Gable Mansard Gambrel  Shed  Flat
   <dbl> <dbl>   <dbl>   <dbl> <dbl> <dbl>
 1     1     0       0       0     0     0
 2     0     1       0       0     0     0
 3     1     0       0       0     0     0
 4     1     0       0       0     0     0
 5     0     1       0       0     0     0
 6     0     1       0       0     0     0
 7     0     1       0       0     0     0
 8     0     1       0       0     0     0
 9     0     1       0       0     0     0
10     0     1       0       0     0     0
# … with 2,920 more rows
lm(Sale_Price ~ Roof_Style, data = ames)
#> # A tibble: 6 x 5
#>   term              estimate std.error statistic  p.value
#>   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)        184799.    17167.    10.8   1.57e-26
#> 2 Roof_StyleGable    -14487.    17240.    -0.840 4.01e- 1
#> 3 Roof_StyleGambrel  -46514.    23719.    -1.96  5.00e- 2
#> 4 Roof_StyleHip       41891.    17475.     2.40  1.66e- 2
#> 5 Roof_StyleMansard  -18573.    28818.    -0.644 5.19e- 1
#> 6 Roof_StyleShed       8401.    38386.     0.219 8.27e- 1
step_dummy()
Converts nominal data into dummy variables which are numeric and suitable for linear algebra.
rec %>% step_dummy(all_nominal())
You don't need this for decision trees or ensembles of trees
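A quick way to see what step_dummy() produces (a sketch: prep on ames, then bake back the training data with new_data = NULL):

recipe(Sale_Price ~ Roof_Style, data = ames) %>%
  step_dummy(all_nominal()) %>%
  prep(training = ames) %>%
  bake(new_data = NULL) %>%
  names()
#> one Roof_Style_* indicator column per non-reference level, plus Sale_Price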
Let's think about the modeling. What if there were no homes with shed roofs in the training data?

Will the model have a coefficient for shed roof?

No

What will happen if the test data has a home with a shed roof?

Error!
step_novel()

Adds a catch-all level to a factor for any new values, so the trained model can handle factor levels in the test set that it never saw in training.

rec %>% step_novel(all_nominal()) %>% step_dummy(all_nominal())

Use before step_dummy() so the new level is dummified.
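A toy sketch of step_novel() rescuing an unseen level (the data here are invented for illustration):

train_toy <- data.frame(roof = factor(c("Hip", "Gable")), price = c(1, 2))
test_toy  <- data.frame(roof = factor("Shed"), price = 3)

recipe(price ~ roof, data = train_toy) %>%
  step_novel(roof) %>%
  prep(training = train_toy) %>%
  bake(new_data = test_toy)
#> the unseen "Shed" is recoded to the catch-all "new" level instead of failing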
What would happen if you try to scale a variable that doesn't vary?
Error! You'd be dividing by zero!
step_zv()
Removes zero-variance variables (variables that contain only a single value).
rec %>% step_novel(all_nominal()) %>% step_dummy(all_nominal()) %>% step_zv(all_predictors())
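A toy sketch (invented data) showing the zero-variance filter at work:

toy <- data.frame(y = c(1, 2, 3), constant = c(5, 5, 5), varies = c(1, 0, 2))

recipe(y ~ ., data = toy) %>%
  step_zv(all_predictors()) %>%
  prep(training = toy) %>%
  bake(new_data = NULL)
#> the constant column is dropped; only varies and y remain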
What step function would do PCA?
step_pca()
Replaces variables with components
rec %>% step_pca(all_numeric(), num_comp = 5)
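One way to inspect the components after training (a sketch: tidy() on a prepped recipe returns the loadings; number = 5 assumes step_pca is the fifth step, as below):

pca_prep <- recipe(Sale_Price ~ ., data = ames) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 5) %>%
  prep(training = ames)

tidy(pca_prep, number = 5)  # loading of each variable on PC1..PC5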
Write a recipe for Sale_Price ~ . that:

1. Adds a novel level to all factors
2. Converts all factors to dummy variables
3. Removes zero-variance variables
4. Centers all of the predictors
5. Scales all of the predictors
6. Computes the first 5 principal components

Save the result as pca_rec
pca_rec <- recipe(Sale_Price ~ ., data = ames) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 5)
pca_rec
#> Data Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor         80
#>
#> Operations:
#>
#> Novel factor level assignment for all_nominal()
#> Dummy variables from all_nominal()
#> Zero variance filter on all_predictors()
#> Centering for all_predictors()
#> Scaling for all_predictors()
#> No PCA components were extracted.
You can also give variables a "role" within a recipe and then select by roles.
has_role(match = "privacy")
add_role(rec, Fence, new_role = "privacy")
update_role(rec, Fence, new_role = "privacy", old_role = "yard")
remove_role(rec, Fence, old_role = "yard")
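For example (a sketch), give Fence the "privacy" role and then select it by role:

rec2 <- recipe(Sale_Price ~ ., data = ames) %>%
  update_role(Fence, new_role = "privacy")

summary(rec2)  # Fence now has role "privacy"; select it with has_role("privacy")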
If we use add_model() to add a model to a workflow, what would we use to add a recipe?

Let's see!
Make a workflow that combines pca_rec with lm_spec.
pca_wf <- workflow() %>%
  add_recipe(pca_rec) %>%
  add_model(lm_spec)
pca_wf
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 6 Recipe Steps
#>
#> • step_novel()
#> • step_dummy()
#> • step_zv()
#> • step_center()
#> • step_scale()
#> • step_pca()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
add_recipe()
Adds a recipe to a workflow.
pca_wf <- workflow() %>%
  add_recipe(pca_rec) %>%
  add_model(lm_spec)
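Because the recipe travels with the workflow, fitting and predicting are single calls (a sketch using the objects above):

pca_fit <- pca_wf %>% fit(data = ames_train)
predict(pca_fit, new_data = ames_test)  # prep() and bake() happen automatically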
Do you need to add a formula if you have a recipe?

Nope!
rec <- recipe(Sale_Price ~ ., data = ames)
Try our PCA workflow on ames_folds. What is the estimated RMSE?
pca_wf %>%
  fit_resamples(resamples = ames_folds) %>%
  collect_metrics()
#> # A tibble: 2 x 6
#>   .metric .estimator      mean     n  std_err .config
#>   <chr>   <chr>          <dbl> <int>    <dbl> <chr>
#> 1 rmse    standard   38937.       10 1349.    Preprocessor1_Model1
#> 2 rsq     standard       0.761    10    0.0158 Preprocessor1_Model1
update_recipe()
Replace the recipe in a workflow.
pca_wf %>% update_recipe(bc_rec)
Modify the code to build a new PCA recipe that uses a Box-Cox transformation instead of centering and scaling the data. Then update pca_wf to use the new recipe.
Hint: Guess. Use tab completion. Or visit https://recipes.tidymodels.org/reference/index.html.
bc_rec <- recipe(Sale_Price ~ ., data = ames) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_BoxCox(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 5)

bc_wf <- pca_wf %>%
  update_recipe(bc_rec)
bc_wf %>%
  fit_resamples(resamples = ames_folds) %>%
  collect_metrics()
#> # A tibble: 2 x 6
#>   .metric .estimator      mean     n std_err .config
#>   <chr>   <chr>          <dbl> <int>   <dbl> <chr>
#> 1 rmse    standard   44203.       10 2286.   Preprocessor1_Model1
#> 2 rsq     standard       0.699    10  0.0287 Preprocessor1_Model1
Before (center + scale): RMSE ≈ 38,937. After (Box-Cox): RMSE ≈ 44,203. The Box-Cox recipe actually performs worse here.
library(modeldata)
data(stackoverflow)
glimpse(stackoverflow)
#> Rows: 5,594
#> Columns: 21
#> $ Country                              <fct> United Kingdom, United States, Un…
#> $ Salary                               <dbl> 100000.000, 130000.000, 175000.00…
#> $ YearsCodedJob                        <int> 20, 20, 16, 4, 1, 1, 13, 4, 7, 17…
#> $ OpenSource                           <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, …
#> $ Hobby                                <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, …
#> $ CompanySizeNumber                    <dbl> 5000, 1000, 10000, 1000, 5000, 20…
#> $ Remote                               <fct> Remote, Remote, Not remote, Not r…
#> $ CareerSatisfaction                   <int> 8, 9, 7, 9, 5, 8, 7, 7, 8, 9, 10,…
#> $ Data_scientist                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Database_administrator               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Desktop_applications_developer       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, …
#> $ Developer_with_stats_math_background <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
#> $ DevOps                               <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Embedded_developer                   <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Graphic_designer                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Graphics_programming                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Machine_learning_specialist         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Mobile_developer                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
#> $ Quality_assurance_engineer           <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Systems_administrator                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Web_developer                        <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, …
Bigger version of what we used earlier.
Name that package!

set.seed(100)  # Important!
so_split <- initial_split(stackoverflow, strata = Remote)
so_train <- training(so_split)
so_test  <- testing(so_split)

(Answer: rsample)
Name that package!

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

(Answer: parsnip)
Name that package!

so_rec <- recipe(Remote ~ ., data = so_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_lincomb(all_predictors())

(Answer: recipes)
Name that package!

so_wf <- workflow() %>%
  add_model(tree_spec) %>%
  add_recipe(so_rec)

(Answer: workflows)
fit()

set.seed(1980)
so_fit <- so_wf %>%
  fit(data = so_train)

so_preds <- bind_cols(
  predict(so_fit, new_data = so_test, type = "class"),
  predict(so_fit, new_data = so_test, type = "prob")
) %>%
  mutate(truth = so_test$Remote)

so_metric_set <- metric_set(accuracy, roc_auc)
so_metric_set(so_preds, truth = truth, .pred_Remote, estimate = .pred_class)
#> # A tibble: 2 x 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.898
#> 2 roc_auc  binary         0.5
so_metric_set <- metric_set(accuracy, roc_auc, sens, spec)
so_metric_set(so_preds, truth = truth, .pred_Remote, estimate = .pred_class)
#> # A tibble: 4 x 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.898
#> 2 sens     binary         0
#> 3 spec     binary         1
#> 4 roc_auc  binary         0.5
Can you guess what the confusion matrix looks like?
conf_mat(so_preds, truth = truth, estimate = .pred_class)
#>             Truth
#> Prediction   Remote Not remote
#>   Remote          0          0
#>   Not remote    143       1254
so_train %>% count(Remote)
#> # A tibble: 2 x 2
#>   Remote         n
#>   <fct>      <int>
#> 1 Remote       432
#> 2 Not remote  3765

so_test %>% count(Remote)
#> # A tibble: 2 x 2
#>   Remote         n
#>   <fct>      <int>
#> 1 Remote       143
#> 2 Not remote  1254
Sub-class sampling
Add a recipe step to down-sample the majority class of the Remote variable in the training set prior to model training. Edit your workflow, then re-fit the model and examine the metrics. Is the ROC AUC better than chance (0.5)?
so_down <- recipe(Remote ~ ., data = so_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_lincomb(all_predictors()) %>%
  step_downsample(all_outcomes())

so_downwf <- so_wf %>%
  update_recipe(so_down)

set.seed(1980)
so_downfit <- so_downwf %>%
  fit(data = so_train)

so_downpreds <- bind_cols(
  predict(so_downfit, new_data = so_test, type = "class"),
  predict(so_downfit, new_data = so_test, type = "prob")
) %>%
  mutate(truth = so_test$Remote)
so_metric_set(so_downpreds, truth = truth, .pred_Remote, estimate = .pred_class)
#> # A tibble: 4 x 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.658
#> 2 sens     binary         0.552
#> 3 spec     binary         0.670
#> 4 roc_auc  binary         0.630
juice()
Get the preprocessed training data back from a prepped recipe. Returns a tibble.
so_down %>% prep(training = so_train) %>% juice()
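In recent versions of recipes, bake() with new_data = NULL does the same job as juice():

so_down %>% prep(training = so_train) %>% bake(new_data = NULL)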
so_train %>% count(Remote)
#> # A tibble: 2 x 2
#>   Remote         n
#>   <fct>      <int>
#> 1 Remote       432
#> 2 Not remote  3765
so_down %>%
  prep(training = so_train) %>%
  juice() %>%
  count(Remote)
#> # A tibble: 2 x 2
#>   Remote         n
#>   <fct>      <int>
#> 1 Remote       432
#> 2 Not remote   432
step_downsample()

Down-sampling is performed on the training set only. The default is skip = TRUE, so the step is skipped when the recipe is baked on new data.

so_test %>% count(Remote)
#> # A tibble: 2 x 2
#>   Remote         n
#>   <fct>      <int>
#> 1 Remote       143
#> 2 Not remote  1254

so_down %>%
  prep(training = so_train) %>%
  bake(new_data = so_test) %>%
  count(Remote)
#> # A tibble: 2 x 2
#>   Remote         n
#>   <fct>      <int>
#> 1 Remote       143
#> 2 Not remote  1254