Tidy Data Science with the Tidyverse and Tidymodels is licensed under a Creative Commons Attribution 4.0 International License.
What is multicollinearity?

When multiple predictors are strongly correlated with each other. Multicollinearity can make the coefficients of linear models unstable.
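A minimal sketch of the problem, using simulated data (not from the slides): two nearly identical predictors produce enormous standard errors on their coefficients.

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)  # x2 is nearly a copy of x1
y  <- x1 + rnorm(100)

cor(x1, x2)               # ~0.999: strong multicollinearity
summary(lm(y ~ x1 + x2))  # note the inflated standard errors on x1 and x2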
Principal Component Analysis (PCA) transforms variables into the orthogonal "components" that most concisely capture all of the variation.
Our goal: fit a linear model to the main principal components of the ames data.
1. Start the recipe()
2. Define the variables involved
3. Describe preprocessing step-by-step
recipe()
Creates a recipe for a set of variables.
recipe(Sale_Price ~ ., data = ames)
step_*()
Adds a single transformation to a recipe. Transformations are replayed in order when the recipe is run on data.
rec %>% step_novel(all_nominal()) %>% step_zv(all_predictors())
step_*()
Complete list at: https://recipes.tidymodels.org/reference/index.html
Helper functions for selecting sets of variables.
rec %>%
step_novel(all_nominal()) %>%
step_zv(all_predictors())
| selector | description |
|---|---|
| `all_predictors()` | Each x variable (right side of ~) |
| `all_outcomes()` | Each y variable (left side of ~) |
| `all_numeric()` | Each numeric variable |
| `all_nominal()` | Each categorical variable (e.g. factor, string) |

Use commas to separate multiple selectors:
rec %>% step_novel(all_nominal(), -all_outcomes()) %>% step_zv(all_predictors())
How does recipes know what is a predictor and what is an outcome?

rec <- recipe(Sale_Price ~ ., data = ames)

The formula ➡️ indicates outcomes vs predictors
How does recipes know what is numeric and what is nominal?

rec <- recipe(Sale_Price ~ ., data = ames)

The data ➡️ is only used to catalog the names and types of each variable
PCA requires variables to be centered and scaled. What does that mean?

Standardize or z-score: subtract the mean, then divide by the standard deviation.
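A quick base-R sketch of the same idea (the vector x is made up for illustration):

x <- c(2, 4, 6, 8)
(x - mean(x)) / sd(x)  # subtract the mean, divide by the sd
# same values as scale(x)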
step_center()
Centers numeric variables by subtracting the mean
rec <- recipe(Sale_Price ~ ., data = ames) %>% step_center(all_numeric())
step_scale()
Scales numeric variables by dividing by the standard deviation
rec <- recipe(Sale_Price ~ ., data = ames) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())
Why do you need to "train" a recipe?

Imagine "scaling" a new data point. What do you subtract from it? What do you divide it by?
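A sketch of the answer: the statistics must come from the training set (new_home is a made-up value for illustration):

train_mean <- mean(ames_train$Gr_Liv_Area)
train_sd   <- sd(ames_train$Gr_Liv_Area)

new_home <- 2000                    # hypothetical new data point
(new_home - train_mean) / train_sd  # scaled with *training* statistics

prep() computes and stores these training-set statistics so that bake() can reuse them on new data.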
prep() and bake()

prep() "trains" a recipe; bake() then transforms data with the prepped recipe.

rec %>%
  prep(training = ames_train) %>%
  bake(new_data = ames_test)  # or ames_train

You don't need to do this! The fit functions do it for you.
rec %>% prep(ames_train) %>% bake(ames_test)
#> # A tibble: 732 x 81
#>    MS_SubClass      MS_Zoning     Lot_Frontage Lot_Area Street Alley  Lot_Shape
#>    <fct>            <fct>                <dbl>    <dbl> <fct>  <fct>  <fct>
#>  1 One_Story_1946_… Residential_…      2.48     3.00    Pave   No_Al… Slightly_…
#>  2 One_Story_1946_… Residential_…      0.663    0.217   Pave   No_Al… Regular
#>  3 One_Story_1946_… Residential_…      1.05     0.153   Pave   No_Al… Regular
#>  4 Two_Story_1946_… Residential_…      0.514   -0.00732 Pave   No_Al… Slightly_…
#>  5 Two_Story_1946_… Residential_…     -0.320    6.00    Pave   No_Al… Moderatel…
#>  6 One_Story_1946_… Residential_…      0.901    0.185   Pave   No_Al… Regular
#>  7 One_Story_1946_… Residential_…      0.216   -0.221   Pave   No_Al… Regular
#>  8 Two_Story_PUD_1… Residential_…     -1.09    -1.16    Pave   No_Al… Regular
#>  9 Two_Story_1946_… Residential_…     -1.72    -0.304   Pave   No_Al… Regular
#> 10 Two_Story_1946_… Residential_…      0.00782  0.927   Pave   No_Al… Moderatel…
#> # … with 722 more rows, and 74 more variables: Land_Contour <fct>,
#> #   Utilities <fct>, Lot_Config <fct>, Land_Slope <fct>, Neighborhood <fct>,
#> #   Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
#> #   Overall_Qual <fct>, Overall_Cond <fct>, Year_Built <dbl>,
#> #   Year_Remod_Add <dbl>, Roof_Style <fct>, Roof_Matl <fct>,
#> #   Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>,
#> #   Mas_Vnr_Area <dbl>, Exter_Qual <fct>, Exter_Cond <fct>, Foundation <fct>,
#> #   Bsmt_Qual <fct>, Bsmt_Cond <fct>, Bsmt_Exposure <fct>,
#> #   BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>,
#> #   BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, Heating <fct>,
#> #   Heating_QC <fct>, Central_Air <fct>, Electrical <fct>, First_Flr_SF <dbl>,
#> #   Second_Flr_SF <dbl>, Low_Qual_Fin_SF <dbl>, Gr_Liv_Area <dbl>,
#> #   Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>, Full_Bath <dbl>,
#> #   Half_Bath <dbl>, Bedroom_AbvGr <dbl>, Kitchen_AbvGr <dbl>,
#> #   Kitchen_Qual <fct>, TotRms_AbvGrd <dbl>, Functional <fct>,
#> #   Fireplaces <dbl>, Fireplace_Qu <fct>, Garage_Type <fct>,
#> #   Garage_Finish <fct>, Garage_Cars <dbl>, Garage_Area <dbl>,
#> #   Garage_Qual <fct>, Garage_Cond <fct>, Paved_Drive <fct>,
#> #   Wood_Deck_SF <dbl>, Open_Porch_SF <dbl>, Enclosed_Porch <dbl>,
#> #   Three_season_porch <dbl>, Screen_Porch <dbl>, Pool_Area <dbl>,
#> #   Pool_QC <fct>, Fence <fct>, Misc_Feature <fct>, Misc_Val <dbl>,
#> #   Mo_Sold <dbl>, Year_Sold <dbl>, Sale_Type <fct>, Sale_Condition <fct>,
#> #   Longitude <dbl>, Latitude <dbl>, Sale_Price <dbl>
# A tibble: 6 x 1
  Roof_Style
  <fct>
1 Hip
2 Gable
3 Mansard
4 Gambrel
5 Shed
6 Flat
# A tibble: 2,930 x 6
     Hip Gable Mansard Gambrel  Shed  Flat
   <dbl> <dbl>   <dbl>   <dbl> <dbl> <dbl>
 1     1     0       0       0     0     0
 2     0     1       0       0     0     0
 3     1     0       0       0     0     0
 4     1     0       0       0     0     0
 5     0     1       0       0     0     0
 6     0     1       0       0     0     0
 7     0     1       0       0     0     0
 8     0     1       0       0     0     0
 9     0     1       0       0     0     0
10     0     1       0       0     0     0
# … with 2,920 more rows
lm(Sale_Price ~ Roof_Style, data = ames)
#> # A tibble: 6 x 5
#>   term              estimate std.error statistic  p.value
#>   <chr>                <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)        184799.    17167.    10.8   1.57e-26
#> 2 Roof_StyleGable    -14487.    17240.    -0.840 4.01e- 1
#> 3 Roof_StyleGambrel  -46514.    23719.    -1.96  5.00e- 2
#> 4 Roof_StyleHip       41891.    17475.     2.40  1.66e- 2
#> 5 Roof_StyleMansard  -18573.    28818.    -0.644 5.19e- 1
#> 6 Roof_StyleShed       8401.    38386.     0.219 8.27e- 1
step_dummy()
Converts nominal data into dummy variables which are numeric and suitable for linear algebra.
rec %>% step_dummy(all_nominal())
You don't need this for decision trees or ensembles of trees
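A quick way to see what step_dummy() produces (a sketch: prep on ames, then bake back the training data with new_data = NULL):

recipe(Sale_Price ~ Roof_Style, data = ames) %>%
  step_dummy(all_nominal()) %>%
  prep(training = ames) %>%
  bake(new_data = NULL) %>%
  names()
#> one Roof_Style_* indicator column per non-reference level, plus Sale_Price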
Let's think about the modeling. What if there were no homes with shed roofs in the training data?

Will the model have a coefficient for shed roof?

No

What will happen if the test data has a home with a shed roof?

Error!
step_novel()

Adds a catch-all level to a factor for any new values, so the trained model can handle factor levels in the test set that it never saw in training.

rec %>% step_novel(all_nominal()) %>% step_dummy(all_nominal())

Use before step_dummy() so the new level is dummified.
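A toy sketch of step_novel() rescuing an unseen level (the data here are invented for illustration):

train_toy <- data.frame(roof = factor(c("Hip", "Gable")), price = c(1, 2))
test_toy  <- data.frame(roof = factor("Shed"), price = 3)

recipe(price ~ roof, data = train_toy) %>%
  step_novel(roof) %>%
  prep(training = train_toy) %>%
  bake(new_data = test_toy)
#> the unseen "Shed" is recoded to the catch-all "new" level instead of failing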
What would happen if you try to scale a variable that doesn't vary?
Error! You'd be dividing by zero!
step_zv()
Removes zero-variance variables (variables that contain only a single value).
rec %>% step_novel(all_nominal()) %>% step_dummy(all_nominal()) %>% step_zv(all_predictors())
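A toy sketch (invented data) showing the zero-variance filter at work:

toy <- data.frame(y = c(1, 2, 3), constant = c(5, 5, 5), varies = c(1, 0, 2))

recipe(y ~ ., data = toy) %>%
  step_zv(all_predictors()) %>%
  prep(training = toy) %>%
  bake(new_data = NULL)
#> the constant column is dropped; only varies and y remain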
What step function would do PCA?
step_pca()
Replaces variables with components
rec %>% step_pca(all_numeric(), num_comp = 5)
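One way to inspect the components after training (a sketch: tidy() on a prepped recipe returns the loadings; number = 5 assumes step_pca is the fifth step, as below):

pca_prep <- recipe(Sale_Price ~ ., data = ames) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 5) %>%
  prep(training = ames)

tidy(pca_prep, number = 5)  # loading of each variable on PC1..PC5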
Write a recipe for Sale_Price ~ . that:

1. Adds a novel level to all factors
2. Converts all factors to dummy variables
3. Removes zero-variance variables
4. Centers all of the predictors
5. Scales all of the predictors
6. Computes the first 5 principal components

Save the result as pca_rec
pca_rec <- recipe(Sale_Price ~ ., data = ames) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 5)
pca_rec
#> Data Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor         80
#>
#> Operations:
#>
#> Novel factor level assignment for all_nominal()
#> Dummy variables from all_nominal()
#> Zero variance filter on all_predictors()
#> Centering for all_predictors()
#> Scaling for all_predictors()
#> No PCA components were extracted.
You can also give variables a "role" within a recipe and then select by roles.
has_role(match = "privacy")
add_role(rec, Fence, new_role = "privacy")
update_role(rec, Fence, new_role = "privacy", old_role = "yard")
remove_role(rec, Fence, old_role = "yard")
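For example (a sketch), give Fence the "privacy" role and then select it by role:

rec2 <- recipe(Sale_Price ~ ., data = ames) %>%
  update_role(Fence, new_role = "privacy")

summary(rec2)  # Fence now has role "privacy"; select it with has_role("privacy")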
If we use add_model() to add a model to a workflow, what would we use to add a recipe?

Let's see!
Make a workflow that combines pca_rec with lm_spec.
pca_wf <- workflow() %>%
  add_recipe(pca_rec) %>%
  add_model(lm_spec)
pca_wf
#> ══ Workflow ════════════════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 6 Recipe Steps
#>
#> • step_novel()
#> • step_dummy()
#> • step_zv()
#> • step_center()
#> • step_scale()
#> • step_pca()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Linear Regression Model Specification (regression)
#>
#> Computational engine: lm
add_recipe()
Adds a recipe to a workflow.
pca_wf <- workflow() %>%
  add_recipe(pca_rec) %>%
  add_model(lm_spec)
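Because the recipe travels with the workflow, fitting and predicting are single calls (a sketch using the objects above):

pca_fit <- pca_wf %>% fit(data = ames_train)
predict(pca_fit, new_data = ames_test)  # prep() and bake() happen automatically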
Do you need to add a formula if you have a recipe?

Nope!
rec <- recipe(Sale_Price ~ ., data = ames)
Try our PCA workflow on ames_folds. What is the estimated RMSE?
pca_wf %>%
  fit_resamples(resamples = ames_folds) %>%
  collect_metrics()
#> # A tibble: 2 x 6
#>   .metric .estimator      mean     n  std_err .config
#>   <chr>   <chr>          <dbl> <int>    <dbl> <chr>
#> 1 rmse    standard   38937.       10 1349.    Preprocessor1_Model1
#> 2 rsq     standard       0.761    10    0.0158 Preprocessor1_Model1
update_recipe()
Replace the recipe in a workflow.
pca_wf %>% update_recipe(bc_rec)
Modify the code to build a new PCA recipe that uses a Box-Cox transformation instead of centering and scaling the data. Then update pca_wf to use the new recipe.
Hint: Guess. Use tab completion. Or visit https://recipes.tidymodels.org/reference/index.html.
bc_rec <- recipe(Sale_Price ~ ., data = ames) %>%
  step_novel(all_nominal()) %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  step_BoxCox(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 5)

bc_wf <- pca_wf %>%
  update_recipe(bc_rec)
bc_wf %>%
  fit_resamples(resamples = ames_folds) %>%
  collect_metrics()
#> # A tibble: 2 x 6
#>   .metric .estimator      mean     n std_err .config
#>   <chr>   <chr>          <dbl> <int>   <dbl> <chr>
#> 1 rmse    standard   44203.       10 2286.   Preprocessor1_Model1
#> 2 rsq     standard       0.699    10  0.0287 Preprocessor1_Model1
Before (center + scale): RMSE ≈ 38,937. After (Box-Cox): RMSE ≈ 44,203. The Box-Cox recipe actually performs worse here.
library(modeldata)
data(stackoverflow)
glimpse(stackoverflow)
#> Rows: 5,594
#> Columns: 21
#> $ Country                              <fct> United Kingdom, United States, Un…
#> $ Salary                               <dbl> 100000.000, 130000.000, 175000.00…
#> $ YearsCodedJob                        <int> 20, 20, 16, 4, 1, 1, 13, 4, 7, 17…
#> $ OpenSource                           <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, …
#> $ Hobby                                <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, …
#> $ CompanySizeNumber                    <dbl> 5000, 1000, 10000, 1000, 5000, 20…
#> $ Remote                               <fct> Remote, Remote, Not remote, Not r…
#> $ CareerSatisfaction                   <int> 8, 9, 7, 9, 5, 8, 7, 7, 8, 9, 10,…
#> $ Data_scientist                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Database_administrator               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Desktop_applications_developer       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, …
#> $ Developer_with_stats_math_background <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, …
#> $ DevOps                               <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Embedded_developer                   <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Graphic_designer                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Graphics_programming                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Machine_learning_specialist         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Mobile_developer                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
#> $ Quality_assurance_engineer           <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Systems_administrator                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ Web_developer                        <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, …
Bigger version of what we used earlier.
Name that package!

set.seed(100)  # Important!
so_split <- initial_split(stackoverflow, strata = Remote)
so_train <- training(so_split)
so_test  <- testing(so_split)

(Answer: rsample)
Name that package!

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

(Answer: parsnip)
Name that package!

so_rec <- recipe(Remote ~ ., data = so_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_lincomb(all_predictors())

(Answer: recipes)
Name that package!

so_wf <- workflow() %>%
  add_model(tree_spec) %>%
  add_recipe(so_rec)

(Answer: workflows)
fit()

set.seed(1980)
so_fit <- so_wf %>%
  fit(data = so_train)

so_preds <- bind_cols(
  predict(so_fit, new_data = so_test, type = "class"),
  predict(so_fit, new_data = so_test, type = "prob")
) %>%
  mutate(truth = so_test$Remote)

so_metric_set <- metric_set(accuracy, roc_auc)
so_metric_set(so_preds, truth = truth, .pred_Remote, estimate = .pred_class)
#> # A tibble: 2 x 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.898
#> 2 roc_auc  binary         0.5
so_metric_set <- metric_set(accuracy, roc_auc, sens, spec)
so_metric_set(so_preds, truth = truth, .pred_Remote, estimate = .pred_class)
#> # A tibble: 4 x 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.898
#> 2 sens     binary         0
#> 3 spec     binary         1
#> 4 roc_auc  binary         0.5
Can you guess what the confusion matrix looks like?
conf_mat(so_preds, truth = truth, estimate = .pred_class)
#>             Truth
#> Prediction   Remote Not remote
#>   Remote          0          0
#>   Not remote    143       1254
so_train %>% count(Remote)
#> # A tibble: 2 x 2
#>   Remote         n
#>   <fct>      <int>
#> 1 Remote       432
#> 2 Not remote  3765

so_test %>% count(Remote)
#> # A tibble: 2 x 2
#>   Remote         n
#>   <fct>      <int>
#> 1 Remote       143
#> 2 Not remote  1254
Sub-class sampling
Add a recipe step to down-sample the majority class of the Remote variable in the training set prior to model training. Edit your workflow, then re-fit the model and examine the metrics. Is the ROC AUC better than chance (0.5)?
so_down <- recipe(Remote ~ ., data = so_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_lincomb(all_predictors()) %>%
  step_downsample(all_outcomes())

so_downwf <- so_wf %>%
  update_recipe(so_down)

set.seed(1980)
so_downfit <- so_downwf %>%
  fit(data = so_train)

so_downpreds <- bind_cols(
  predict(so_downfit, new_data = so_test, type = "class"),
  predict(so_downfit, new_data = so_test, type = "prob")
) %>%
  mutate(truth = so_test$Remote)
so_metric_set(so_downpreds, truth = truth, .pred_Remote, estimate = .pred_class)
#> # A tibble: 4 x 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.658
#> 2 sens     binary         0.552
#> 3 spec     binary         0.670
#> 4 roc_auc  binary         0.630
juice()
Get the preprocessed training data back from a prepped recipe. Returns a tibble.
so_down %>% prep(training = so_train) %>% juice()
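In recent versions of recipes, bake() with new_data = NULL does the same job as juice():

so_down %>% prep(training = so_train) %>% bake(new_data = NULL)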
so_train %>% count(Remote)
#> # A tibble: 2 x 2
#>   Remote         n
#>   <fct>      <int>
#> 1 Remote       432
#> 2 Not remote  3765
so_down %>%
  prep(training = so_train) %>%
  juice() %>%
  count(Remote)
#> # A tibble: 2 x 2
#>   Remote         n
#>   <fct>      <int>
#> 1 Remote       432
#> 2 Not remote   432
step_downsample()

Down-sampling is performed on the training set only. The default is skip = TRUE, so the step is skipped when the recipe is baked on new data.

so_test %>% count(Remote)
#> # A tibble: 2 x 2
#>   Remote         n
#>   <fct>      <int>
#> 1 Remote       143
#> 2 Not remote  1254

so_down %>%
  prep(training = so_train) %>%
  bake(new_data = so_test) %>%
  count(Remote)
#> # A tibble: 2 x 2
#>   Remote         n
#>   <fct>      <int>
#> 1 Remote       143
#> 2 Not remote  1254