Tidy Data Science with the Tidyverse and Tidymodels is licensed under a Creative Commons Attribution 4.0 International License.
As the number of 🦄 increases, so does the number of 🌈.
Random Sampling
Sample Variation
You don't have to collect more data.
You don't have to sacrifice fit for flexibility.
What is the correlation between bill length and bill depth?
library(tidymodels)      # loads rsample, parsnip, yardstick, purrr, dplyr, ggplot2, ...
library(palmerpenguins)

penguins
#> # A tibble: 344 x 8
#>    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torgersen           39.1          18.7               181        3750
#>  2 Adelie  Torgersen           39.5          17.4               186        3800
#>  3 Adelie  Torgersen           40.3          18                 195        3250
#>  4 Adelie  Torgersen           NA            NA                  NA          NA
#>  5 Adelie  Torgersen           36.7          19.3               193        3450
#>  6 Adelie  Torgersen           39.3          20.6               190        3650
#>  7 Adelie  Torgersen           38.9          17.8               181        3625
#>  8 Adelie  Torgersen           39.2          19.6               195        4675
#>  9 Adelie  Torgersen           34.1          18.1               193        3475
#> 10 Adelie  Torgersen           42            20.2               190        4250
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
Artwork by @allison_horst
bootstraps()
Create bootstrap samples from a data set.
set.seed(100)  # Important!
penguin_boot <- bootstraps(penguins, times = 25)
Use bootstraps() to create 100 bootstrap samples of the penguins data.
Save the bootstrap samples as penguin_boot.
Keep set.seed(100)!
set.seed(100)
penguin_boot <- bootstraps(penguins, times = 100)
penguin_boot
#> # Bootstrap sampling 
#> # A tibble: 100 x 2
#>    splits            id          
#>    <list>            <chr>       
#>  1 <split [344/134]> Bootstrap001
#>  2 <split [344/126]> Bootstrap002
#>  3 <split [344/124]> Bootstrap003
#>  4 <split [344/129]> Bootstrap004
#>  5 <split [344/127]> Bootstrap005
#>  6 <split [344/124]> Bootstrap006
#>  7 <split [344/127]> Bootstrap007
#>  8 <split [344/125]> Bootstrap008
#>  9 <split [344/128]> Bootstrap009
#> 10 <split [344/123]> Bootstrap010
#> # … with 90 more rows
What is a <list>?
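A list column holds one arbitrary R object per row instead of a single atomic value. A minimal sketch (toy data, not from the slides):

tibble(x = 1:2, y = list(1:3, letters))
# y is a <list> column: its first entry is an integer vector,
# its second a character vector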
penguin_boot$splits[[1]]
#> <Analysis/Assess/Total>
#> <344/134/344>

Reading <344/134/344>:
344 >>> Size of resample (analysis set)
134 >>> Size of holdout/unused data (assessment set)
344 >>> Total size of data set
boot1 <- penguin_boot$splits[[1]]
boot1
#> <Analysis/Assess/Total>
#> <344/134/344>

analysis()

analysis(boot1)
#> # A tibble: 344 x 8
#>    species   island    bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#>    <fct>     <fct>              <dbl>         <dbl>            <int>       <int>
#>  1 Gentoo    Biscoe              45.2          15.8              215        5300
#>  2 Adelie    Biscoe              45.6          20.3              191        4600
#>  3 Gentoo    Biscoe              50.1          15                225        5000
#>  4 Adelie    Torgersen           NA            NA                 NA          NA
#>  5 Chinstrap Dream               49.7          18.6              195        3600
#>  6 Chinstrap Dream               49.8          17.3              198        3675
#>  7 Adelie    Dream               40.3          18.5              196        4350
#>  8 Adelie    Torgersen           38.9          17.8              181        3625
#>  9 Gentoo    Biscoe              47.3          15.3              222        5250
#> 10 Chinstrap Dream               43.2          16.6              187        2900
#> # … with 334 more rows, and 2 more variables: sex <fct>, year <int>
assessment()
assessment(boot1)
#> # A tibble: 134 x 8
#>    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torgersen           36.7          19.3               193        3450
#>  2 Adelie  Torgersen           39.3          20.6               190        3650
#>  3 Adelie  Torgersen           39.2          19.6               195        4675
#>  4 Adelie  Torgersen           34.1          18.1               193        3475
#>  5 Adelie  Torgersen           42            20.2               190        4250
#>  6 Adelie  Torgersen           38.7          19                 195        3450
#>  7 Adelie  Torgersen           42.5          20.7               197        4500
#>  8 Adelie  Biscoe              37.7          18.7               180        3600
#>  9 Adelie  Biscoe              38.2          18.1               185        3950
#> 10 Adelie  Biscoe              40.6          18.6               183        3550
#> # … with 124 more rows, and 2 more variables: sex <fct>, year <int>
Why is the assessment set a different size for each bootstrap resample?
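Each bootstrap resample draws 344 rows with replacement, so a different random set of rows never gets drawn, and those left-out rows become the assessment set. A minimal sketch of the idea (toy code, not from the slides):

set.seed(123)                                    # any seed; the count changes with the draw
idx <- sample(344, size = 344, replace = TRUE)   # draw row numbers with replacement
sum(!seq_len(344) %in% idx)                      # rows never drawn: about 344/e ≈ 127, different every time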
To estimate the correlation for a data set, use the cor() function.

boot1 <- penguin_boot$splits[[1]]
boot_sample <- analysis(boot1)
cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm)
#> [1] NA
cor() returns NA because the data contain missing values; use = "complete.obs" drops the incomplete rows first.

cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm,
    use = "complete.obs")
#> [1] -0.1249339
Complete the code to calculate the correlation for the fifth bootstrap sample.
boot5 <- _____
boot_sample <- _____(boot5)
cor(boot_sample$_____, boot_sample$_____,
    use = _____)
boot5 <- penguin_boot$splits[[5]]
boot_sample <- analysis(boot5)
cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm,
    use = "complete.obs")
#> [1] -0.2450851
map() and friends

Applies a function to every element of a list.

map(.x, .f, ...)
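For instance (a toy sketch, not from the slides), mapping sum() over a list of two vectors returns a list of two sums:

map(list(1:3, 4:6), sum)
# a list of length 2: first element 6, second element 15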
What output do you expect? Choose from map_chr(), map_dbl(), map_int(), map_lgl(), map_df(), or the general map().
map()
map(penguin_boot$splits, .f)
We need a function that:
1. Takes in a split
2. Pulls out the analysis set
3. Calculates the correlation of the analysis set
Use our code from earlier:

boot_sample <- analysis(split)
cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm,
    use = "complete.obs")
function()

The function() function defines a new function: the code gets wrapped inside the curly braces, and the assignment gives the function a name.

penguin_cor <- function(split) {
  boot_sample <- analysis(split)
  cor(boot_sample$bill_length_mm, boot_sample$bill_depth_mm,
      use = "complete.obs")
}
boot1 <- penguin_boot$splits[[1]]
boot1_sample <- analysis(boot1)
cor(boot1_sample$bill_length_mm, boot1_sample$bill_depth_mm,
    use = "complete.obs")
#> [1] -0.1249339

penguin_cor(boot1)
#> [1] -0.1249339
map()

map(penguin_boot$splits, penguin_cor)
#> [[1]]
#> [1] -0.1249339
#> 
#> [[2]]
#> [1] -0.2541791
#> 
#> [[3]]
#> [1] -0.1991897
#> 
#> [[4]]
#> [1] -0.2478776
#> 
#> [[5]]
#> [1] -0.2450851
#> 
#> ... (elements [[6]] through [[99]] omitted here; one correlation per bootstrap sample) ...
#> 
#> [[100]]
#> [1] -0.1783059
map() + mutate()
penguin_boot %>% 
  mutate(corr = map(splits, penguin_cor))
#> # Bootstrap sampling 
#> # A tibble: 100 x 3
#>    splits            id           corr     
#>    <list>            <chr>        <list>   
#>  1 <split [344/134]> Bootstrap001 <dbl [1]>
#>  2 <split [344/126]> Bootstrap002 <dbl [1]>
#>  3 <split [344/124]> Bootstrap003 <dbl [1]>
#>  4 <split [344/129]> Bootstrap004 <dbl [1]>
#>  5 <split [344/127]> Bootstrap005 <dbl [1]>
#>  6 <split [344/124]> Bootstrap006 <dbl [1]>
#>  7 <split [344/127]> Bootstrap007 <dbl [1]>
#>  8 <split [344/125]> Bootstrap008 <dbl [1]>
#>  9 <split [344/128]> Bootstrap009 <dbl [1]>
#> 10 <split [344/123]> Bootstrap010 <dbl [1]>
#> # … with 90 more rows
What's wrong with this? What could be improved?
map_dbl() + mutate()

penguin_boot %>% 
  mutate(corr = map_dbl(splits, penguin_cor))
#> # Bootstrap sampling 
#> # A tibble: 100 x 3
#>    splits            id             corr
#>    <list>            <chr>         <dbl>
#>  1 <split [344/134]> Bootstrap001 -0.125
#>  2 <split [344/126]> Bootstrap002 -0.254
#>  3 <split [344/124]> Bootstrap003 -0.199
#>  4 <split [344/129]> Bootstrap004 -0.248
#>  5 <split [344/127]> Bootstrap005 -0.245
#>  6 <split [344/124]> Bootstrap006 -0.279
#>  7 <split [344/127]> Bootstrap007 -0.278
#>  8 <split [344/125]> Bootstrap008 -0.205
#>  9 <split [344/128]> Bootstrap009 -0.243
#> 10 <split [344/123]> Bootstrap010 -0.275
#> # … with 90 more rows
Use the mapping functions and mutate() to calculate the correlation between bill length and bill depth.
1. Write a function to calculate the correlation from the analysis set of a split.
2. Apply that function to every bootstrap sample using mutate() and a mapping function.
3. Make a histogram of the bootstrapped correlations (a sketch follows below).
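A minimal sketch of step 3, reusing penguin_cor from earlier (ggplot2 comes loaded with the tidymodels packages):

penguin_boot %>% 
  mutate(corr = map_dbl(splits, penguin_cor)) %>% 
  ggplot(aes(x = corr)) +
  geom_histogram(bins = 20)   # bins = 20 is an arbitrary choice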
penguin_boot %>% 
  mutate(corr = map_dbl(splits, penguin_cor)) %>% 
  summarize(avg_corr = mean(corr))
#> # A tibble: 1 x 1
#>   avg_corr
#>      <dbl>
#> 1   -0.237
penguin_boot %>% 
  mutate(corr = map_dbl(splits, penguin_cor)) %>% 
  summarize(interval_95 = quantile(corr, probs = c(0.025, 0.975)),
            quantile = c(0.025, 0.975))
#> # A tibble: 2 x 2
#>   interval_95 quantile
#>         <dbl>    <dbl>
#> 1      -0.324    0.025
#> 2      -0.139    0.975
1. Use initial_split() to create a training and testing set of the penguins data.
2. Write a parsnip specification to fit a linear model that uses flipper length to predict bill length.
3. Use the testing data to calculate the RMSE.
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

lm_spec <- linear_reg() %>% 
  set_engine("lm") %>% 
  set_mode("regression")

lm_model <- fit(lm_spec, bill_length_mm ~ flipper_length_mm,
                data = penguin_train)

lm_preds <- predict(lm_model, new_data = penguin_test) %>% 
  mutate(.truth = penguin_test$bill_length_mm)

rmse(lm_preds, truth = .truth, estimate = .pred)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        4.69
What would happen if you repeated this process? Would you get the same answers?
Rerun the code chunk from the last exercise. Do you get the same answer?
Rerunning the code chunk gives a different RMSE each time:

#> .metric .estimator .estimate
#> rmse    standard        3.66
#> rmse    standard        3.91
#> rmse    standard        4.61
#> rmse    standard        4.16
#> rmse    standard        4.01
#> rmse    standard        3.88
Why is the new estimate different?
Mean RMSE

rmses %>% enframe(name = "rmse")
#> # A tibble: 10 x 2
#>     rmse value
#>    <int> <dbl>
#>  1     1  3.72
#>  2     2  3.65
#>  3     3  4.70
#>  4     4  4.31
#>  5     5  3.82
#>  6     6  4.39
#>  7     7  4.08
#>  8     8  4.85
#>  9     9  4.57
#> 10    10  4.08

mean(rmses)
#> [1] 4.216917
Which do you think is more accurate, the best result or the mean of the results? Why?
rmses <- vector(length = 10, mode = "double")

for (i in seq_along(rmses)) {
  new_split <- initial_split(penguins)
  penguin_train <- training(new_split)
  penguin_test <- testing(new_split)
  lm_model <- fit(lm_spec, bill_length_mm ~ flipper_length_mm,
                  data = penguin_train)
  lm_preds <- predict(lm_model, new_data = penguin_test) %>% 
    mutate(.truth = penguin_test$bill_length_mm)
  rmses[i] <- rmse(lm_preds, truth = .truth, estimate = .pred) %>% 
    pull(.estimate)
}
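The same ten estimates could also be computed with the mapping tools from earlier in place of the for loop (a sketch; the helper fit_once() is ours, not from the slides, and lm_spec is reused):

fit_once <- function(i) {
  new_split <- initial_split(penguins)          # a fresh random split each call
  lm_model <- fit(lm_spec, bill_length_mm ~ flipper_length_mm,
                  data = training(new_split))
  predict(lm_model, new_data = testing(new_split)) %>% 
    mutate(.truth = testing(new_split)$bill_length_mm) %>% 
    rmse(truth = .truth, estimate = .pred) %>% 
    pull(.estimate)
}

rmses <- map_dbl(1:10, fit_once)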
vfold_cv(data, v = 10, ...)
Splits the data into V roughly equal folds for cross-validation.
How many times does an observation/row appear in the assessment set?
If we use 10 folds, what percent of our data will end up in the training set and what percent in the testing set for each fold?

90% training, 10% testing
Run the code below. What does it return?
set.seed(100)
cv_folds <- vfold_cv(penguins, v = 10, strata = species)
cv_folds
cv_folds
#> #  10-fold cross-validation using stratification 
#> # A tibble: 10 x 2
#>    splits           id    
#>    <list>           <chr> 
#>  1 <split [308/36]> Fold01
#>  2 <split [308/36]> Fold02
#>  3 <split [309/35]> Fold03
#>  4 <split [309/35]> Fold04
#>  5 <split [310/34]> Fold05
#>  6 <split [310/34]> Fold06
#>  7 <split [310/34]> Fold07
#>  8 <split [310/34]> Fold08
#>  9 <split [311/33]> Fold09
#> 10 <split [311/33]> Fold10
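In V-fold cross-validation each row lands in the assessment set exactly once. A quick check (a sketch using rsample's complement(), which returns the assessment-set row indices of a split):

assess_idx <- map(cv_folds$splits, complement)  # assessment-set row indices per fold
all(table(unlist(assess_idx)) == 1)             # TRUE: each row is assessed exactly once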
How does this help us?
fit_resamples()

Trains and tests a model with cross-validation.

fit_resamples(lm_spec,
              bill_length_mm ~ flipper_length_mm,
              resamples = cv_folds)
#> # Resampling results
#> # 10-fold cross-validation using stratification 
#> # A tibble: 10 x 4
#>    splits           id     .metrics             .notes              
#>    <list>           <chr>  <list>               <list>              
#>  1 <split [308/36]> Fold01 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  2 <split [308/36]> Fold02 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  3 <split [309/35]> Fold03 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  4 <split [309/35]> Fold04 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  5 <split [310/34]> Fold05 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  6 <split [310/34]> Fold06 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  7 <split [310/34]> Fold07 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  8 <split [310/34]> Fold08 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#>  9 <split [311/33]> Fold09 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
#> 10 <split [311/33]> Fold10 <tibble[,4] [2 × 4]> <tibble[,1] [0 × 1]>
collect_metrics()

Collect metrics from a cross-validation.

cv_results <- fit_resamples(lm_spec,      # store the resampling results from above
                            bill_length_mm ~ flipper_length_mm,
                            resamples = cv_folds)

cv_results %>% 
  collect_metrics()
#> # A tibble: 2 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   4.11     10  0.158  Preprocessor1_Model1
#> 2 rsq     standard   0.444    10  0.0422 Preprocessor1_Model1
cv_results %>% 
  collect_metrics(summarize = FALSE)
#> # A tibble: 20 x 5
#>    id     .metric .estimator .estimate .config             
#>    <chr>  <chr>   <chr>          <dbl> <chr>               
#>  1 Fold01 rmse    standard       3.77  Preprocessor1_Model1
#>  2 Fold01 rsq     standard       0.451 Preprocessor1_Model1
#>  3 Fold02 rmse    standard       4.43  Preprocessor1_Model1
#>  4 Fold02 rsq     standard       0.295 Preprocessor1_Model1
#>  5 Fold03 rmse    standard       3.83  Preprocessor1_Model1
#>  6 Fold03 rsq     standard       0.315 Preprocessor1_Model1
#>  7 Fold04 rmse    standard       4.16  Preprocessor1_Model1
#>  8 Fold04 rsq     standard       0.496 Preprocessor1_Model1
#>  9 Fold05 rmse    standard       4.16  Preprocessor1_Model1
#> 10 Fold05 rsq     standard       0.560 Preprocessor1_Model1
#> 11 Fold06 rmse    standard       5.11  Preprocessor1_Model1
#> 12 Fold06 rsq     standard       0.220 Preprocessor1_Model1
#> 13 Fold07 rmse    standard       4.24  Preprocessor1_Model1
#> 14 Fold07 rsq     standard       0.594 Preprocessor1_Model1
#> 15 Fold08 rmse    standard       4.45  Preprocessor1_Model1
#> 16 Fold08 rsq     standard       0.387 Preprocessor1_Model1
#> 17 Fold09 rmse    standard       3.56  Preprocessor1_Model1
#> 18 Fold09 rsq     standard       0.593 Preprocessor1_Model1
#> 19 Fold10 rmse    standard       3.39  Preprocessor1_Model1
#> 20 Fold10 rsq     standard       0.528 Preprocessor1_Model1
metric_set()
Specify which metrics you want to get back.
fit_resamples(lm_spec,
              bill_length_mm ~ flipper_length_mm,
              resamples = cv_folds,
              metrics = metric_set(rsq)) %>% 
  collect_metrics()
#> # A tibble: 1 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rsq     standard   0.444    10  0.0422 Preprocessor1_Model1
Modify the code below to estimate our model on each of the folds and calculate the average RMSE for our penguin model.
fit(lm_spec, bill_length_mm ~ flipper_length_mm, data = penguins)
fit_resamples(lm_spec,
              bill_length_mm ~ flipper_length_mm,
              resamples = cv_folds,
              metrics = metric_set(rmse)) %>% 
  collect_metrics()
#> # A tibble: 1 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard    4.11    10   0.158 Preprocessor1_Model1
Use fit_resamples() and cv_folds to estimate two models to predict bill length:
1. bill_length_mm ~ flipper_length_mm
2. bill_length_mm ~ species + sex
Compare the performance of each.
fit_resamples(lm_spec, bill_length_mm ~ flipper_length_mm,
              resamples = cv_folds) %>% 
  collect_metrics()
#> # A tibble: 2 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   4.11     10  0.158  Preprocessor1_Model1
#> 2 rsq     standard   0.444    10  0.0422 Preprocessor1_Model1

fit_resamples(lm_spec, bill_length_mm ~ species + sex,
              resamples = cv_folds) %>% 
  collect_metrics()
#> # A tibble: 2 x 6
#>   .metric .estimator  mean     n std_err .config             
#>   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
#> 1 rmse    standard   2.33     10  0.118  Preprocessor1_Model1
#> 2 rsq     standard   0.833    10  0.0130 Preprocessor1_Model1
Why should you use the same data splits to compare each model?
🍎 to 🍎
Does cross-validation measure just the accuracy of your model, or your entire workflow?
Your entire workflow
vfold_cv() - V-fold cross-validation
loo_cv() - Leave-one-out CV
mc_cv() - Monte Carlo (random) CV
(Test sets sampled without replacement)

bootstraps()
(Test sets sampled with replacement)
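The other constructors follow the same pattern as vfold_cv() and bootstraps() (a sketch; prop = 3/4 and times = 25 are rsample's documented defaults):

loo_cv(penguins)                          # one row held out per resample
mc_cv(penguins, prop = 3/4, times = 25)   # random 75/25 splits, repeated 25 times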