class: title-slide, center <span class="fa-stack fa-4x"> <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i> <strong class="fa-stack-1x" style="color:#009FB7;">7</strong> </span> # Case Study ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).] <div style = "position:fixed; visibility: hidden"> `$$\require{color}\definecolor{yellow}{rgb}{0.996078431372549, 0.843137254901961, 0.4}$$` `$$\require{color}\definecolor{blue}{rgb}{0, 0.623529411764706, 0.717647058823529}$$` </div> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ TeX: { Macros: { yellow: ["{\\color{yellow}{#1}}", 1], blue: ["{\\color{blue}{#1}}", 1] }, loader: {load: ['[tex]/color']}, tex: {packages: {'[+]': ['color']}} } }); </script> <style> .yellow {color: #FED766;} .blue {color: #009FB7;} </style> --- background-image: url(images/case-study-1/tidyr-dplyr-ggplot2.png) background-position: center middle background-size: 85% --- class: your-turn # Your turn 0 .big[ * Open the R Notebook **materials/exercises/07-case-study-1.Rmd** * Run the setup chunk ]
01
:
00
--- # `fivethirtyeight` .big[ An R package containing data sets behind the stories published by [FiveThirtyEight](https://fivethirtyeight.com/) ] ```r library(fivethirtyeight) #> Some larger datasets need to be installed separately, like senators and #> house_district_forecast. To install these, we recommend you install the #> fivethirtyeightdata package by running: #> install.packages('fivethirtyeightdata', repos = #> 'https://fivethirtyeightdata.github.io/drat/', type = 'source') ``` --- class: middle .pull-left[ <img src="images/case-study-1/538-graphic.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ ### Some People Are Too Superstitious To Have A Baby On Friday The 13th [Story link](https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/) .huger[Can we replicate this plot?] ] --- class: your-turn # Your turn 1 .pull-left[ .big[ - Take a look at **`US_births_1994_2003`** - Discuss in the chat: Brainstorm the steps needed to get the data into a form ready to make the plot. ] ] .pull-right[ <img src="images/case-study-1/538-graphic.png" width="90%" style="display: block; margin: auto;" /> ]
05
:
00
--- .panelset[ .panel[.panel-name[Code] ```r US_births_1994_2003 %>% filter(year == 1994) %>% ggplot(mapping = aes(x = date, y = births)) + geom_line() ``` ] .panel[.panel-name[Plot] <img src="images/case-study-1/plots/year-line-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- .pull-left[ <img src="images/case-study-1/538-graphic.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right[ Data required to make the plot <img src="images/case-study-1/answer-preview.png" width="100%" style="display: block; margin: auto;" /> ] .right-float[\* using slightly different data] ??? x-axis: day of week y-axis: some calculated value --- background-image: url(images/case-study-1/steps1.png) background-position: center middle background-size: 95% --- background-image: url(images/case-study-1/steps2.png) background-position: center middle background-size: 95% --- background-image: url(images/case-study-1/steps3.png) background-position: center middle background-size: 95% --- # One potential process * Get just the data for the 6th, 13th, and 20th * Calculate the variable of interest * For each month/year: * Find average births on the 6th and 20th * Find *percentage difference* between births on the 13th and the average births on the 6th and 20th * Average *percentage difference* by day of the week * Create plot --- class: your-turn # Your turn 2 .big[ * Remove the **`date`** field. ** It is redundant and will get in the way later. * Extract data for just the 6th, 13th, and 20th of each month. ]
03
:
00
--- class: your-turn ```r US_births_1994_2003 %>% select(-date) %>% filter(date_of_month %in% c(6, 13, 20)) #> # A tibble: 360 x 5 #> year month date_of_month day_of_week births #> <int> <int> <int> <ord> <int> #> 1 1994 1 6 Thurs 11406 #> 2 1994 1 13 Thurs 11212 #> 3 1994 1 20 Thurs 11682 #> 4 1994 2 6 Sun 8309 #> 5 1994 2 13 Sun 8171 #> 6 1994 2 20 Sun 8402 #> 7 1994 3 6 Sun 8389 #> 8 1994 3 13 Sun 8248 #> 9 1994 3 20 Sun 8243 #> 10 1994 4 6 Wed 11811 #> # … with 350 more rows ``` --- .big[ Two options for arranging the data. Which is tidy? ] **Option 1**: days in rows ``` #> # A tibble: 3 x 5 #> year month date_of_month day_of_week births #> <int> <int> <int> <ord> <int> #> 1 1994 1 6 Thurs 11406 #> 2 1994 1 13 Thurs 11212 #> 3 1994 1 20 Thurs 11682 ``` **Option 2**: days in columns ``` #> # A tibble: 1 x 6 #> year month day_of_week `6` `13` `20` #> <int> <int> <ord> <int> <int> <int> #> 1 1994 1 Thurs 11406 11212 11682 ``` --- class: your-turn # Your turn 3 .big[ Which arrangement is tidy? ] **Hint**: Think about our next step, *"Find the percentage difference between the 13th and the average of the 6th and 20th."* In which layout will this be easier using our tidy tools?
02
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Option 1] ``` #> # A tibble: 3 x 5 #> year month date_of_month day_of_week births #> <int> <int> <int> <ord> <int> #> 1 1994 1 6 Thurs 11406 #> 2 1994 1 13 Thurs 11212 #> 3 1994 1 20 Thurs 11682 ``` Next step, we'd have to write a custom function to summarize these three rows, relying on order, or subsetting to reference dates. **NOT TIDY.** ] .panel[.panel-name[Option 2] ``` #> # A tibble: 1 x 6 #> year month day_of_week `6` `13` `20` #> <int> <int> <ord> <int> <int> <int> #> 1 1994 1 Thurs 11406 11212 11682 ``` Next step, we can mutate directly referring to columns for days. **TIDY!** ] ] --- class: your-turn # Your turn 4 .big[ Tidy the filtered data to have the days in columns. ] ``` #> # A tibble: 1 x 6 #> year month day_of_week `6` `13` `20` #> <int> <int> <ord> <int> <int> <int> #> 1 1994 1 Thurs 11406 11212 11682 ```
03
:
00
--- class: your-turn ```r US_births_1994_2003 %>% select(-date) %>% filter(date_of_month %in% c(6, 13, 20)) %>% pivot_wider(names_from = date_of_month, values_from = births) #> # A tibble: 120 x 6 #> year month day_of_week `6` `13` `20` #> <int> <int> <ord> <int> <int> <int> #> 1 1994 1 Thurs 11406 11212 11682 #> 2 1994 2 Sun 8309 8171 8402 #> 3 1994 3 Sun 8389 8248 8243 #> 4 1994 4 Wed 11811 11428 11585 #> 5 1994 5 Fri 11904 11085 11645 #> 6 1994 6 Mon 11130 10692 11337 #> 7 1994 7 Wed 13086 12134 12378 #> 8 1994 8 Sat 9336 9474 9646 #> 9 1994 9 Tues 11448 12560 12584 #> 10 1994 10 Thurs 12017 11398 11876 #> # … with 110 more rows ``` --- class: your-turn # Your turn 5 .big[ Now use **`mutate()`** to add columns for: - The average of the births on the 6th and 20th. - The percentage difference between the number of births on the 13th and the average of the 6th and 20th. **Hint:** You need to use backticks,`` ` ``, around the days, e.g.,`` `6` ``,`` `13` ``, and`` `20` `` to specify the column names. ]
05
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Solution] ```r births_diff_13 <- US_births_1994_2003 %>% select(-date) %>% filter(date_of_month %in% c(6, 13, 20)) %>% pivot_wider(names_from = date_of_month, values_from = births) %>% mutate(avg_6_20 = (`6` + `20`) / 2, diff_13 = ((`13` - avg_6_20) / avg_6_20) * 100) ``` ] .panel[.panel-name[Output] ```r births_diff_13 #> # A tibble: 120 x 8 #> year month day_of_week `6` `13` `20` avg_6_20 diff_13 #> <int> <int> <ord> <int> <int> <int> <dbl> <dbl> #> 1 1994 1 Thurs 11406 11212 11682 11544 -2.88 #> 2 1994 2 Sun 8309 8171 8402 8356. -2.21 #> 3 1994 3 Sun 8389 8248 8243 8316 -0.818 #> 4 1994 4 Wed 11811 11428 11585 11698 -2.31 #> 5 1994 5 Fri 11904 11085 11645 11774. -5.86 #> 6 1994 6 Mon 11130 10692 11337 11234. -4.82 #> 7 1994 7 Wed 13086 12134 12378 12732 -4.70 #> 8 1994 8 Sat 9336 9474 9646 9491 -0.179 #> 9 1994 9 Tues 11448 12560 12584 12016 4.53 #> 10 1994 10 Thurs 12017 11398 11876 11946. -4.59 #> # … with 110 more rows ``` ] ] --- class: center Let's take a look at the distribution of average difference by day of the week. <img src="images/case-study-1/plots/births-dist-1.png" width="80%" style="display: block; margin: auto;" /> --- class: center Let's take a look at the distribution of average difference by day of the week. <img src="images/case-study-1/plots/births-dist-highlight-1.png" width="80%" style="display: block; margin: auto;" /> --- ```r births_diff_13 %>% filter(diff_13 > 10) #> # A tibble: 1 x 8 #> year month day_of_week `6` `13` `20` avg_6_20 diff_13 #> <int> <int> <ord> <int> <int> <int> <dbl> <dbl> #> 1 1999 9 Mon 8249 11481 11961 10105 13.6 ``` ??? Big difference is the 6th. September 6, 1999 = Labor day. Holiday effect? --- class: your-turn # Your turn 6 .pull-left[ .big[ Calculate the average **`diff_13`** for each day of the week. Then recreate a bar graph to mimic the FiveThirtyEight plot. ] ] .pull-right[ <img src="images/case-study-1/538-graphic.png" width="90%" style="display: block; margin: auto;" /> ]
05
:
00
--- class: your-turn .panelset[ .panel[.panel-name[Calculation] ```r births_13 <- births_diff_13 %>% group_by(day_of_week) %>% summarize(avg_diff_13 = mean(diff_13)) births_13 #> # A tibble: 7 x 2 #> day_of_week avg_diff_13 #> <ord> <dbl> #> 1 Sun -0.303 #> 2 Mon -2.69 #> 3 Tues -1.38 #> 4 Wed -3.27 #> 5 Thurs -3.01 #> 6 Fri -6.81 #> 7 Sat -0.738 ``` ] .panel[.panel-name[Plot Code] ```r ggplot(births_13, mapping = aes(x = day_of_week, y = avg_diff_13)) + geom_col() ``` ] .panel[.panel-name[Plot] <img src="images/case-study-1/plots/yt-sum-diff-plot-create-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- class: center middle # styling --- .pull-left[ <img src="images/case-study-1/plots/yt-sum-diff-plot-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/case-study-1/538-graphic.png" width="100%" style="display: block; margin: auto;" /> ] ??? First - reorder x-axis, add variable for color. --- .panelset[ .panel[.panel-name[Code] ```r births_13 %>% mutate(day_of_week = fct_relevel(day_of_week, "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"), friday = day_of_week == "Fri") %>% ggplot(mapping = aes(x = day_of_week, y = avg_diff_13)) + geom_col(mapping = aes(fill = friday)) ``` ] .panel[.panel-name[Plot] <img src="images/case-study-1/plots/new-vars-1.png" width="80%" style="display: block; margin: auto;" /> ] ] ??? Reorder x-axis, map fill to `friday`. --- .panelset[ .panel[.panel-name[Code] ```r births_13 %>% mutate(day_of_week = fct_relevel(day_of_week, "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"), friday = day_of_week == "Fri") %>% ggplot(mapping = aes(x = day_of_week, y = avg_diff_13)) + geom_col(mapping = aes(fill = friday), show.legend = FALSE) + scale_fill_manual(values = c("#F2B5ED", "#F200DF")) ``` ] .panel[.panel-name[Plot] <img src="images/case-study-1/plots/new-colors-create-1.png" width="80%" style="display: block; margin: auto;" /> ] ] ??? Remove legend, specify colors. --- .pull-left[ <img src="images/case-study-1/plots/new-colors-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/case-study-1/538-graphic.png" width="100%" style="display: block; margin: auto;" /> ] --- .panelset[ .panel[.panel-name[Code] ```r births_13 %>% mutate(day_of_week = fct_relevel(day_of_week, "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"), friday = day_of_week == "Fri") %>% ggplot(mapping = aes(x = day_of_week, y = avg_diff_13)) + geom_col(mapping = aes(fill = friday), show.legend = FALSE) + scale_fill_manual(values = c("#F2B5ED", "#F200DF")) + scale_x_discrete(position = "top") + scale_y_continuous(breaks = seq(-6, 0, by = 1), labels = c("-6", "-5", "-4", "-3", "-2", "-1", "0 ppt")) + labs(x = NULL, y = NULL) ``` ] .panel[.panel-name[Plot] <img src="images/case-study-1/plots/new-axes-1.png" width="80%" style="display: block; margin: auto;" /> ] ] ??? Now, the axes. --- .panelset[ .panel[.panel-name[Code] ```r births_13 %>% mutate(day_of_week = fct_relevel(day_of_week, "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"), friday = day_of_week == "Fri") %>% ggplot(mapping = aes(x = day_of_week, y = avg_diff_13)) + geom_col(mapping = aes(fill = friday), show.legend = FALSE) + scale_fill_manual(values = c("#F2B5ED", "#F200DF")) + scale_x_discrete(position = "top") + scale_y_continuous(breaks = seq(-6, 0, by = 1), labels = c("-6", "-5", "-4", "-3", "-2", "-1", "0 ppt")) + labs(x = NULL, y = NULL) + theme(axis.ticks = element_blank(), axis.text.y = element_text(hjust = 0)) ``` ] .panel[.panel-name[Plot] <img src="images/case-study-1/plots/refine-axes-1.png" width="80%" style="display: block; margin: auto;" /> ] ] ??? Refine the axes. --- background-image: url(images/case-study-1/ggplot2-theme-elements.png) background-position: center middle background-size: 85% --- .pull-left[ <img src="images/case-study-1/plots/refine-axes-print-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/case-study-1/538-graphic.png" width="100%" style="display: block; margin: auto;" /> ] --- .panelset[ .panel[.panel-name[Code] ```r births_13 %>% mutate(day_of_week = fct_relevel(day_of_week, "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"), friday = day_of_week == "Fri") %>% ggplot(mapping = aes(x = day_of_week, y = avg_diff_13)) + geom_col(mapping = aes(fill = friday), show.legend = FALSE) + scale_fill_manual(values = c("#F2B5ED", "#F200DF")) + scale_x_discrete(position = "top") + scale_y_continuous(breaks = seq(-6, 0, by = 1), labels = c("-6", "-5", "-4", "-3", "-2", "-1", "0 ppt")) + labs(x = NULL, y = NULL, title = "The Friday the 13th effect", subtitle = "Difference in the share of U.S. births on the 13th of each month\nfrom the average of births on the 6th and the 20th, 1994-2003") + theme(axis.ticks = element_blank(), axis.text.y = element_text(hjust = 0), plot.title = element_text(face = "bold", size = rel(1.5))) ``` ] .panel[.panel-name[Plot] <img src="images/case-study-1/plots/add-title-1.png" width="80%" style="display: block; margin: auto;" /> ] ] ??? Title and subtitle. --- .panelset[ .panel[.panel-name[Code] ```r births_13 %>% mutate(day_of_week = fct_relevel(day_of_week, "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"), friday = day_of_week == "Fri") %>% ggplot(mapping = aes(x = day_of_week, y = avg_diff_13)) + geom_col(mapping = aes(fill = friday), show.legend = FALSE) + scale_fill_manual(values = c("#F2B5ED", "#F200DF")) + scale_x_discrete(position = "top") + scale_y_continuous(breaks = seq(-6, 0, by = 1), labels = c("-6", "-5", "-4", "-3", "-2", "-1", "0 ppt")) + labs(x = NULL, y = NULL, title = "The Friday the 13th effect", subtitle = "Difference in the share of U.S. births on the 13th of each month\nfrom the average of births on the 6th and the 20th, 1994-2003") + theme(axis.ticks = element_blank(), axis.text.y = element_text(hjust = 0), plot.title = element_text(face = "bold", size = rel(1.5)), panel.grid.minor = element_blank(), panel.background = element_rect(fill = "#F0F0F0"), plot.background = element_rect(fill = "#F0F0F0"), panel.grid.major = element_line(color = "#DBDBDB")) ``` ] .panel[.panel-name[Plot] <img src="images/case-study-1/plots/refine-colors-1.png" width="80%" style="display: block; margin: auto;" /> ] ] ??? Color adjustment in the theme. --- .pull-left[ <img src="images/case-study-1/plots/refine-colors-compare-1.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right[ <img src="images/case-study-1/538-graphic.png" width="100%" style="display: block; margin: auto;" /> ] --- class: your-turn # Extra Challenges .big[ If you wanted to use the **`US_births_2000_2014`** data instead, what would you need to change in the pipeline? How about using both **`US_births_1994_2003`** and **`US_births_2000_2014`**? Try not removing the date column. At what point in the pipeline does it cause problems? Why? ]
05
:
00
--- class: title-slide, center # Case Study <img src="images/case-study-1/tidyr-dplyr-ggplot2.png" width="40%" style="display: block; margin: auto;" /> ## Tidy Data Science with the Tidyverse and Tidymodels ### W. Jake Thompson #### [https://tidyds-2021.wjakethompson.com](https://tidyds-2021.wjakethompson.com) · [https://bit.ly/tidyds-2021](https://bit.ly/tidyds-2021) .footer-license[*Tidy Data Science with the Tidyverse and Tidymodels* is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).]