Chapter 3 Setup

3.2 RStudio shortcuts

  • Another way to speed up coding in RStudio is to use keyboard shortcuts. Here is the full list of RStudio shortcuts.

  • I use Ctrl+Shift+M in Windows (Cmd+Shift+M on a Mac) to insert a pipe %>% with spaces around it, which makes tidyverse coding fast.

  • I also use Alt+- in Windows (Option+- on a Mac) to create the assignment operator <-.

3.3 Load R packages to your library

Here are all the R packages used in this book.

# data vis https://www.htmlwidgets.org/index.html
library(ggplot2) # static charts - amazing variety
library(plotly) # interactive charts
library(apexcharter) # much like the paid-for Highcharts JS library. Powerful
library(trelliscopejs) #  small multiples
library(dygraphs) # awesome interactive timeseries

# ggplot extensions https://www.ggplot2-exts.org/gallery/
library(scales) # improve your scales
library(gganimate) # animate your ggplot
library(png) # read and write PNG images, e.g. to add an image to a chart
library(directlabels) # directly label lines, for example
library(gghighlight) # label points of interest on your charts
library(ggrepel) # move labels so they don't overlap
library(glue) # include column names in a looping chart title
library(crosstalk) # link html widgets like plotly to DT tables
library(skimr) # quickly skim the characteristics of a data frame
library(cowplot) # arrange multiple plots into a grid and annotate them

# colours
library(RColorBrewer) # ready-to-use colour palettes
library(prismatic) # palettes and ways to tweak an entire palette

# tables
library(kableExtra) # attractive static tables
library(formattable) # colour tables. Like Excel's conditional formatting
library(rpivotTable) # like Excel's pivot table
library(DT) # awesome interactive tables
library(reactable) # interactive tables built on the React Table JavaScript library

# Quality Assurance
library(tidylog) # great for QA on the fly 
library(testthat) # for writing unit tests
library(assertr) # write tests that verify assumptions about data early in the manipulation pipeline

# wrangling / munging / manipulating
library(tidyverse) # loads of useful packages in one 

# time series tools
library(anytime) # convert text into the right date type
library(tsbox) # convert to time series for dygraphs

# sample data
library(mosaicData) # sample data
library(eventdataR) # sample process mining data sets

# Process mining
library(bupaR) # the process mining package that converts your data into an event log
library(edeaR) # good for filtering and subsetting a process event log
library(processmapR) # convert an event log into a visual process map
library(processanimateR) # animate a process map
library(processcheckR) # carry out conformance checking (i.e. check that a process follows certain rules)
library(heuristicsmineR) # Process discovery algorithm
library(fuzzymineR) # Process discovery algorithm

3.4 The data to visualise

  • Often in EDA and data visualisation, we are most interested in how values change over time. For example, are things getting worse or better, higher or lower? For this reason, instead of using a built-in data set like iris or mtcars, we mostly use the Texas housing sales data in this book. This is a time series data set (txhousing) built into ggplot2.

  • To develop your data vis skills, I recommend finding other open data sets to practise on. I found the marriage data we look at later in this comprehensive list of R datasets. The Tidy Tuesday data sets are also a great source, and you get to learn from seeing how others (including David Robinson) have tackled them.

  • You could also look for data stories that interest you or that you could improve and explore further. I recently saw a data visualisation in The Times newspaper on smoking and vaping that I tried to improve.

  • The only manipulation we will do to the Texas housing sales data is to create a real date using lubridate, as sketched below. This book doesn’t cover the data manipulation part of EDA; a good guide is the Managing Data Frames with the dplyr package chapter in Roger Peng’s Exploratory Data Analysis with R, as well as the resources described in bullet 1 here.
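
A minimal sketch of that manipulation, assuming df is built from the ggplot2::txhousing data (the use of lubridate::make_date() is an assumption consistent with the table below):

df <- ggplot2::txhousing %>%
  dplyr::mutate(date = lubridate::make_date(year = year, month = month, day = 1))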

  • Here are the top few rows of the data frame we created, shown in a simple kableExtra table. The Tables chapter later in this book describes different table methods in more detail.
city    year month sales   volume median listings inventory date
Abilene 2000     1    72  5380000  71400      701       6.3 2000-01-01
Abilene 2000     2    98  6505000  58700      746       6.6 2000-02-01
Abilene 2000     3   130  9285000  58100      784       6.8 2000-03-01
Abilene 2000     4    98  9730000  68600      785       6.9 2000-04-01
Abilene 2000     5   141 10590000  67300      794       6.8 2000-05-01
Abilene 2000     6   156 13910000  66900      780       6.6 2000-06-01

3.5 QA and Unit tests

3.5.1 tidylog

  • We’ll also use tidylog with dplyr. tidylog is an easy way to include fast, basic Quality Assurance during dplyr data manipulation: just replace each dplyr verb with the same verb from tidylog. tidylog then tells you what each verb has done to your data. For example, how many rows dplyr::filter() drops or adds, or what percentage of values dplyr::mutate() has changed.
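
Here is the date mutate from section 3.4 again, this time written with the tidylog verb so it produces the message below (a sketch, assuming df comes from ggplot2::txhousing):

df <- ggplot2::txhousing %>%
  tidylog::mutate(date = lubridate::make_date(year = year, month = month, day = 1))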
## mutate: changed 8,602 values (100%) of 'date' (0 new NA)
  • The tidylog message above tells us that the mutate changed 100% of the date values and created 0 new NAs. This is what we would expect to happen.

  • Let’s also create a data frame with fewer cities so that some plots created later will be less crowded. Below we select only cities where the maximum number of sales in any month is more than 500.
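
A sketch of the kind of pipeline that produces the tidylog messages below (the df_subset name is an assumption):

df_subset <- df %>%
  dplyr::group_by(city) %>%
  tidylog::mutate(sales_max = base::max(sales)) %>% # NA sales make sales_max NA
  tidylog::filter(sales_max > 500) %>%
  dplyr::ungroup()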

## mutate (grouped): new variable 'sales_max' with 27 unique values and 43% NA
## filter: removed 6,171 rows (72%), 2,431 rows remaining
  • Hold on. Look! Read the tidylog message above. It says new variable 'sales_max' with 27 unique values and 43% NA. As we filter the rows in the next line of code by sales_max, we don’t want any of the values to be NA. Well spotted, tidylog. We correct this below by removing NAs from sales_max, adding na.rm = TRUE inside the base::max() function.
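
The corrected sketch, now ignoring NAs when computing the maximum:

df_subset <- df %>%
  dplyr::group_by(city) %>%
  tidylog::mutate(sales_max = base::max(sales, na.rm = TRUE)) %>% # ignore NA sales
  tidylog::filter(sales_max > 500) %>%
  dplyr::ungroup()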
## mutate (grouped): new variable 'sales_max' with 45 unique values and 0% NA
## filter: removed 5,984 rows (70%), 2,618 rows remaining
  • The tidylog message above now tells us that the new variable sales_max has 0% NA as we would expect.

3.5.2 testthat

  • Unit testing is mostly used during the creation of R packages, but it can also be part of your analysis code. This is particularly valuable when your work is high impact and you need to be confident the results are still valid in the future, for example when you run new data through your code or add new logic. Unit tests are an automatic way to highlight errors introduced by new data or new logic that you might not otherwise spot.

  • Hadley Wickham provides some useful “what to test” rules of thumb for unit tests in his R Packages book:

Strive to test each behaviour in one and only one test. Then if that behaviour later changes you only need to update a single test.

Avoid testing simple code that you’re confident will work. Instead focus your time on code that you’re not sure about, is fragile, or has complicated interdependencies. That said, I often find I make the most mistakes when I falsely assume that the problem is simple and doesn’t need any tests.

Always write a test when you discover a bug. You may find it helpful to adopt the test-first philosophy. There you always start by writing the tests, and then write the code that makes them pass. This reflects an important problem solving strategy: start by establishing your success criteria, how you know if you’ve solved the problem.

  • In this unit test we’re immediately going to break Hadley’s second rule above and write a test on the simple code that created our data frame df. The unit tests are all adapted from an excellent introductory blog on unit tests. The code has been altered to follow the three code style rules described in section 3.6 below, to make these examples easier to read and understand.

  • First, let’s test if the dimensions of df are as we expect with 9 columns and 8,602 rows.
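
A sketch of that dimensions test (the expected values match the df described above):

testthat::test_that(desc = "df has the expected dimensions", code = {
  testthat::expect_equal(object = base::nrow(x = df), expected = 8602)
  testthat::expect_equal(object = base::ncol(x = df), expected = 9)
})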

  • This testthat::test_that() test does not return a message, so we know df has passed because the dimensions are as we expect. Let’s now test data types.
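
A sketch of a data types test (exactly which columns and types to check is an assumption):

testthat::test_that(desc = "df columns have the expected types", code = {
  testthat::expect_type(object = df$city, type = "character")
  testthat::expect_type(object = df$sales, type = "double")
  testthat::expect_s3_class(object = df$date, class = "Date")
})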
  • Again, this test returns no messages, so df has passed. Finally, let’s run a test we know will fail to see what failure message it returns.
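
The deliberately failing test, reconstructed from the error message below:

testthat::test_that(desc = "no missing values", code = {
  testthat::expect_identical(object = df, expected = stats::na.omit(df))
})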
## Error: Test failed: 'no missing values'
## * <text>:3: `df` not identical to stats::na.omit(df).
## Different number of rows
  • The code above tests whether the original df object is identical to the expected object after omitting rows with NA values (using stats::na.omit()). The testthat message tells us the object and the expectation have a different number of rows. We can now investigate this test failure by eyeballing some rows of df where any of the columns contain a missing value, then decide whether this is what we would expect from this data source.
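
One way to eyeball those rows, keeping only rows with at least one missing value (a sketch):

df %>%
  dplyr::filter(!stats::complete.cases(df)) %>%
  utils::head()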
city     year month sales   volume median listings inventory date
Abilene  2006     3   155 18030000 105400       NA        NA 2006-03-01
Amarillo 2005    10    67  7225000  87000       NA        NA 2005-10-01
Amarillo 2006     6   345 46755000 120700       NA        NA 2006-06-01
Amarillo 2006     9   288 37365000 114100       NA        NA 2006-09-01
Amarillo 2007    11   253 32910000 114800       NA        NA 2007-11-01
Amarillo 2010     5   333 45797490 122100       NA        NA 2010-05-01

3.5.3 assertr

  • You can also run tests piped directly within your data manipulation code using assertr. This assertr vignette has good examples using built-in data, as does the assert section at the bottom of this good blog.

  • Below we purposely write assertr tests that will all fail, so we can read the useful error output assertr creates.
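
A sketch of the failing chain, reconstructed from the output below. assertr::chain_start() collects every failure rather than stopping at the first; the predicates are unprefixed because library(assertr) is already loaded:

df %>%
  assertr::chain_start() %>%
  assertr::verify(base::nrow(.) == 8601) %>%
  assertr::verify(base::ncol(.) == 8) %>%
  assertr::assert(in_set(1:11), month) %>%
  assertr::assert(in_set(2000:2014), year) %>%
  assertr::insist(within_n_sds(3), sales:inventory) %>%
  assertr::assert(not_na, sales:inventory) %>%
  assertr::chain_end()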

## verification [base::nrow(.) == 8601] failed! (1 failure)
## 
##     verb redux_fn             predicate column index value
## 1 verify       NA base::nrow(.) == 8601     NA     1    NA
## 
## verification [base::ncol(.) == 8] failed! (1 failure)
## 
##     verb redux_fn          predicate column index value
## 1 verify       NA base::ncol(.) == 8     NA     1    NA
## 
## Column 'month' violates assertion 'in_set(1:11)' 690 times
##     verb redux_fn    predicate column index value
## 1 assert       NA in_set(1:11)  month    12    12
## 2 assert       NA in_set(1:11)  month    24    12
## 3 assert       NA in_set(1:11)  month    36    12
## 4 assert       NA in_set(1:11)  month    48    12
## 5 assert       NA in_set(1:11)  month    60    12
##   [omitted 685 rows]
## 
## 
## Column 'year' violates assertion 'in_set(2000:2014)' 322 times
##     verb redux_fn         predicate column index value
## 1 assert       NA in_set(2000:2014)   year   181  2015
## 2 assert       NA in_set(2000:2014)   year   182  2015
## 3 assert       NA in_set(2000:2014)   year   183  2015
## 4 assert       NA in_set(2000:2014)   year   184  2015
## 5 assert       NA in_set(2000:2014)   year   185  2015
##   [omitted 317 rows]
## 
## 
## Column 'sales' violates assertion 'within_n_sds(3)' 288 times
##     verb redux_fn       predicate column index value
## 1 insist       NA within_n_sds(3)  sales  2060  4244
## 2 insist       NA within_n_sds(3)  sales  2061  3977
## 3 insist       NA within_n_sds(3)  sales  2062  4545
## 4 insist       NA within_n_sds(3)  sales  2063  4738
## 5 insist       NA within_n_sds(3)  sales  2064  4276
##   [omitted 283 rows]
## 
## 
## Column 'volume' violates assertion 'within_n_sds(3)' 254 times
##     verb redux_fn       predicate column index      value
## 1 insist       NA within_n_sds(3) volume   722  944924177
## 2 insist       NA within_n_sds(3) volume   723  880058970
## 3 insist       NA within_n_sds(3) volume   724  993168216
## 4 insist       NA within_n_sds(3) volume   725  931481472
## 5 insist       NA within_n_sds(3) volume   734 1012123948
##   [omitted 249 rows]
## 
## 
## Column 'median' violates assertion 'within_n_sds(3)' 89 times
##     verb redux_fn       predicate column index  value
## 1 insist       NA within_n_sds(3) median   734 243900
## 2 insist       NA within_n_sds(3) median   735 248900
## 3 insist       NA within_n_sds(3) median   736 246900
## 4 insist       NA within_n_sds(3) median   737 243800
## 5 insist       NA within_n_sds(3) median   740 242100
##   [omitted 84 rows]
## 
## 
## Column 'listings' violates assertion 'within_n_sds(3)' 255 times
##     verb redux_fn       predicate   column index value
## 1 insist       NA within_n_sds(3) listings  2086 21153
## 2 insist       NA within_n_sds(3) listings  2087 22043
## 3 insist       NA within_n_sds(3) listings  2088 23165
## 4 insist       NA within_n_sds(3) listings  2089 23056
## 5 insist       NA within_n_sds(3) listings  2090 23551
##   [omitted 250 rows]
## 
## 
## Column 'inventory' violates assertion 'within_n_sds(3)' 187 times
##     verb redux_fn       predicate    column index value
## 1 insist       NA within_n_sds(3) inventory  3108  21.9
## 2 insist       NA within_n_sds(3) inventory  3109  23.2
## 3 insist       NA within_n_sds(3) inventory  3115  22.8
## 4 insist       NA within_n_sds(3) inventory  3118  21.9
## 5 insist       NA within_n_sds(3) inventory  3119  22.0
##   [omitted 182 rows]
## 
## 
## Column 'sales' violates assertion 'not_na' 568 times
##     verb redux_fn predicate column index value
## 1 assert       NA    not_na  sales  1144    NA
## 2 assert       NA    not_na  sales  1159    NA
## 3 assert       NA    not_na  sales  1160    NA
## 4 assert       NA    not_na  sales  1161    NA
## 5 assert       NA    not_na  sales  1162    NA
##   [omitted 563 rows]
## 
## 
## Column 'volume' violates assertion 'not_na' 568 times
##     verb redux_fn predicate column index value
## 1 assert       NA    not_na volume  1144    NA
## 2 assert       NA    not_na volume  1159    NA
## 3 assert       NA    not_na volume  1160    NA
## 4 assert       NA    not_na volume  1161    NA
## 5 assert       NA    not_na volume  1162    NA
##   [omitted 563 rows]
## 
## 
## Column 'median' violates assertion 'not_na' 616 times
##     verb redux_fn predicate column index value
## 1 assert       NA    not_na median  1144    NA
## 2 assert       NA    not_na median  1159    NA
## 3 assert       NA    not_na median  1160    NA
## 4 assert       NA    not_na median  1161    NA
## 5 assert       NA    not_na median  1162    NA
##   [omitted 611 rows]
## 
## 
## Column 'listings' violates assertion 'not_na' 1424 times
##     verb redux_fn predicate   column index value
## 1 assert       NA    not_na listings    75    NA
## 2 assert       NA    not_na listings   257    NA
## 3 assert       NA    not_na listings   265    NA
## 4 assert       NA    not_na listings   268    NA
## 5 assert       NA    not_na listings   282    NA
##   [omitted 1419 rows]
## 
## 
## Column 'inventory' violates assertion 'not_na' 1467 times
##     verb redux_fn predicate    column index value
## 1 assert       NA    not_na inventory    75    NA
## 2 assert       NA    not_na inventory   257    NA
## 3 assert       NA    not_na inventory   265    NA
## 4 assert       NA    not_na inventory   268    NA
## 5 assert       NA    not_na inventory   282    NA
##   [omitted 1462 rows]
## Error: assertr stopped execution
  • Then we adjust all of the tests so that they pass.
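
One possible adjusted chain (a sketch: the corrected bounds come from the failures above, while the outlier and NA checks are narrowed here as an assumption, since suitable bounds depend on the data):

df %>%
  assertr::verify(base::nrow(.) == 8602) %>%
  assertr::verify(base::ncol(.) == 9) %>%
  assertr::assert(in_set(1:12), month) %>%
  assertr::assert(in_set(2000:2015), year) %>%
  assertr::assert(not_na, city, year, month, date) %>% # these columns have no NAs
  utils::head()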
## # A tibble: 6 x 9
##   city     year month sales   volume median listings inventory date      
##   <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <date>    
## 1 Abilene  2000     1    72  5380000  71400      701       6.3 2000-01-01
## 2 Abilene  2000     2    98  6505000  58700      746       6.6 2000-02-01
## 3 Abilene  2000     3   130  9285000  58100      784       6.8 2000-03-01
## 4 Abilene  2000     4    98  9730000  68600      785       6.9 2000-04-01
## 5 Abilene  2000     5   141 10590000  67300      794       6.8 2000-05-01
## 6 Abilene  2000     6   156 13910000  66900      780       6.6 2000-06-01
  • The code above returns the data frame, which means these assertr checks could sit directly inside the data preparation code that creates df.

3.6 Code style

3.6.1 Rule 1: Include package names before all functions

  • Include the package name before every function, even for base R functions, to make your code easier to read.

  • I started learning R by reading other people’s code, but I often got confused about where functions came from, particularly when people mixed functions from lots of different packages with base functions and their own functions, without showing the package name. It made their code appear more complicated and intimidating than it was.

  • Another benefit of including the package name is that hitting the Tab key after the double colon lets us browse all the package’s functions in a scrollable list.

  • Also, try highlighting a function and its package name and then hitting the F1 key to show the help page. Hitting the F2 key can be a quick way to see all the possible arguments in a function, as well as the default argument values the package author has chosen for you.

3.6.2 Rule 2: Add argument names to all functions

  • Instead of relying on the order of the arguments the function expects, name every argument when you set its value. Your code will be quicker to understand when readers can see which arguments are set without having to read the help page for that function.
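
For example, a sketch using base::seq():

# unnamed: the reader has to remember the argument order of seq()
base::seq(2000, 2015, 5)

# named: the intent is clear at a glance
base::seq(from = 2000, to = 2015, by = 5)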

3.6.3 Rule 3: Each line does one thing

  • Each line of code should do one thing. You can achieve this by hitting the return key after every pipe %>%, comma, or plus +. RStudio will then automatically indent your code in just the right way. You can also re-indent old code by highlighting it and using the RStudio shortcut Ctrl+Shift+A in Windows (Cmd+Shift+A on a Mac).

  • The code below works but doesn’t follow the three rules. It is cramped, slow to understand, and can be difficult to edit or re-use.
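
The original chunk is not reproduced here; this sketch shows the kind of cramped line the rules warn against, plotting the df data frame from earlier (the exact plot is an assumption):

ggplot(df,aes(date,sales))+geom_line(colour="grey50")+facet_wrap(~city)+theme_minimal(base_size=8)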

  • In contrast, the code below is identical to the code above, but it follows the three rules.
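
The same plot rewritten to follow the three rules (a sketch):

ggplot2::ggplot(data = df,
                mapping = ggplot2::aes(x = date, y = sales)) +
  ggplot2::geom_line(colour = "grey50") +
  ggplot2::facet_wrap(facets = ~city) +
  ggplot2::theme_minimal(base_size = 8)

Laid out this way, you can more easily: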
  1. Run your code top downwards in chunks, adding more lines each time, much like the popular ggplot flip-books I describe at bullet 11 in my collection of R guides and galleries.

  2. Comment out one line of code (or just one argument in a function) to understand what it does when you run the code without it.

  3. Find and edit arguments in a function (e.g. a font size).

  4. Help others QA or re-use your code.

  5. Help others new to R understand what you are doing and learn faster.

  6. Avoid being a gatekeeper of your R knowledge and skills. Share, explain and democratise what you know. You can then move on to more sophisticated analysis in R with an even higher value. Richard Susskind advocates this in The Future of the Professions; see also my example of describing NLP techniques intuitively.

  • Further tweaks to improve code clarity are to put spaces on either side of the equals sign and to use styler to apply some of these rules automatically. You can also customise styler to apply your own rules. I want to try customising styler to apply rules 1 to 3 above, though I don’t know if that’s possible yet.
  • My preference is for a code style that lets you imagine data being “poured” into ggplot through the pipe. Then we tell ggplot which columns to map, then the geom, followed by the coordinate and facet functions, in the same order that Wickham recommends.

  • This is inspired by the code layout of the many flipbook guides (see bullet 11 here for examples).