Chapter 8 Process Mining

8.1 What is Process Mining?

  • Process mining techniques visualise the many paths that cases can take through an organisation’s processes: for example, a patient being treated in a hospital, or a complaint being handled by a train company. However, seeing every path cases take in a process map is often overwhelming. Fortunately, there are many ways to simplify process maps, which we explore in this chapter.

  • Data science techniques can also be applied to process mined data. For example, predicting the time it will take a case to complete a process, or recommending the next best step to shorten the time to completion.

  • The Coursera Process Mining course and accompanying Process Mining: Data Science in Action book provide a detailed background to these techniques and their applications in different industries. There is also a free-to-read Process Mining in Practice book by the same author.

  • The minimum data needed for process mining are two columns that record:

    1. Activity: The activities (or events) that took place in the process.
    2. Date: The date (and perhaps time) each activity occurred.

  • For example, knowing what happened to a complaint and when each step happened provides the minimum information needed for process mining. Three further data items offer more insight:

    1. Resource: The person (or system) that carried out each activity in the process.
    2. Lifecycle: The transaction lifecycle of the case at each activity. For example, is the case at “start”, “in progress”, or “complete”.
    3. State: The state the case was moved into by each activity. This is usually at a greater level of detail than the broad transaction lifecycle categories.

  • For example, the data could tell us the customer is the resource who carried out the payment activity, which moves its transaction lifecycle to “complete” and its state to “paid”.
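  • As an illustration, a minimal complaints data set holding these items for a single case might look like this (a hypothetical example, not part of the hospital billing log used later):

library(tibble)

# one complaint case: the two required columns (activity, timestamp)
# plus the three optional items (resource, lifecycle, state)
complaint <- tribble(
  ~activity,  ~timestamp,            ~resource,  ~lifecycle,    ~state,
  "received", "2019-01-02 09:00:00", "customer", "start",       "open",
  "reviewed", "2019-01-03 14:30:00", "agent_1",  "in progress", "under review",
  "payment",  "2019-01-10 11:15:00", "customer", "complete",    "paid"
)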

8.2 The event log

8.2.1 Understanding the event log

  • The eventdataR package contains both artificial and real-life event logs. It comes from a family of process mining packages called bupaR, which stands for Business Process Analysis with R. The bupaR cheatsheet summarises the key functions from the family of packages in one clear page.

  • Let’s walk through key process mining techniques in a logical order using the hospital_billing event log from eventdataR. How it was created and anonymised is described here. Also, the academic paper Data-driven process discovery: revealing conditional infrequent behavior from event logs explores the same hospital billing event log in Figure 6.

  • The example hospital billing event log has already been created for us. Passing it to the utils::str() function shows the structure of this R object. The output shows that the log holds a number of cases, activities, resources, and “traces”. Traces are the unique paths cases take through different activities. Resources are typically the names of the people who carried out each activity.
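  • A minimal sketch of that inspection (assuming the bupaR and eventdataR packages are installed):

library(bupaR)
library(eventdataR)

# inspect the structure of the pre-built hospital billing event log
utils::str(hospital_billing)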

## Classes 'eventlog', 'tbl_df', 'tbl' and 'data.frame':    49951 obs. of  25 variables:
##  $ case_id             : chr  "A" "A" "A" "A" ...
##  $ activity            : Factor w/ 18 levels "BILLED","CHANGE DIAGN",..: 12 9 14 6 1 12 7 12 9 14 ...
##  $ lifecycle           : Factor w/ 1 level "complete": 1 1 1 1 1 1 1 1 1 1 ...
##  $ resource            : Factor w/ 1150 levels "ResA","ResAA",..: 1 NA NA NA 46 1 91 1 NA NA ...
##  $ timestamp           : POSIXct, format: "2012-12-16 19:33:10" "2013-12-15 19:00:37" ...
##  $ actorange           : chr  NA NA NA "false" ...
##  $ actred              : chr  NA NA NA "false" ...
##  $ blocked             : chr  "false" NA NA NA ...
##  $ casetype            : chr  "A" NA NA NA ...
##  $ closecode           : chr  NA "A" NA NA ...
##  $ diagnosis           : chr  "A" NA NA NA ...
##  $ flaga               : chr  "false" NA NA NA ...
##  $ flagb               : chr  "false" NA NA NA ...
##  $ flagc               : chr  NA NA NA "false" ...
##  $ flagd               : chr  "true" NA NA NA ...
##  $ iscancelled         : chr  "false" NA NA NA ...
##  $ isclosed            : chr  "true" NA NA NA ...
##  $ msgcode             : chr  NA NA NA NA ...
##  $ msgcount            : int  NA NA NA 0 NA NA NA NA NA NA ...
##  $ msgtype             : chr  NA NA NA NA ...
##  $ speciality          : chr  "A" NA NA NA ...
##  $ state               : chr  "In progress" "Closed" "Released" NA ...
##  $ version             : chr  NA NA NA "A" ...
##  $ activity_instance_id: chr  "1" "2" "3" "4" ...
##  $ .order              : int  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, "case_id")= chr "case_id"
##  - attr(*, "activity_id")= chr "activity"
##  - attr(*, "activity_instance_id")= chr "activity_instance_id"
##  - attr(*, "lifecycle_id")= chr "lifecycle"
##  - attr(*, "resource_id")= chr "resource"
##  - attr(*, "timestamp")= chr "timestamp"
  • Also, at the bottom of the output above, the event log has six additional attributes:
    1. case_id a unique case id to tell us which activity rows belong to each case.
    2. activity a description of what happened.
    3. activity_instance_id a unique id for each instance of an activity.
    4. lifecycle the status of the case at each activity such as “complete”.
    5. resource who did the activity.
    6. timestamp when the activity happened.
  • We return just those attributes below.
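  • A sketch of one way to return them, by inspecting the attributes of the log:

# the six event log attributes sit alongside names, row.names and class
utils::str(attributes(hospital_billing))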
## List of 9
##  $ names               : chr [1:25] "case_id" "activity" "lifecycle" "resource" ...
##  $ row.names           : int [1:49951] 1 2 3 4 5 6 7 8 9 10 ...
##  $ case_id             : chr "case_id"
##  $ activity_id         : chr "activity"
##  $ activity_instance_id: chr "activity_instance_id"
##  $ lifecycle_id        : chr "lifecycle"
##  $ resource_id         : chr "resource"
##  $ timestamp           : chr "timestamp"
##  $ class               : chr [1:4] "eventlog" "tbl_df" "tbl" "data.frame"
  • We would typically expect cases to be in one of several states as they pass through a process. Below we can see all the values of the state column. It appears that state has not been used to create a lifecycle attribute in the event log that would define when the case is “complete”. Instead, all cases have been marked as “complete” at every activity.
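  • A sketch of the two counts shown below:

library(dplyr)

# count how often each state value occurs across all events
hospital_billing %>% count(state)

# lifecycle only ever takes the value "complete"
hospital_billing %>% count(lifecycle)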
## # A tibble: 11 x 2
##    state                n
##    <chr>            <int>
##  1 Billable           499
##  2 Billed            7432
##  3 Check                1
##  4 Closed            7886
##  5 Empty              179
##  6 In progress      17809
##  7 Invoice rejected   229
##  8 Rejected            70
##  9 Released          7909
## 10 Unbillable          84
## 11 <NA>              7853
## # A tibble: 1 x 2
##   lifecycle     n
##   <fct>     <int>
## 1 complete  49951
  • This heatmap counts the activities and the state the case was in when each activity was carried out.
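  • One way to draw such a heatmap (a sketch using dplyr and ggplot2):

library(ggplot2)

hospital_billing %>%
  count(activity, state) %>%
  ggplot(aes(x = state, y = activity, fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), colour = "grey90", size = 3)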

  • The states Billed, Closed and Released are almost entirely triggered by one activity each (BILLED, FIN, and RELEASE respectively). Only the state In progress is triggered by more than one activity, though it is almost entirely the activities NEW and CHANGE DIAGN, followed by DELETE and REOPEN in smaller numbers.

  • With more domain knowledge from an expert who works with the real life process it would be possible to create a useful lifecycle attribute. For example, a sensible categorisation might be to:

    1. assign the value of lifecycle to “complete” for rows with the state values Billed, Closed, Invoice rejected, Rejected, Released or Unbillable.
    2. assign the value of lifecycle to “in progress” for rows with the state values Billable and In progress, and,
    3. assign the value of lifecycle to “start” when the activity is NEW.
  • Some state values are NA. These missing values could be imputed with the last known non-NA value of state, as sketched below.
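  • A sketch of both steps; the rule order below is a judgment call that would need confirming with a domain expert:

library(dplyr)
library(tidyr)

# assumes events are ordered by timestamp within each case
hospital_billing_lifecycle <- hospital_billing %>%
  group_by(case_id) %>%
  fill(state) %>%       # impute NA with the last known value of state
  ungroup() %>%
  mutate(lifecycle = case_when(
    activity == "NEW" ~ "start",
    state %in% c("Billed", "Closed", "Invoice rejected",
                 "Rejected", "Released", "Unbillable") ~ "complete",
    state %in% c("Billable", "In progress") ~ "in progress"
  ))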

8.3 Visualise the event log

  • To quickly and intuitively understand the hospital billing event log we can explore it like any other data frame or tibble in R. First we view all the events for three cases using dplyr verbs and DT::datatable().
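  • A sketch of that table, assuming cases A, B and C as the examples:

library(dplyr)

hospital_billing %>%
  filter(case_id %in% c("A", "B", "C")) %>%
  arrange(case_id, timestamp) %>%
  select(case_id, activity, timestamp, resource, state) %>%
  DT::datatable()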
  • In this table the cases A and C have a final activity of BILLED. The state that the BILLED activity moves the cases into is also called Billed.

  • In contrast, for case B the BILLED activity was not carried out. Its final activity in the log is DELETE. Perhaps the hospital bill for this person was withdrawn or cancelled for some reason?

8.3.1 Process map

  • We can quickly create process maps from an event log using the function processmapR::process_map().
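  • A minimal sketch:

library(processmapR)

# map every trace in the full event log
hospital_billing %>% process_map()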
## Warning: Prefixing `UQ()` with the rlang namespace is deprecated as of rlang 0.3.0.
## Please use the non-prefixed form or `!!` instead.
## 
##   # Bad:
##   rlang::expr(mean(rlang::UQ(var) * 100))
## 
##   # Ok:
##   rlang::expr(mean(UQ(var) * 100))
## 
##   # Good:
##   rlang::expr(mean(!!var * 100))
## 
## This warning is displayed once per session.
  • This process map displays the entire event log to show every activity that has occurred and in what order. Plotting the detail of every case path (or trace) is often overwhelming, and the process can appear more complex than it really is. Maps that are too detailed have been called “spaghetti-like” by the authors of an algorithmic method we will explore later, which is one of several ways to simplify process maps.

  • Process maps can be simplified in many ways. Each method helps us understand both typical and rare case paths more easily. A basic simplification is to create the process map only for a small number of cases. Below only cases A, B and C from the table above are mapped.
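  • A sketch of that basic simplification using edeaR’s case filter:

# keep only the three example cases, then map them
hospital_billing %>%
  edeaR::filter_case(cases = c("A", "B", "C")) %>%
  process_map()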

  • If no arguments are passed to the processmapR::process_map() function, the number in each box (or node) and its colour show how many cases have had that activity carried out. Darker colours represent more cases. The number on each arrow (or edge) and its thickness represent how many cases have passed between two activities in the direction of the arrow.

  • We can also subset event logs using the edeaR::filter_trace_frequency() function so that the process map shows only the most common paths cases take. Let’s filter the log to contain only cases that follow the top 90% most frequent paths, then pipe (%>%) that reduced log into the process map function.
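  • A sketch of that filter:

# keep cases on the 90% most frequent traces, then map them
hospital_billing %>%
  edeaR::filter_trace_frequency(percentage = 0.9) %>%
  process_map()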

  • The default values on a process map are the number of cases. The process map below uses a custom profile. We have adjusted the numbers inside each box (or node) to include both the absolute number of times the activity was carried out and, in brackets below it, the percentage of those activities among all the activities carried out. This means that the percentages in the brackets across all the boxes in the map sum to 100%.

  • The labels on the arrows (or edges) have also been altered. They now show the absolute number of times cases move from each activity to the next in the direction of the arrow. The percentages in brackets on all the arrows coming out of each box will sum to 100%.
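  • A sketch of such a custom profile, assuming a processmapR version that supports the secondary (sec) profile shown in brackets:

hospital_billing %>%
  edeaR::filter_trace_frequency(percentage = 0.9) %>%
  process_map(type = frequency("absolute"),   # main number: absolute counts
              sec = frequency("relative"))    # in brackets: percentages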

  • Next we alter the value in the brackets in the box labels to show the unique number of cases passing through that activity. The number above it (not in brackets) is usually higher for this event log. This tells us some cases must pass through that activity more than once.

  • We also alter the label on the arrows (or edges) inside the brackets to show the median average time between the two activities. The thickness of the arrow still represents the percentage of cases moving from that node.
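  • A sketch of both changes, assuming processmapR’s node/edge profile arguments; the exact profile value names (e.g. "absolute_case") vary between package versions:

hospital_billing %>%
  edeaR::filter_trace_frequency(percentage = 0.9) %>%
  process_map(type_nodes = frequency("absolute"),        # times carried out
              sec_nodes  = frequency("absolute_case"),   # unique cases, in brackets
              type_edges = frequency("relative"),        # % of flows from each node
              sec_edges  = performance(median, "days"))  # median days, in brackets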

8.3.2 Trace explorer

  • Another way to visualise the paths that cases take is with the trace explorer. This plot displays each trace (or path) sorted in descending order of frequency. The coloured squares represent the activities cases pass through in order from left to right. We have also filtered the log first to show only the top 90% most frequent paths to mirror the process map filter above.
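  • A sketch of the trace explorer call:

# keep the top 90% most frequent traces, then show all remaining traces
hospital_billing %>%
  edeaR::filter_trace_frequency(percentage = 0.9) %>%
  processmapR::trace_explorer(coverage = 1)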

  • Compared to the process map, the trace explorer makes it easier for domain experts to identify case paths that are unexpected or not allowed. This is known as “conformance checking” and is also explored later.

  • The trace explorer chart also allows easy identification and exploration of repeated activities that may be unusual or unexpected. For example, in the process maps above the BILLED activity occurs more than once for some cases. We can see this where the unique number of cases in brackets in each box is lower than the number shown above it. Below we filter the log first to only include cases where the BILLED activity occurred two or more times. We also trim each path (or trace) to run from the first to the last time the BILLED activity occurs.
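  • A sketch of the filter and trim, assuming dplyr grouping on the log and edeaR::filter_trim():

library(dplyr)

# keep cases where BILLED occurred two or more times
repeat_billed <- hospital_billing %>%
  group_by(case_id) %>%
  filter(sum(activity == "BILLED") >= 2) %>%
  ungroup()

# trim each trace to run from the first BILLED to the last BILLED
repeat_billed %>%
  edeaR::filter_trim(start_activities = "BILLED",
                     end_activities = "BILLED") %>%
  processmapR::trace_explorer(coverage = 1)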

  • Also, as the trace explorer is a ggplot chart, we can alter the plot format in two ways: we can save it as an R object and then adjust elements of that nested list directly, or we can add additional ggplot format settings using the + symbol. Both methods are demonstrated below.
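  • A sketch of the two formatting approaches, re-using the repeat_billed log from above:

library(ggplot2)

p <- repeat_billed %>% processmapR::trace_explorer(coverage = 1)

# 1. adjust an element of the nested list directly
p$labels$y <- "Trace"

# 2. add further ggplot settings with +
p + theme(legend.position = "none")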
## filter: removed 46,721 rows (94%), 3,230 rows remaining

8.3.3 Conformance checking

  • We can also use processcheckR to identify which traces either follow or break certain rules. For example, below we narrow down the process map to only show cases where the CHANGE DIAGN activity was carried out. We can imagine this change of diagnosis may be a significant event in a hospital, with an impact on the final bill, so it would be worth checking that certain agreed processes are followed. This might include detecting both mistakes in the process and potentially fraudulent activity.
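  • A sketch using processcheckR’s contains() rule:

library(processcheckR)

# flag cases containing at least one CHANGE DIAGN activity, keep them, and map
hospital_billing %>%
  check_rule(processcheckR::contains("CHANGE DIAGN"),
             label = "has_change_diagn") %>%
  filter(has_change_diagn) %>%
  process_map()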

8.3.4 Dotted chart

  • The dotted chart is similar in style to the trace explorer. It also displays activities in the order they occur horizontally. The main difference is that each line of dots in the dotted chart represents an individual case. The trace explorer instead displays the unique paths for one or more cases.

  • In the dotted chart below the x axis argument is set so that the position of each dot horizontally is the absolute date when each activity occurred. Also, the horizontal lines are sorted in descending order of the total trace time (or duration) of each case. The shortest duration case is at the top and the longest at the bottom.
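  • A sketch of that dotted chart:

# absolute time on the x axis, cases sorted by total duration
hospital_billing %>%
  processmapR::dotted_chart(x = "absolute", sort = "duration")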

## Joining, by = "case_id"

  • This dotted chart is very useful for insight into the hospital billing process. At the top of the plot, the pink dots show that many cases which begin in early 2013 with the NEW activity have no further activities recorded. This short single-activity path can also be seen in the first process map we drew in this chapter. It is the arrow that goes from the NEW box directly to the END circle without passing through any other activity boxes. However, this short path is much easier to identify with a dotted chart, which also reveals the date range in which it occurs most often. The only way to identify changes in case paths over time with a process map is to animate it, or to draw multiple maps from event logs filtered to different time periods.

  • All of the NEW activities occur in a narrow time range in early 2013. This strongly suggests the data was extracted from the live system by selecting only cases where the NEW activity occurred in early 2013.

  • Again, because the function processmapR::dotted_chart() creates a ggplot chart we can alter how it appears by adding ggplot settings or updating the R object. For example, a useful change we make to this dotted chart is to show monthly breaks on the x-axis.
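  • A sketch of that change:

library(ggplot2)

hospital_billing %>%
  processmapR::dotted_chart(x = "absolute", sort = "duration") +
  scale_x_datetime(date_breaks = "1 month", date_labels = "%b %Y") +
  theme(axis.text.x = element_text(angle = 90))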

## Joining, by = "case_id"

  • Let’s further explore the activity over time. The bar plot below counts the number of occurrences of the NEW activity in each month. We can see all the NEW activities occurred between January and April 2013.
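  • A sketch of that monthly count:

library(dplyr)
library(lubridate)
library(ggplot2)

hospital_billing %>%
  filter(activity == "NEW") %>%
  mutate(month = floor_date(timestamp, "month")) %>%
  count(month) %>%
  ggplot(aes(x = month, y = n)) +
  geom_col()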
## Warning: Removed 1 rows containing missing values (position_stack).

  • Next we sort the order of the rows of dots by the start date.
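  • A sketch of the re-sorted chart:

hospital_billing %>%
  processmapR::dotted_chart(x = "absolute", sort = "start")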
## Joining, by = "case_id"

  • This reveals two distinct groups of cases: one where the BILLED activity (green dots) often occurs during autumn 2013, and another where BILLED often occurs during spring 2014. However, with so many dots plotted it is hard to follow the path of an individual case. This is known as “over-plotting”. We can reduce over-plotting by sampling the cases with dplyr first. We can also use the plotly version of the dotted chart and hover the mouse pointer over each dot. This reveals the case id, making it possible to follow the path of individual cases by hand.
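  • A sketch of the sampling and the interactive version, assuming bupaR::case_labels() to list the case ids and using plotly::ggplotly() on the ggplot that dotted_chart() returns:

library(dplyr)

set.seed(42)
sampled <- sample(bupaR::case_labels(hospital_billing), 200)  # 200 is arbitrary

p <- hospital_billing %>%
  edeaR::filter_case(cases = sampled) %>%
  processmapR::dotted_chart(x = "absolute", sort = "start")

plotly::ggplotly(p)  # hover over a dot to reveal its case id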
## Joining, by = "case_id"

8.3.5 Resource exploration

  • The bupaR visualisations explored in this chapter so far are about the case, telling us which activities were carried out and in what order. However, we also know who carried out each activity.

  • Below we create a small data set of 30 randomly selected complete cases that begin with the NEW activity and end with the BILLED activity. We then view the activities for these 30 cases in a process map.
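  • A sketch of that sampling, assuming bupaR::case_labels() and edeaR’s endpoint filter:

# complete cases: start at NEW and end at BILLED
complete_cases <- hospital_billing %>%
  edeaR::filter_endpoints(start_activities = "NEW",
                          end_activities = "BILLED")

# take 30 of them at random and map their activities
set.seed(123)
cases_30 <- complete_cases %>%
  edeaR::filter_case(cases = sample(bupaR::case_labels(complete_cases), 30))

cases_30 %>% process_map()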

  • Let’s now visualise who carried out each of those activities (the resource) in a resource process map using processmapR::resource_map(). We also add a new factor level to the resource column where no resource has been recorded. This is so that the resource map draws fully and does not stop where the resource information is missing, leaving orphaned nodes (or boxes).
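  • A sketch of the resource map, using forcats to make the missing resources explicit:

library(forcats)

cases_30_resource <- cases_30 %>%
  mutate(resource = fct_explicit_na(resource, na_level = "missing"))

cases_30_resource %>% processmapR::resource_map()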
  • We can see from the resource process map above that the resource is often missing for certain activities. In the heatmap below, for these 30 cases, the top row counts the activities where the resource information is missing.
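  • One way to draw that heatmap:

cases_30_resource %>%
  count(resource, activity) %>%
  ggplot(aes(x = activity, y = resource, fill = n)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90))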

  • We can also plot how many resources were used on each case for the entire log using the edeaR::resource_frequency() function.
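  • A sketch:

# distribution of the number of resources involved per case
hospital_billing %>%
  edeaR::resource_frequency(level = "case") %>%
  plot()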
## Warning: Factor `resource` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## Warning: Factor `resource` contains implicit NA, consider using
## `forcats::fct_explicit_na`

  • And use the edeaR::resource_specialisation() function to see if some resources specialise in certain activities across all of the cases.
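  • A sketch:

# how many distinct activities does each resource carry out?
hospital_billing %>%
  edeaR::resource_specialisation(level = "resource") %>%
  plot()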
## Warning: Factor `resource` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## Warning: Factor `resource` contains implicit NA, consider using
## `forcats::fct_explicit_na`

  • Finally, using the processmapR::resource_matrix() we can see which resources (the antecedent) directly pass a case to another resource (consequent) to carry out the next activity just for our limited set of 30 cases.
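  • A sketch:

# which resource (antecedent) hands the case to which resource (consequent)
cases_30 %>%
  processmapR::resource_matrix() %>%
  plot()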

8.3.6 Filter interactively within RStudio

  • In the central section of the bupaR cheatsheet all of the edeaR event and case filter functions are also available as interactive Shiny Gadgets. For example, running the function edeaR::ifilter_endpoints() within RStudio will create a dialog from which you can select the start and end activities and filter the log.

  • We can then put this filtered log filter_end_points into a process map.
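  • A sketch of the gadget call and the follow-up map (run inside RStudio):

# opens a dialog to choose start and end activities interactively
filter_end_points <- edeaR::ifilter_endpoints(hospital_billing)

filter_end_points %>% process_map()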

8.3.7 Bespoke exploration

  • Further bespoke explorations are described in this process mining blog post. It includes an example of creating an interruption index to tell us how much a resource “toggles” between multiple cases, and a heatmap to show peak periods during the day.

  • Let’s re-create the hourly heatmap for our hospital billing data. We can see that some activities are more common at lunchtime (e.g. STORNO and MANUAL), and some occur in the early hours of the morning (e.g. CODE ERROR). Perhaps the very early activities outside of normal office hours are automatic system activities?
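  • One way to re-create the hourly heatmap (a sketch; the blog post’s exact code may differ):

library(dplyr)
library(lubridate)
library(ggplot2)

hospital_billing %>%
  mutate(hour = hour(timestamp)) %>%
  count(activity, hour) %>%
  group_by(activity) %>%
  mutate(share = n / sum(n)) %>%   # normalise within each activity
  ungroup() %>%
  ggplot(aes(x = hour, y = activity, fill = share)) +
  geom_tile()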

8.3.8 Creating your own event log

  • In the visualisations so far we have used the hospital billing event log that has been created for us. Creating our own log from raw events data is straightforward. Follow the bupaR “create event log” guidance.

  • When creating an event log, as well as the activity name and a timestamp, each activity row needs its own globally unique ID, the resource who carried out the activity, and a lifecycle status. The lifecycle would typically describe whether a case has just started, is in progress, or is complete when each activity occurs. If a unique activity ID and a lifecycle status are missing, they can easily be created artificially using a row count for the ID and by assigning the status a single value such as “complete” or NA. Who did each activity (the resource) may also be missing; the resource can simply be assigned as a blank NA column.

  • The example below demonstrates creating an event log with the minimum amount of information required to draw a process map, with the activity instance id, resource, and lifecycle status created artificially before the event log is built with the bupaR::eventlog() function.
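  • A sketch with hypothetical raw data (three events, two cases):

library(dplyr)
library(bupaR)

raw_events <- tibble::tibble(
  case   = c("c1", "c1", "c2"),
  action = c("register", "pay", "register"),
  when   = as.POSIXct(c("2019-01-01 09:00:00",
                        "2019-01-02 10:00:00",
                        "2019-01-01 11:00:00"))
)

log <- raw_events %>%
  mutate(activity_instance = as.character(row_number()),  # artificial unique id
         status   = "complete",                           # artificial lifecycle
         resource = NA_character_) %>%                    # resource unknown
  eventlog(case_id = "case",
           activity_id = "action",
           activity_instance_id = "activity_instance",
           lifecycle_id = "status",
           resource_id = "resource",
           timestamp = "when")

log %>% process_map()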

  • After creating an event log you may need to add additional columns with useful information. This is known as “enrichment”. However, enrichment can break the attribute mapping created by the bupaR::eventlog() function, after which the event log cannot be used with bupaR functions.

  • Rather than using the bupaR::eventlog() function to recreate the event log, which is time consuming, it’s much faster to use bupaR::mapping() to save the attribute mappings and then bupaR::re_map() to add them back to the broken event log. The time saving can be significant: large event logs can take 5 to 10 minutes to re-create, whereas re_map() takes only a few seconds.

For example, below we create an event log using the same example data in the bupaR guidance.

## Warning: Column `resource` joining factor and character vector, coercing into
## character vector
## Error: Only strings can be converted to symbols
  • To fix the log so that it can be used to draw a process map, below we re-map the log attributes with the bupaR::re_map() function, using the mapping we saved earlier with the bupaR::mapping() function. After re-mapping we can draw a process map from the event log.
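  • A sketch of the whole save-enrich-re-map sequence, re-using the small log from above with a hypothetical region attribute:

# save the attribute mapping before enriching the log
saved_mapping <- bupaR::mapping(log)

# enrichment (e.g. a join) strips the event log attributes...
enriched <- log %>%
  left_join(tibble::tibble(case = c("c1", "c2"),
                           region = c("north", "south")),
            by = "case")

# ...so re-apply them, which is much faster than rebuilding the log
log_fixed <- bupaR::re_map(enriched, saved_mapping)

log_fixed %>% process_map()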

8.4 Can we predict the age of a case?

  • Perhaps there is a feature in the data that can explain why there are two time periods in which the activity BILLED occurs? This pattern is seen clearly if we visualise the number of times the BILLED activity occurs in each month.
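  • A sketch mirroring the monthly NEW count from earlier:

hospital_billing %>%
  filter(activity == "BILLED") %>%
  mutate(month = lubridate::floor_date(timestamp, "month")) %>%
  count(month) %>%
  ggplot(aes(x = month, y = n)) +
  geom_col()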

  • It will be most useful to explore the characteristics known when the first activity (NEW) is carried out. In this way the early information at case start could be used to predict and better understand in which of the two time periods the BILLED activity is likely to occur. Below we extract the characteristics at the start of the process for each case. We also categorise the very detailed diagnosis column by extracting just the first letter of the full diagnosis code. This is so that the plots created are less complicated.
  • First we see what is contained in each predictor quickly using skimr.
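  • A sketch of the extraction and the skim:

library(dplyr)

# characteristics known at the first activity (NEW), with the diagnosis
# reduced to its first letter to simplify later plots
case_starts <- hospital_billing %>%
  filter(activity == "NEW") %>%
  mutate(diagnosis_cat = substr(diagnosis, 1, 1))

skimr::skim(case_starts)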
## Skim summary statistics
##  n obs: 11106 
##  n variables: 18 
## 
## -- Variable type:character -------------------
##       variable missing complete     n min max empty n_unique
##      actorange   11106        0 11106  NA  NA     0        0
##         actred   11106        0 11106  NA  NA     0        0
##        blocked    1106    10000 11106   5   5     0        1
##        case_id       1    11105 11106   1   3     0     9999
##       casetype    1106    10000 11106   1   1     0        7
##      closecode   11106        0 11106  NA  NA     0        0
##      diagnosis    6456     4650 11106   1   3     0      467
##  diagnosis_cat    6456     4650 11106   1   1     0       26
##          flaga    1106    10000 11106   5   5     0        1
##          flagb    1106    10000 11106   5   5     0        1
##          flagc   11106        0 11106  NA  NA     0        0
##          flagd    1105    10001 11106   4   5     0        2
##    iscancelled    1106    10000 11106   5   5     0        1
##       isclosed    1098    10008 11106   4   5     0        2
##        msgcode   11106        0 11106  NA  NA     0        0
##        msgtype   11106        0 11106  NA  NA     0        0
##     speciality    1106    10000 11106   1   1     0       22
## 
## -- Variable type:integer ---------------------
##  variable missing complete     n mean sd p0 p25 p50 p75 p100 hist
##  msgcount   11106        0 11106  NaN NA NA  NA  NA  NA   NA
  • The skimr output shows six columns where every value is missing; we can remove these as they offer no information. Next, we join the first known case characteristics (from when the case started at the NEW activity) to the final BILLED activity rows, as sketched below.
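  • A sketch of those two steps:

library(dplyr)

# each case's final BILLED event
billed_events <- hospital_billing %>%
  filter(activity == "BILLED") %>%
  group_by(case_id) %>%
  filter(timestamp == max(timestamp)) %>%
  ungroup()

# drop the all-missing columns, then join on the case-start characteristics
billed_with_starts <- billed_events %>%
  select(case_id, billed_timestamp = timestamp) %>%
  left_join(case_starts %>%
              select(where(~ !all(is.na(.x)))) %>%
              select(-timestamp),
            by = "case_id")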
## Joining, by = "case_id"
  • Finally, we can plot different case characteristics by the month in which the final case activity BILLED occurred. Starting with casetype, we can see a strong relationship with values A and B: each dominates one of the two peaks. The second peak, in January to April 2014, was almost all casetype value “A” or “NA” at the first activity (NEW), while the first peak is almost all casetype value “B” at the first activity.

  • casetype does not entirely explain the two peaks (i.e. sometimes other casetype values occur). Below we put the ggplot code from above into a function so that it is easy to repeat the plot for more of the case characteristics known at the start of the case.
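  • A sketch of such a function using tidy evaluation; billed_timestamp comes from the join sketched above:

library(ggplot2)
library(lubridate)

# monthly counts of final BILLED events, filled by a chosen characteristic
plot_billed_by <- function(data, var) {
  data %>%
    mutate(month = floor_date(billed_timestamp, "month")) %>%
    ggplot(aes(x = month, fill = {{ var }})) +
    geom_bar()
}

plot_billed_by(billed_with_starts, casetype)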
  • In the plots below the diagnosis column is mostly “NA” at the earlier peak of BILLED activities, while all the other diagnosis categories are more common than “NA” in the second peak of BILLED activity.
  • The nine plots, one for each remaining case characteristic, are displayed here.


  • If we wanted to predict the eventual age of the case when it reaches the BILLED activity from what we knew about the case at the first NEW activity, knowing if it is casetype “A”, “B” or “NA” will be a good predictor. Also, for cases which move quickly from the first NEW activity to CHANGE DIAGN (an average of 0 days), the subsequent average time to the FIN activity is only 89 days; it is much longer for cases that move directly from NEW to FIN (an average time of 361 days). This means that the activities a case passes through should also help us predict the case age while it is in-flight, not just what we knew about the case when it began.

  • There are likely to be complex combinations of additional predictors or “features” that could be used to predict the most likely case age. In a later chapter we introduce the caret package as a way to automatically combine values to create more accurate predictions.

8.5 Process Discovery algorithms

  • There are also algorithmic methods designed to reveal important paths and causal structure among the complexity of raw, spaghetti-like process maps. They include the Heuristics Miner and Fuzzy Miner packages, which represent the process as a simplified model.

8.5.1 Flexible Heuristics Miner

  • The Flexible Heuristics Miner algorithm is well explained in the Flexible Heuristics Miner paper using worked examples and set notation.

  • First we create a simple precedence matrix. The activities on the left hand y axis are the antecedent (so come first in the process), while the activities along the bottom x axis are the consequent (so occur after the antecedent). A more literal description of a precedence matrix is a “directly-follows frequency matrix”.
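  • A sketch of that matrix, using processmapR’s precedence matrix (heuristicsmineR provides a faster equivalent):

# directly-follows frequencies: antecedent rows, consequent columns
hospital_billing %>%
  processmapR::precedence_matrix(type = "absolute") %>%
  plot()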

  • A lower-code version of the … is shown below.

  • This precedence matrix is equivalent to the FHM algorithm example in Table 2 of the Flexible Heuristics Miner paper.

  • Using heuristicsmineR::precedence_matrix_length_two_loops() we can also create “length-two loops”. This shows which activities recur after a different activity is carried out. For example, the value 129 for STORNO,BILLED tells us there were 129 STORNO, BILLED, STORNO activity sequences in the event log. This matrix is equivalent to the worked FHM algorithm example in Table 3 of the Flexible Heuristics Miner paper.
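  • A sketch:

library(heuristicsmineR)

# counts of a,b,a patterns, e.g. STORNO,BILLED,STORNO
precedence_matrix_length_two_loops(hospital_billing)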

  • The heuristicsmineR::dependency_matrix() function calculates formula (1) in the Flexible Heuristics Miner paper. In words, the formula is the number of times activity B directly follows activity A, minus the number of times in the other direction (A follows B), divided by the sum of those two values plus one. The value is always between -1 and 1; a high value indicates a dependency relation in that direction.
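  • In the notation of the paper, where |A >W B| counts how often activity B directly follows activity A in event log W, formula (1) for the dependency relation is:

    A ⇒W B = ( |A >W B| − |B >W A| ) / ( |A >W B| + |B >W A| + 1 )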

  • Below we calculate the dependency matrix and illustrate the calculation with one task pair following the text at the start of section 4.1 of the FHM paper.
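  • A sketch using the package defaults:

library(heuristicsmineR)

# dependency relation values between every pair of activities
dependency_matrix(hospital_billing) %>% plot()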

  • In the left hand matrix above the task CODE ERROR is directly followed two times by task REOPEN, but the other way never occurs. This means that the dependency relation value of CODE ERROR ⇒W REOPEN = (2 - 0) / (2 + 0 + 1) = 0.7. This indicates we are not completely sure of the dependency relation with only two observations.

  • However, if CODE ERROR is directly followed by REOPEN many more times than two (say 50 times) and the other way around never occurs we can be much more certain there is a dependency relation. The value would then be CODE ERROR ⇒W REOPEN = (50 - 0) / (50 + 0 + 1) = 0.98 which is much closer to one indicating a very strong dependence relation.

  • If noise caused CODE ERROR to follow REOPEN once, the value of CODE ERROR ⇒W REOPEN is (50 -1) / (50 + 1 + 1) = 0.94. This is only a little lower and indicates we are still confident of a dependency relation with the value being so close to 1.

  • If we alter the threshold value for which we will accept a dependency relation (e.g. values above 0.9) then view that in a matrix, we can be fairly certain there is a causality between the antecedent and consequent activities shown below.

  • We can also render the simplified matrix of strong dependency relations above 0.9 into a diagram. This shows the dependency relation value on each arrow, which is always above 0.9.
  • A causal net uses the dependency graph above, annotated to show the number of cases that pass through each activity when the dependency relation threshold is set at 0.9.
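  • A sketch of both the dependency graph and the causal net, assuming heuristicsmineR’s threshold argument and render functions:

# dependency graph of relations above 0.9
dependency_matrix(hospital_billing, threshold = 0.9) %>%
  render_dependency_matrix()

# causal net annotated with case frequencies at the same threshold
causal_net(hospital_billing, threshold = 0.9) %>%
  render_causal_net()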
  • So if we created a dependency matrix with a much lower threshold the causal net would have more paths and be more complex.

8.5.2 Fuzzy miner
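  • First we try to mine a fuzzy model directly from the hospital billing log. A minimal sketch, assuming the fuzzymineR package’s mine_fuzzy_model() function:

library(fuzzymineR)

# attempt to mine fuzzy model metrics from the hospital billing log
metrics_hb <- mine_fuzzy_model(hospital_billing)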

## Error in apply(trace_events, 1, addEvent): dim(X) must have a positive length
  • Due to the error we run the algorithm with the eventdataR::traffic_fines event log that is also demonstrated in section 5.2.
  • Then we can visualize the fuzzy model for a given set of parameters.
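  • A sketch of both steps; the threshold argument names are assumptions that may differ between fuzzymineR versions:

# mine fuzzy model metrics from the traffic fines log
metrics <- mine_fuzzy_model(eventdataR::traffic_fines)

# visualise the fuzzy model for a chosen set of parameters
viz_fuzzy_model(metrics = metrics,
                node_sig_threshold = 0.4,
                edge_sig_threshold = 0.2)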
  • This can be compared to the full, complex process map for traffic fines to determine whether the fuzzymineR map is useful for process discovery.