We will perform a Market Basket Analysis (also known as association rule mining). This technique shows which items occur together, as in “users who bought X also bought Y and Z”. We will use this method to find out which properties are important for concluding that a mushroom is edible.

Terminology

Before we can start, we need some basic terminology.

Support: The support is the frequency of an itemset in the data, i.e. the fraction of transactions that contain it.

Confidence: The confidence shows how often a rule is found to be true, e.g. if x is bought, how often y is bought as well. In this context, the terms Left-Hand-Side (LHS) and Right-Hand-Side (RHS) are used rather than x and y.

Lift: Lift indicates whether the two sides of a rule LHS → RHS occur together only as often as expected by chance (lift = 1, LHS and RHS are independent) or not. If lift > 1, they occur together more often than expected under independence; if lift < 1, less often. Only rules with a lift greater than 1 are potentially useful.
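As a minimal sketch of these three measures, the following base R snippet computes them by hand for a rule x → y on five made-up baskets. The data and all variable names are purely illustrative and are not part of the mushroom analysis.

```r
# Hypothetical toy data: each row is one basket, TRUE means the item was bought.
baskets <- data.frame(
  x = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  y = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Support: fraction of baskets that contain the itemset.
supp_x  <- mean(baskets$x)              # 3 of 5 baskets
supp_y  <- mean(baskets$y)              # 3 of 5 baskets
supp_xy <- mean(baskets$x & baskets$y)  # 2 of 5 baskets

# Confidence of x -> y: among baskets with x, how many also contain y.
confidence <- supp_xy / supp_x          # 2/3

# Lift of x -> y: confidence relative to the baseline frequency of y.
lift <- confidence / supp_y             # (2/3) / (3/5) = 10/9

print(c(support = supp_xy, confidence = confidence, lift = lift))
```

Here the lift is slightly above 1, so x and y appear together a little more often than independence would suggest.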

Data Import

First, we load the packages arules and arulesViz. The package manager pacman takes care of installing them if they are missing, so only pacman itself needs to be installed beforehand.

We use the “mushrooms” dataset from the UCI Machine Learning Repository. The data is downloaded with read.csv() directly from its URL. The file does not include a header row, so the column names are added manually.

library(pacman)
# p_load() installs any missing package before loading it
p_load(arules, arulesViz)

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
# The file has no header row. Strings are read as factors (no longer the
# default since R 4.0) so that each column/level pair can become an item.
mushrooms <- read.csv(file = url, header = FALSE, stringsAsFactors = TRUE)
colnames(mushrooms) <- c("edibility", "cap_shape", "cap_surface", "cap_color", "bruises", "odor", "gill_att", "gill_spacing", "gill_size", "gill_color", "stalk_shape", "stalk_root", "stalk_surf_above", "stalk_surf_below", "stalk_color_above", "stalk_color_below", "veil_type", "veil_color", "ring_nr", "ring_type", "spore_print_color", "population", "habitat")

Data Transformation

Before we can perform the market basket analysis, we need to convert our data frame into an object of class transactions.

trans <- as(mushrooms, "transactions")
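To illustrate what this coercion does, here is a small sketch on a made-up data frame (the toy data and the name toy_trans are assumptions for illustration only). Each row becomes one transaction, and each combination of column name and factor level becomes one item, labeled in the form column=level.

```r
library(arules)

# Hypothetical toy data with two factor columns and three rows.
toy <- data.frame(
  cap_color = factor(c("n", "y", "n")),
  odor      = factor(c("a", "a", "p"))
)

# Same coercion as used for the mushroom data.
toy_trans <- as(toy, "transactions")

# One item per column/level pair, for example "cap_color=n".
print(itemLabels(toy_trans))

# Show which items each of the three transactions contains.
inspect(toy_trans)
```

This is also why the rules for the mushroom data refer to items such as edibility=e.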

We extract the rules with the apriori() function. We pass our data together with minimum values for support and confidence. The RHS is restricted to edible mushrooms (edibility=e), because we want to find rules that clearly define this outcome.

Next, we sort the rules by lift in decreasing order. With inspect() we can view them; here we look at the first ten.

# Use the transactions object created above; only edibility=e may appear on the RHS
rules <- apriori(data = trans,
                 appearance = list(rhs = "edibility=e", default = "lhs"),
                 parameter = list(supp = 0.1, conf = 0.1))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 812 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[119 item(s), 8124 transaction(s)] done [0.01s].
## sorting and recoding items ... [56 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [4.63s].
## writing ... [126040 rule(s)] done [0.07s].
## creating S4 object  ... done [0.08s].

rules <- sort(rules, by = "lift", decreasing = TRUE)
inspect(rules[1:10])

##      lhs                               rhs           support   confidence lift     count
## [1]  {gill_size=b,gill_color=n}     => {edibility=e} 0.1083210 1          1.930608  880
## [2]  {odor=n,stalk_root=e}          => {edibility=e} 0.1063516 1          1.930608  864
## [3]  {bruises=f,stalk_root=e}       => {edibility=e} 0.1063516 1          1.930608  864
## [4]  {gill_spacing=w,habitat=g}     => {edibility=e} 0.1299852 1          1.930608 1056
## [5]  {gill_spacing=w,stalk_shape=t} => {edibility=e} 0.1063516 1          1.930608  864
## [6]  {gill_spacing=w,gill_size=b}   => {edibility=e} 0.1299852 1          1.930608 1056
## [7]  {bruises=t,population=y}       => {edibility=e} 0.1201379 1          1.930608  976
## [8]  {odor=n,population=y}          => {edibility=e} 0.1191531 1          1.930608  968
## [9]  {ring_type=p,population=y}     => {edibility=e} 0.1280158 1          1.930608 1040
## [10] {stalk_shape=t,population=y}   => {edibility=e} 0.1063516 1          1.930608  864

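The highest lift is 1.93, and all ten rules shown have a confidence of 1: in this dataset, every mushroom that matches one of these LHS itemsets is edible. Because lift is the confidence divided by the support of the RHS, a confidence of 1 gives the largest lift possible for this RHS, which is why all ten lift values are identical. Since we want to make absolutely sure that we don’t eat a poisonous mushroom, we should only rely on rules with a confidence of 1.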
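Keeping only the rules that hold without exception can be sketched with subset() from arules. The example below is self-contained and uses a small made-up data frame rather than the mushroom data; the names toy, toy_rules and certain are illustrative assumptions. On the rules object from above, the analogous call would be subset(rules, subset = confidence == 1).

```r
library(arules)

# Hypothetical toy data, not the mushroom set: two attributes and an outcome.
toy <- data.frame(
  odor      = factor(c("n", "n", "n", "f", "f", "n")),
  bruises   = factor(c("t", "t", "f", "f", "t", "f")),
  edibility = factor(c("e", "e", "e", "p", "p", "p"))
)

# Mine rules whose RHS is edibility=e; minlen = 2 excludes rules with an empty LHS.
toy_rules <- apriori(
  data       = as(toy, "transactions"),
  parameter  = list(supp = 0.2, conf = 0.1, minlen = 2),
  appearance = list(rhs = "edibility=e", default = "lhs"),
  control    = list(verbose = FALSE)
)

# Keep only rules that are never violated in the data.
certain <- subset(toy_rules, subset = confidence == 1)
inspect(sort(certain, by = "support"))
```

In this toy data, odor=n alone is not sufficient, because one odorless mushroom is poisonous; only the combination of odor=n and bruises=t reaches a confidence of 1.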

We can plot the rules. Confidence is plotted against support, and the color encodes lift.

png("AssociationRules1.png")
plot(rules)
dev.off()

## png 
##   2

There are some other ways of representing the data. This is the graph method for the first twenty rules.

# The "type" control argument is dropped: it is not among the control
# parameters that arulesViz lists for the graph method, so passing it only
# triggers a dump of the available parameters.
plot(rules[1:20], method = "graph")

Another way to plot the relationships is the parallel coordinates method (paracoord).

plot(rules[1:20], method="paracoord", control=list(reorder=TRUE))

I hope this gave you a kick-start on association rules.

More Information

Mushroom Dataset by Jeff Schlimmer, provided via UCI Machine Learning Repository
Wikipedia Association Rule Learning