We will perform a Market Basket Analysis (also known as association rule mining). This technique shows which items occur together, like “users who bought X also bought Y and Z”. We will use this method to find out which properties are important for concluding that a mushroom is edible.
Terminology
Before we can start, we need some basic terminology.
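Support: The support is the relative frequency of an itemset in the data, i.e. the fraction of all transactions that contain the itemset.
Confidence: The confidence shows how often a rule is found to be true, e.g. if x is bought, how often y is bought as well. It is the support of x and y together divided by the support of x. In this context, rather than x and y, the terms Left-Hand-Side (LHS) and Right-Hand-Side (RHS) are used.
Lift: The lift indicates whether the two sides of a rule LHS => RHS occur together only by chance (LHS and RHS are independent) or not. It is the confidence of the rule divided by the support of the RHS. A lift of 1 means that LHS and RHS are independent, a lift greater than 1 means that they occur together more often than expected under independence, and a lift below 1 means that they occur together less often. Only rules with a lift greater than 1 are potentially useful.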
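To make the three measures concrete, here is a minimal sketch in base R that computes them by hand. The five baskets and the helper functions support(), confidence() and lift() are made up for this illustration; they are not part of arules, which computes these measures internally.

```r
# Toy data: each row is one transaction (basket), each column one item.
# The five baskets are invented for illustration only.
baskets <- cbind(
  bread  = c(TRUE,  TRUE,  FALSE, TRUE,  FALSE),
  butter = c(TRUE,  TRUE,  FALSE, FALSE, FALSE),
  milk   = c(FALSE, TRUE,  TRUE,  TRUE,  TRUE)
)

# Support of an itemset: fraction of transactions that contain all its items.
support <- function(data, items) {
  mean(rowSums(data[, items, drop = FALSE]) == length(items))
}

# Confidence of LHS => RHS: support(LHS and RHS) / support(LHS).
confidence <- function(data, lhs, rhs) {
  support(data, c(lhs, rhs)) / support(data, lhs)
}

# Lift of LHS => RHS: confidence / support(RHS).
lift <- function(data, lhs, rhs) {
  confidence(data, lhs, rhs) / support(data, rhs)
}

# Measures for the rule {bread} => {butter}
supp_rule <- support(baskets, c("bread", "butter"))
conf_rule <- confidence(baskets, "bread", "butter")
lift_rule <- lift(baskets, "bread", "butter")
print(c(support = supp_rule, confidence = conf_rule, lift = lift_rule))
```

For these five baskets the rule {bread} => {butter} has a support of 0.4, a confidence of 2/3 and a lift of 5/3, so butter shows up more often in baskets with bread than it does overall.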
Data Import
First, we load the packages arules and arulesViz. They need to be installed before they can be loaded. Here the package manager pacman takes care of this: p_load() installs missing packages and then loads them (pacman itself must already be installed).
We use the “mushrooms” dataset from the UCI Machine Learning Repository. The data is downloaded with read.csv() directly from its URL. The file does not include a header row, so the column names are added manually.
library(pacman)
# p_load() installs the packages if they are missing and then loads them
p_load(arules, arulesViz)
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
# stringsAsFactors = TRUE keeps all columns as factors for the later conversion to transactions (read.csv() no longer does this by default since R 4.0.0)
mushrooms <- read.csv(file = url, header = FALSE, stringsAsFactors = TRUE)
colnames(mushrooms) <- c("edibility", "cap_shape", "cap_surface", "cap_color", "bruises", "odor",
                         "gill_att", "gill_spacing", "gill_size", "gill_color", "stalk_shape", "stalk_root",
                         "stalk_surf_above", "stalk_surf_below", "stalk_color_above", "stalk_color_below",
                         "veil_type", "veil_color", "ring_nr", "ring_type", "spore_print_color", "population", "habitat")
Data Transformation
Before we can perform the market basket analysis, we need to convert our data frame to an object of class transactions.
trans <- as(mushrooms, "transactions")
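The conversion turns every column/value pair into one item, which is why the rules further down contain labels such as odor=n. The following base R sketch mimics that idea on a tiny example. The three-row data frame toy is hypothetical, and the code only illustrates the encoding; it is not how arules implements the coercion.

```r
# Hypothetical three-row excerpt; the values are made up for illustration.
toy <- data.frame(
  edibility = factor(c("e", "p", "e")),
  odor      = factor(c("n", "f", "n"))
)

# Build one "column=value" label per cell (rows = mushrooms, columns = attributes).
item_labels <- sapply(names(toy), function(col) paste0(col, "=", toy[[col]]))

# All distinct items that occur anywhere in the data.
all_items <- sort(unique(as.vector(item_labels)))

# Logical incidence matrix: one row per transaction, one column per item.
incidence <- t(apply(item_labels, 1, function(row) all_items %in% row))
colnames(incidence) <- all_items
print(incidence)
```

Each row of the resulting matrix has exactly one TRUE per original column, so a mushroom is treated like a basket that contains one item for each of its attributes.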
We can extract the rules with the apriori() function. We pass it our transactions. The RHS is restricted to edible mushrooms (edibility=e), because we want to find rules that clearly define this outcome; all other items may only appear on the LHS. Minimum values for support and confidence (0.1 each) are passed as well.
Next, we sort the rules by lift in decreasing order. With inspect() we can view the rules.
rules <- apriori(data = trans,
                 appearance = list(rhs = c("edibility=e"), default = "lhs"),
                 parameter = list(supp = 0.1, conf = 0.1))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.1    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 812
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[119 item(s), 8124 transaction(s)] done [0.01s].
## sorting and recoding items ... [56 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [4.63s].
## writing ... [126040 rule(s)] done [0.07s].
## creating S4 object ... done [0.08s].
rules <- sort(rules, by = "lift", decreasing = TRUE)
inspect(rules[1:10])
##      lhs                               rhs           support   confidence
## [1]  {gill_size=b,gill_color=n}     => {edibility=e} 0.1083210 1
## [2]  {odor=n,stalk_root=e}          => {edibility=e} 0.1063516 1
## [3]  {bruises=f,stalk_root=e}       => {edibility=e} 0.1063516 1
## [4]  {gill_spacing=w,habitat=g}     => {edibility=e} 0.1299852 1
## [5]  {gill_spacing=w,stalk_shape=t} => {edibility=e} 0.1063516 1
## [6]  {gill_spacing=w,gill_size=b}   => {edibility=e} 0.1299852 1
## [7]  {bruises=t,population=y}       => {edibility=e} 0.1201379 1
## [8]  {odor=n,population=y}          => {edibility=e} 0.1191531 1
## [9]  {ring_type=p,population=y}     => {edibility=e} 0.1280158 1
## [10] {stalk_shape=t,population=y}   => {edibility=e} 0.1063516 1
##      lift     count
## [1]  1.930608  880
## [2]  1.930608  864
## [3]  1.930608  864
## [4]  1.930608 1056
## [5]  1.930608  864
## [6]  1.930608 1056
## [7]  1.930608  976
## [8]  1.930608  968
## [9]  1.930608 1040
## [10] 1.930608  864
The highest lift is 1.93. All of these rules have a confidence of 1, which means that in this dataset every mushroom matching the LHS is edible. Since we want to make absolutely sure that we don’t eat a poisonous mushroom, we should only rely on rules with a confidence of 1.
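The fact that all ten rules share the same lift follows from the definition: they have the same RHS and a confidence of 1, so their lift is 1 divided by the support of edibility=e. As a rough cross-check, the sketch below inverts that relation using only numbers copied from the output above. The variable names are chosen for this illustration.

```r
# Values copied from rule [1] of the inspect() output and from the apriori() log.
conf_rule1     <- 1
lift_rule1     <- 1.930608
n_transactions <- 8124

# lift = confidence / support(RHS), so support(RHS) = confidence / lift
support_edible <- conf_rule1 / lift_rule1

# Implied number of edible mushrooms in the data (rounded, since the lift is printed with limited precision)
count_edible <- round(support_edible * n_transactions)

print(support_edible)
print(count_edible)
```

Under these numbers roughly 52% of the mushrooms in the data are edible. A lift of 1.93 is therefore the maximum that any rule with edibility=e on the RHS can reach in this dataset.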
We can plot the rules. Confidence is plotted against support, and the color shading represents lift.
png("AssociationRules1.png")
plot(rules)
dev.off()
## png
##   2
There are other ways of visualizing the rules. This is the graph method for the first twenty rules.
plot(rules[1:20], method = "graph")
The call relies on the default control parameters of the graph method, such as engine = igraph, alpha = 0.5, cex = 1 and max = 100.
Another way to plot the relationship is the paracoord-method.
plot(rules[1:20], method="paracoord", control=list(reorder=TRUE))
I hope this gave you a kick-start on association rules.
More Information
- Mushroom Dataset by Jeff Schlimmer, provided via UCI Machine Learning Repository
- Wikipedia Association Rule Learning