Association rule mining (1)
[***This article reflects my personal understanding; it may contain mistakes, and any correction is greatly appreciated***]
[Background]
Proposed by Agrawal et al. in 1993 and initially used for market analysis, association rule mining aims to find relationships among the items that customers purchase. It is now widely studied in the database and data mining fields.
[Model] This section discusses the data and the rules.
In association rule mining, all the data are assumed to have been categorized already.
(1) There are m items, forming the item set I = {i1, i2, ..., im}
(2) There are n transactions, forming the transaction set T = {t1, t2, ..., tn}, where each t ⊆ I
ex. t1 = {bread, cheese, milk}, t2 = {apple, eggs, salt, yogurt}, ..., tn = {biscuit, eggs, banana}
(3) itemset: a set of items; ex. {i2, i7, i11}
(4) k-itemset: an itemset with k items; ex. {i2, i7, i11} is a 3-itemset
If each transaction is viewed as a text document, then each document contains a set of keywords (such as bread, apple, banana, and so on).
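
To make the model concrete, here is a minimal Python sketch of these definitions (the variable names below are mine, chosen for illustration):

# A minimal sketch of the model above: items and transactions as Python sets.
I = {"bread", "cheese", "milk", "apple", "eggs", "salt", "yogurt"}   # the m items

T = [                                     # the n transactions; each t is a subset of I
    {"bread", "cheese", "milk"},          # t1
    {"apple", "eggs", "salt", "yogurt"},  # t2
]

s = frozenset({"bread", "milk"})          # an itemset; len(s) == 2, so a 2-itemset
assert all(t <= I for t in T)             # t ⊆ I holds for every transaction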
[Definition]
Association rule: an implication of the form X => Y, where X, Y ⊆ I and X ∩ Y = ∅.
That is, X and Y are both itemsets drawn from I, with no items in common.
ex. X = {bread, milk}, Y = {banana}. In plain terms: when a transaction contains bread and milk, it also contains banana.
[Rule strength measure]
Support: The rule holds with support SUP in T if SUP% of transactions contain X ∪ Y.
# sup(X => Y) = (X ∪ Y).count / n; for a single itemset, supp(X) = X.count / n = the fraction of T that contains X.
Confidence: The rule holds in T with confidence CONF if CONF% of the transactions that contain X also contain Y.
# confidence = (X ∪ Y).count / X.count
Confidence = Pr(Y|X) = Pr(X ∪ Y) / Pr(X), i.e., among the transactions in T that contain X, the fraction that also contain Y.
=> An association rule is a way of expressing that when X occurs, Y occurs with a certain probability.
Recall:
→ data and rules
→ I: m items / T: n transactions / itemset: a set of items / k-itemset: an itemset with k items
→ X is associated with Y in a certain way, where X, Y ⊆ I and X ∩ Y = ∅
→ support and confidence
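
To make these two measures concrete, here is a minimal Python sketch (the toy transactions below are my own, echoing the bread/milk/banana example):

def support(X, T):
    # Fraction of the transactions in T that contain every item of X.
    return sum(1 for t in T if X <= t) / len(T)

def confidence(X, Y, T):
    # Pr(Y | X) = supp(X ∪ Y) / supp(X).
    return support(X | Y, T) / support(X, T)

T = [
    {"bread", "milk", "banana"},
    {"bread", "milk"},
    {"bread", "milk", "banana", "eggs"},
    {"eggs", "banana"},
]
X, Y = {"bread", "milk"}, {"banana"}
print(support(X | Y, T))    # 0.5 -> the support of the rule X => Y
print(confidence(X, Y, T))  # 0.666... -> the confidence of X => Y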
[Goal]
Find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf).
(Note 1) Transaction data does not take purchase quantity or price into account.
(Note 2) Minimum support only needs to be enforced on the itemsets themselves (X ∪ Y); X and Y, being subsets, then satisfy it automatically.
In fact, all algorithms find the same set of rules, because for a given minsup and minconf the association rules are unique; the differences show up only in computational efficiency and memory requirements, i.e., algorithms are compared on their time and space complexity.
Here we introduce the most common one: the Apriori algorithm.
[Apriori]
Main ideas:
(1) Define frequent itemset: an itemset whose support ≥ minsup.
(2) Since every subset of a frequent itemset must itself be frequent (the downward-closure property), Apriori uses the current frequent itemsets (also called large itemsets) to derive the candidates of the next level.
(3) Ck = the candidates of size k: given F(k-1), the k-itemsets that could still be frequent.
Fk = the frequent itemsets of size k.
ex. minsup = 0.5

TID | ITEMS
t1  | 1, 3, 4
t2  | 2, 3, 5
t3  | 1, 2, 3, 5
t4  | 2, 5
1st scan of T:
C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
F1: {1}:2, {2}:3, {3}:3, {5}:3
C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2nd scan of T:
C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
*C3: {2,3,5} (a 3rd scan of T then gives F3: {2,3,5}:2)
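
The trace above can be checked mechanically. Below is a small brute-force sketch (my own verification code, not Apriori itself, since it enumerates every k-itemset):

from itertools import combinations

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
minsup = 0.5
items = sorted(set().union(*T))
for k in (1, 2, 3):
    counts = {c: sum(1 for t in T if set(c) <= t) for c in combinations(items, k)}
    Fk = {c: n for c, n in counts.items() if n / len(T) >= minsup}
    print(f"F{k}:", Fk)
# F1: {(1,): 2, (2,): 3, (3,): 3, (5,): 3}
# F2: {(1, 3): 2, (2, 3): 2, (2, 5): 3, (3, 5): 2}
# F3: {(2, 3, 5): 2}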
Candidate generation consists of two steps: join and prune. Take F2 above as an example:
{1,3}, {2,3}, {2,5}, {3,5}
(1) The join step yields the possible 3-itemsets {1,2,3}, {1,3,5}, {2,3,5}.
(2) The prune step deletes the impossible itemsets {1,2,3} and {1,3,5}.
Reason: if {1,2,3} were in C3, then {1,2}, {1,3}, {2,3} would all have to be in F2, which is not the case ({1,2} is not frequent; likewise {1,5} rules out {1,3,5}).
=> This yields C3: {2,3,5}. (A code sketch of these two steps follows.)
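
A possible Python sketch of join and prune as described above; the function name candidate_gen and the frozenset representation are my own choices:

def candidate_gen(F_prev, k):
    # join: the union of any two frequent (k-1)-itemsets that has exactly k items
    candidates = {a | b for a in F_prev for b in F_prev if len(a | b) == k}
    # prune: drop any candidate that has an infrequent (k-1)-subset
    return {c for c in candidates if all(c - {i} in F_prev for i in c)}

F2 = {frozenset(s) for s in [{1, 3}, {2, 3}, {2, 5}, {3, 5}]}
print(candidate_gen(F2, 3))   # {frozenset({2, 3, 5})}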
Pseudo Code
Apriori(T) {
    C1 = init-pass(T);                                // 1st scan: count each single item
    F1 = { f | f ∈ C1 and f.count/n ≥ minsup };       // n = # of transactions
    for (k = 2; F(k-1) ≠ ∅; k++) {
        Ck = candidate-gen(F(k-1));
        for (t ∈ T)                                   // k-th scan of T
            for (c ∈ Ck)
                if (c is contained in t)
                    c.count++;
        Fk = { c | c ∈ Ck and c.count/n ≥ minsup };
    }
    return the union of all Fk;
}

candidate-gen(F(k-1)) {
    Ck = ∅;
    for (fi, fj ∈ F(k-1) such that |fi ∪ fj| = k) {
        c = fi ∪ fj;                                  // join
        append c to Ck;
        for ((k-1)-subset s of c)
            if (s ∉ F(k-1))
                delete c from Ck;                     // prune
    }
    return Ck;
}
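
As a runnable counterpart to the pseudocode, here is a Python sketch (it reuses the candidate_gen sketch above; transactions are assumed to be Python sets and minsup a fraction):

from collections import Counter

def apriori(T, minsup):
    # Level-wise search, following the pseudocode above.
    # T: list of sets; minsup: fraction in [0, 1]; returns {itemset: count}.
    n = len(T)
    C1 = Counter(item for t in T for item in t)               # init-pass: 1st scan
    Fk = {frozenset([i]): c for i, c in C1.items() if c / n >= minsup}
    frequent = dict(Fk)
    k = 2
    while Fk:                                                 # loop while F(k-1) is nonempty
        Ck = candidate_gen(set(Fk), k)                        # join + prune (sketch above)
        counts = {c: sum(1 for t in T if c <= t) for c in Ck} # k-th scan of T
        Fk = {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}
        frequent.update(Fk)
        k += 1
    return frequent

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(T, 0.5))   # reproduces F1, F2 and F3 = {2,3,5}:2 from the example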
After the Apriori algorithm has found the frequent itemsets, rules are then generated from them.
For each frequent itemset X, take every nonempty subset A and let B = X - A; then A => B is an association rule if confidence(A => B) = supp(X) / supp(A) ≥ minconf. (A sketch of this step follows.)
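
A possible sketch of this rule-generation step, reusing the counts returned by the apriori sketch above, so no further scan of T is needed (generate_rules is my own name):

from itertools import combinations

def generate_rules(frequent, minconf):
    # For each frequent itemset X and nonempty A ⊂ X with B = X - A,
    # output A => B when supp(X)/supp(A) = X.count/A.count >= minconf.
    rules = []
    for X, x_count in frequent.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):
            for A in map(frozenset, combinations(X, r)):
                conf = x_count / frequent[A]   # counts already recorded; no extra scan
                if conf >= minconf:
                    rules.append((set(A), set(X - A), conf))
    return rules

for A, B, conf in generate_rules(apriori(T, 0.5), minconf=0.8):
    print(A, "=>", B, f"(conf = {conf:.2f})")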
Finally, a summary of some properties of the Apriori algorithm:
(1) Level-wise search
(2) the search space is O(2^m), where m = # of items
(3) K = the size of the largest itemset; K ≈ 10 in practice
(4) when computing confidence(A => B) or supp(A), all the needed counts have already been recorded, so there is no need to scan T again.
[Reference]
1. Bing Liu, UIC
2. Wikipedia