Association rule mining (1)

July 14, 2013

[***以下為個人理解，如有錯誤歡迎指證***]

[***This article may contain mistakes and thus any correction is greatly appreciated***]

[Background]
由Agrawal et al在1993年提出, 最初用於市場分析, 主要目的為找出顧客所購買的商品中, 商品間的關聯性. 目前在database及data-ming領域上被廣泛研究.
[Model] 這裡討論 data 跟 rule
在Association Rule中, 假設所有的data已經被分類.
(1) 有 m 個商品(item), 商品集合 I = {i1, i2, ..., im}
(2) 有 n 筆交易(transaction), 交易集合 T = {t1, t2, ..., tn}, 且 t ∈ I
ex. t1 = {bread, cheese, milk}, t2 = {apple, eggs, salt, yogurt}, ..., tn = {biscuit, eggs, banana}
(3) itemset : set of items; ex. {i2, i7, i11}
(4) k-itemset : a itemset with k items: ex. {i2, i7, i11} is a 3-itemset
若將每個transaction視為一份text文件, 則每個文件包含一堆keywords(像是bread, apple, banana等)

[Definition]

Association rule: an implication of the form X => Y where X, Y ⊆ I and X ∩ Y = Null.
X,Y為屬於I的itemset, 且 X, Y∈ i.
ex. X = {bread, milk}, Y = {banana} 白話的說, 就是當某筆交易中買了 bread跟milk, 同時也會買 banana.

[Rule strength measure]
Support: The rule holds with support SUP in T if SUP% of transactions contain X.
# Supp(X) = X.count / n = X在T中佔的比例.
Confidence: The rule holds in T with confidence CONF if CONF% of transactions contain X also contain Y.
# confidence = (X∪Y).count/X.count
Confidence = Pr(X|Y) = Pr(X∪Y)|Pr(X). 是指T中X,Y皆發生所佔的比例.
=> Association rule 是一種表示方式, 說明當X發生時, Y以某種機率發生.

回想：
→ data n rule
b. I(m) / T{n} / T_i: itemset / k items in T_i : k-itemset

→ X is associated with Y in a certain way where X,Y belong to I and intersection of X and Y is empty.

→ support n confidence

[Goal]
找到能夠滿足user定義的 minimum support (minisup)與minimum confidence(miniconf)的rules
(註 1) 在 transaction data中, 並無考慮購買的數量及價格.
(註 2) minimum support 的設定只要針對 X, Y自然會符合.
事實上所有的 algorithms 找到的 rules 結果都相同, 因為 association rules 是唯一的, ~~僅表現在 computational efficiencies 與 memory requirements上的差異~~., algorithms 間比較的是在 time & space complexity 上的不同。

這裡介紹最常見的 Apriori Algorithm

[Apriori]
主要原則：
(1) 定義 frequent itemset: an itemset whose support ≥ minsup
(2) 由於每個frequent itemset的子集一定也是frequent itemset, Apriori 使用目前的 frequent itemsets (or large itemsets) 來推論出下一個 frequent itemset
(3)
Ck = candidate of size k: given Fk-1, those itemsets can be frequent
Fk = frequent itemset of size k.

ex.

minsup = 0.5
TID	ITEMS
t1	1,3,4
t2	2,3,5
t3	1,2,3,5
t4	2,5

1st Scan T,
C1: {1}:3, {2}:3,{3}:3,~~{4}:1~~,{5}:2
F1: {1}:3, {2}:3,{3}:3,{5}:2
C2: {1,2},{1,3},{1,5},{2,3},{2,5},{3,5}
2nd Scan T,
C2: ~~{1,2}:1~~,{1,3}:2,~~{1,5}:1~~,{2,3}:2,{2,5}:3,{3,5}:2
F2: {1,3}:2,{2,3}:2,{2,5}:3,{3,5}:2
*C3: {2,3,5}
Candidate 的產生主要包含兩個步驟：join 與 prune。以此為例：
{1,3},{2,3},{2,5},{3,5}
(1) join 後產生的可能 3-itemset 為 {1,2,3}, {1,3,5}, {2,3,5}
(2) prune 為刪除不可能的 itemset {1,2,3} 與 {1,3,5}
理由: 若 {1,2,3}為C3, 則 {1,2}, {1,3}, {2,3} 須皆是 F2 (與事實不符)
=> 得 C3: {2,3,5}

Pseudo Code


Apriori Algorithm(T) {

   C1 = init-pass(T);

   F1 = { f | f ∈ C1 and f.count/n ≥ minsup}, where n = # of transactions

  

   for (k=2;  Fk   K ;k++)

     Ck = candidate-gen( Fk-1 );

     for ( t ∈ T )

       for ( c ∈ Ck )

          if c is contained in t; then

             c.count ++;

       end

     end

     Fk = { c | c ∈ Ck and c.count/n ≥ minsup};

  end

}




candidate-gen ( Fk-1 ) {

    Ck  = Null;

    for fi, fj ∈ Fk-1

         c = fi ∪ fj , where c's size is k     // join

       Append c into Ck;

       for (k-1)-subset s in c

          if s not in Fk-1; then

             delete c from Ck;   // prune

          endif

      end

   end

   return Ck;

}

透過 Apriori Algorithm 找到 frequent itemsets 後, 接著從 frequent itemsets 產生 rules.
對於每個 itemset X, 存在子集 A與 B = X -A, 使得 A =>B 為 association rule, 其中 confidence (B|A) ≥ minconf.

最後總結Apriori Algorithm 的一些特性:
(1) Level-wise search
(2) space complicity O(2^m), m = # of items
(3) K = 最大 itemset 的 size, K = (10) in practice
(4) 在計算 support(B|A) 或 support(A) 時, 由於所有計算已經被記錄, 所以不需要再重復scan T.

[Reference]
1. Bing Liu, UIC
2. Wiki

Search This Blog

JOGG's

Association rule mining (1)

Comments

Post a Comment

Popular posts from this blog

股票評價(Stock Valuation) - 股利折現模型

openwrt feed的使用

R 語言：邏輯回歸 Logistic Regression using R language （二）