About naive Bayes

June 29, 2014

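The book Machine Learning in Action says that before discussing naive Bayes you first need to understand conditional probability.
Its example is simple, but I keep forgetting it, so I'm writing it down here.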
Total: 7 balls, 2 buckets
[Bucket A] 2 gray, 2 black
[Bucket B] 1 gray, 2 black

The conditional probability P(gray|B) is the probability that a ball is gray, given that we already know it was picked from bucket B. Without question, it is 1/3. Also,

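$\begin{align*} P(gray|B) = \frac{P(gray \mspace{2pt} and \mspace{2pt} B)}{P(B)} \quad \text{which reads as} \quad \frac{\text{probability that it is gray and from bucket B}}{\text{probability that it is from bucket B}}\end{align*}$
With only that explanation I could not make sense of P(B) at first. It turns out each statement needs "draw one ball" added in front.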

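P(gray|B): draw one ball; given that it came from B, the probability that it is gray. 1/3
P(B): the probability that the drawn ball came from B. 3/7
P(gray and B): draw one ball; the probability that it is gray and came from B (or, perhaps easier to grasp, the probability that it is the gray ball in B). 1/7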

$\begin{align*} P(gray|B) = \frac{P(gray \mspace{2pt} and \mspace{2pt} B)}{P(B)} = \frac{1/7}{3/7}=\frac{1}{3}\end{align*}$
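
The same numbers can be checked by counting balls directly. This is only a small sketch: the `balls` list and the `prob` helper are names made up for illustration, and it assumes every ball is equally likely to be drawn.

```python
from fractions import Fraction

# Each ball is a (bucket, color) pair, matching the setup above (assumed layout).
balls = (
    [("A", "gray")] * 2 + [("A", "black")] * 2
    + [("B", "gray")] * 1 + [("B", "black")] * 2
)

def prob(event):
    """Probability that a uniformly drawn ball satisfies `event`."""
    hits = sum(1 for ball in balls if event(ball))
    return Fraction(hits, len(balls))

p_b = prob(lambda ball: ball[0] == "B")                  # P(B)
p_gray_and_b = prob(lambda ball: ball == ("B", "gray"))  # P(gray and B)
p_gray_given_b = p_gray_and_b / p_b                      # P(gray|B)

print(p_b, p_gray_and_b, p_gray_given_b)  # 3/7 1/7 1/3
```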

Usually, though, what we want is not P(gray|B) but P(B|gray).
Use Bayes' rule (swap the symbols in a conditional probability statement):

$\begin{align*} P(B|gray) = \frac{P(B)P(gray|B)}{P(gray)}\end{align*}$

$\begin{align*} \text{General form: } P(C|x) = \frac{P(C)P(x|C)}{P(x)}\end{align*}$
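
Plugging the bucket numbers into Bayes' rule gives P(B|gray). A quick sketch, where the variable names are mine and P(gray) = 3/7 comes from counting 3 gray balls among the 7:

```python
from fractions import Fraction

p_b = Fraction(3, 7)             # P(B): the ball comes from bucket B
p_gray_given_b = Fraction(1, 3)  # P(gray|B)
p_gray = Fraction(3, 7)          # P(gray): 3 of the 7 balls are gray

# Bayes' rule: P(B|gray) = P(B) * P(gray|B) / P(gray)
p_b_given_gray = p_b * p_gray_given_b / p_gray
print(p_b_given_gray)  # 1/3
```

That matches counting directly: of the 3 gray balls, exactly 1 sits in bucket B.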

Next, rewrite the classifier introduced earlier from Bayesian decision theory:

If p1(x,y) > p2(x,y), then the class is 1 => P(C1|(x,y))
If p2(x,y) > p1(x,y), then the class is 2 => P(C2|(x,y))

So the Bayesian classification rule can be rewritten as:

$\begin{align*} \text{General form: } P(C_{i}|x_{1}, ..., x_{n}) = \frac{P(C_{i})P(x_{1}, ..., x_{n}|C_{i})}{P(x_{1}, ..., x_{n})}\end{align*}$
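
A note of my own on where "naive" comes from: the classifier assumes the features are conditionally independent given the class, so the joint likelihood in the numerator splits into a product of per-feature terms:

$\begin{align*} P(x_{1}, ..., x_{n}|C_{i}) = \prod_{j=1}^{n} P(x_{j}|C_{i})\end{align*}$

Since the denominator $P(x_{1}, ..., x_{n})$ is the same for every class, comparing classes only requires $P(C_{i})\prod_{j} P(x_{j}|C_{i})$.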

to-do next time:
>Next time, write up how naive Bayes is applied to automatic document classification.

===== 20140701 =====
Example (from the book Machine Learning in Action):
Here are several posts:
1) My dog has flea problems, help please.

2) Maybe not, take him to dog park, stupid.

3) My dalmation is so cute, I love him.

4) Stop posting stupid worthless garbage.

5) Mr Licks ate my steak. How to stop him?

6) Quit buying worthless dog food, stupid.

Goal:

Classify the posts according to whether they contain abusive words.
Classifier (2 classes here)

1: abusive; 0: not abusive

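Approach:

0) Manually label each of the six posts above as abusive or not. This gives the class labels [0 1 0 1 0 1].

1) Convert each post into a token array.

2) Take the unique tokens across all token arrays as the vocabulary list (features).

3) For each post, check whether each token in the vocabulary list is present, producing a vector like [0110...11] (features). This represents the features the post contains. (1: present, 0: absent)

--- At this point the words have been converted into numbers ---

--- We now know which posts each word appears in, and which class each post belongs to ---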
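
Steps 1 to 3 can be sketched roughly as follows. This is not the code from the book or the linked repository: `tokenize`, `build_vocabulary` and `to_vector` are names I made up, and the lowercase, letters-only tokenization is an assumption.

```python
import re

# The six posts and their hand-assigned labels (post 2 uses "dog park" as in the book).
posts = [
    "My dog has flea problems, help please.",
    "Maybe not, take him to dog park, stupid.",
    "My dalmation is so cute, I love him.",
    "Stop posting stupid worthless garbage.",
    "Mr Licks ate my steak. How to stop him?",
    "Quit buying worthless dog food, stupid.",
]
labels = [0, 1, 0, 1, 0, 1]  # 1: abusive, 0: not abusive

def tokenize(text):
    """Step 1: lowercase the post and split it into alphabetic tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(token_lists):
    """Step 2: unique tokens across all posts, sorted for a stable order."""
    return sorted({tok for toks in token_lists for tok in toks})

def to_vector(vocabulary, tokens):
    """Step 3: 1 if the vocabulary word occurs in the post, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

token_lists = [tokenize(p) for p in posts]
vocabulary = build_vocabulary(token_lists)
vectors = [to_vector(vocabulary, toks) for toks in token_lists]

print(len(vocabulary), vectors[0])
```

Each vector has one entry per vocabulary word, so all six vectors have the same length no matter how long the post is.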

Recall the NBC (naive Bayes classifier)

$\begin{align*} \text{General form: } P(C_{i}|x_{1}, ..., x_{n}) = \frac{P(C_{i})P(x_{1}, ..., x_{n}|C_{i})}{P(x_{1}, ..., x_{n})}\end{align*}$

In words: given a post, use the feature vector obtained in step (3) to get the probability that it belongs to class i.
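
To make this concrete, here is a minimal training and classification sketch. It is my own illustration, not the code in the repository linked below. It assumes the features are conditionally independent given the class, uses add-one smoothing so unseen words do not give a zero probability, and sums log probabilities to avoid underflow. The helper names (`tokenize`, `to_vector`, `train`, `classify`) are hypothetical. The denominator P(x1, ..., xn) is dropped because it is the same for both classes.

```python
import math
import re

posts = [
    "My dog has flea problems, help please.",
    "Maybe not, take him to dog park, stupid.",
    "My dalmation is so cute, I love him.",
    "Stop posting stupid worthless garbage.",
    "Mr Licks ate my steak. How to stop him?",
    "Quit buying worthless dog food, stupid.",
]
labels = [0, 1, 0, 1, 0, 1]  # 1: abusive, 0: not abusive

def tokenize(text):
    """Lowercase the post and keep alphabetic tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())

token_lists = [tokenize(p) for p in posts]
vocabulary = sorted({tok for toks in token_lists for tok in toks})

def to_vector(tokens):
    """1 if the vocabulary word occurs in the post, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def train(vectors, labels):
    """Estimate log P(C_i) and log P(word_j | C_i) with add-one smoothing."""
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(labels))
        # Number of posts in class c containing each word, plus 1 for smoothing.
        counts = [1 + sum(col) for col in zip(*rows)]
        total = sum(counts)
        log_likelihood[c] = [math.log(n / total) for n in counts]
    return log_prior, log_likelihood

def classify(tokens, log_prior, log_likelihood):
    """Pick the class with the largest log P(C_i) + sum_j x_j * log P(word_j | C_i)."""
    vec = to_vector(tokens)
    scores = {
        c: log_prior[c] + sum(x * lp for x, lp in zip(vec, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)

vectors = [to_vector(toks) for toks in token_lists]
log_prior, log_likelihood = train(vectors, labels)

print(classify(tokenize("stupid garbage"), log_prior, log_likelihood))     # 1
print(classify(tokenize("love my dalmation"), log_prior, log_likelihood))  # 0
```

Words that are not in the vocabulary are simply ignored by `to_vector`, which is one simple way to handle unseen tokens.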

For the rest, see the source code: https://github.com/ggc2012/ml_nbc.git
