Introduction to Recommendation: Notes [1]

September 15, 2013

[***The following is my personal understanding; it may contain mistakes, and any correction is greatly appreciated.***]

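On a senior classmate's recommendation, I have recently been taking a Coursera course called "Introduction to Recommendation" to learn how sites like Amazon, Facebook, YouTube, and Rakuten predict and recommend the information you would want to see.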

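Naturally there was a lot of vocabulary to clear up at the start, so I am recording it here:
1. unstructured data: data whose structure is unknown (stating the obvious), usually text but also audio and video. For example, if you speak anything other than Chinese to someone who only understands Chinese, the meaning of each word is unclear to them, so what you say becomes unstructured data ...
＊So the typical example of structured data is an RDBMS, in other words a database!
＊In everyday life, to understand a foreign language (unstructured data) we start from vocabulary and grammar. On a computer, NLP (natural language processing) is used to work out what the words mean.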

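2. information retrieval: unlike NLP, which tries to understand the meaning of each word, IR relies on statistical regularities rather than on understanding meaning. It is guessing, but a very statistical kind of guessing! The most commonly used measure is TF-IDF (term frequency, inverse document frequency). TF * IDF is used to assess how important a term is to a document within a corpus. Put simply, the more often a word appears in a document (TF) and the fewer other documents it appears in (IDF), the better that word is at telling documents apart.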
Definition:
＊ static content base
＊ dynamic information need
＊＊ invest time in indexing the content up front, so that queries can then be answered in "real time"
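
To make the TF * IDF idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not something from the course: TF is taken as the raw count divided by the document length, IDF as log(N / df), which is just one of several common variants, and the three toy documents and the function name `tf_idf` are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. TF is the raw count divided by the
    document length; IDF is log(N / df). Both are assumed variants.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the movie was a great movie".split(),
    "the book was long".split(),
    "the weather was cold".split(),
]
weights = tf_idf(docs)
# "the" appears in every document, so its IDF (and weight) is zero;
# "movie" appears twice and only in the first document, so it ranks highest there.
print(sorted(weights[0].items(), key=lambda kv: -kv[1]))
```

A word that shows up everywhere gets no weight, while a word concentrated in one document stands out, which is exactly the "good at telling documents apart" intuition.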

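3. information filtering: the opposite of IR. IF has a dynamic content base & a static information need. Something I kept mixing up before is that IF emphasizes the system using ML to actively PUSH data to the user, rather than someone simply writing a filtering rule. Because the data content changes over time, a single rule cannot do the filtering; instead IF builds a user model and keeps updating the user need from feedback.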
A simple example: is user U interested in item X? (Adapted from the IR lecture notes of MSU Prof. Rong Jing)
Idea 1: look at what U likes
=> characterize X, i.e. content-based filtering or CBF (also known as adaptive information filtering / selective dissemination of information)
Idea 2: look at who likes X
=> characterize U, i.e. collaborative filtering or CF (recommendation system)

To find out whether User 3 will like the movie "15 Minutes", it is enough to find users with similar tastes (as measured by their ratings).

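Clearly, CF beats CBF on both space and computation (CBF needs the content of the items before it can analyze anything, and content analysis is especially hard). Its accuracy still needs to be evaluated (or perhaps not, since in this field how fast you get the information matters far more than how precise it is), but either way people's preferences are no longer that hard to estimate (which raises privacy issues). The technical difficulty of CF lies in computing user similarity (e.g. the Pearson correlation coefficient between users, with MAE used for performance evaluation) and in the prediction method (aggregating ratings from similar users). The technical side of CBF is somewhat more complicated, so I will look into it later.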
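
The two CF steps just mentioned (similarity, then aggregation) can be sketched roughly as follows. This is my own minimal sketch, not the course's reference implementation: the rating table is invented, the function names `pearson`, `predict`, and `mae` are hypothetical, and the prediction formula (the user's mean plus a similarity-weighted average of the neighbours' mean-centered ratings) is one common choice that I am assuming here.

```python
import math

# Hypothetical ratings on a 1-5 scale: {user: {item: rating}}.
ratings = {
    "User 1": {"A": 5, "B": 3, "C": 4, "15 Minutes": 4},
    "User 2": {"A": 1, "B": 5, "C": 2, "15 Minutes": 2},
    "User 3": {"A": 4, "B": 2, "C": 5},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def pearson(u, v):
    """Pearson correlation of two users over the items both have rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    mu_u = mean(ratings[u][i] for i in common)
    mu_v = mean(ratings[v][i] for i in common)
    du = [ratings[u][i] - mu_u for i in common]
    dv = [ratings[v][i] - mu_v for i in common]
    denom = math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))
    if denom == 0:
        return 0.0  # one of the users rated everything identically
    return sum(a * b for a, b in zip(du, dv)) / denom

def predict(user, item):
    """User's mean plus similarity-weighted, mean-centered neighbour ratings."""
    base = mean(ratings[user].values())
    num = den = 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        sim = pearson(user, other)
        num += sim * (ratings[other][item] - mean(ratings[other].values()))
        den += abs(sim)
    return base if den == 0 else base + num / den

def mae(pairs):
    """Mean absolute error over (predicted, actual) pairs."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

print(pearson("User 3", "User 1"), pearson("User 3", "User 2"))
print(predict("User 3", "15 Minutes"))
```

In this toy table User 3 correlates positively with User 1 and negatively with User 2, so both neighbours end up nudging the prediction for "15 Minutes" slightly above User 3's own average. To evaluate with MAE, one would hide some known ratings, predict them, and pass the (predicted, actual) pairs to `mae`.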

＊I think the most important concept in CF is the rating. A rating expresses how much a user likes an item, so how to obtain users' ratings is the key step.

＊ Taken literally, "collaborative" means using other users' data to stand in for a given user's rating.

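4. recommenders: since CF can predict what a user might want, the next step is to recommend it. There are many recommenders, though, and they differ in purpose, scope, and so on. In the lectures the instructor introduced an analytical framework for analyzing different recommender systems: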

- Domain, content to commerce and beyond: ex. news, products, matchmaking

- Purpose, ex. sales, education, community

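- Recommendation Context, what is the user doing at the time of recommendation; how does the context constrain the recommender, *I am not very clear on this part. "Context" originally refers to the surrounding text; my guess is that it means how the recommender makes recommendations based on the situation the user is in?!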

- Whose opinion, experts, PHOAKS, or people like you

- Personalization Level, generic/non-personalized, demographic, ephemeral, persistent

- Privacy and Trustworthiness, who knows what about me? is the recommender honest?

- Interface, input and output

- Recommendation Algorithm, non-personalized summary statistics, content-based filtering, collaborative filtering, others

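That is roughly it. Next I plan to cover non-personalized recommendation first, and then explain what ratings and predictions actually are. Also, the first programming homework provides some data so that everyone can get a hands-on feel for non-personalized recommendation.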
