Regarding large-size data visualization

After finishing prediction, the precision-recall curve data I need to compute comes to roughly 300~500 MB per set — medium-size data, under 1 GB. I expect most of the time to go into two parts: read() and plot().

First, a rough size estimate: assuming 10⁷ records, the file comes to about 180 MB (double: 8 bytes × 2 + tab: 4 bytes per row).
If the output precision is limited (to 4 decimal places), that drops to about 120 MB (float: 4 bytes × 2 + tab: 4 bytes).
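The estimate above can be sketched directly. The trick is that a PR curve only needs about 4 decimal places, so trimming the printed precision shrinks the file before read() or plot() ever touch it. The format strings and sample values here are illustrative, not taken from the original pipeline:

```python
# Sketch: bytes used per "recall<TAB>precision" text row under two
# output precisions, scaled up to 10^7 rows. Illustrative values only.
def row_bytes(fmt, recall=0.123456789, precision=0.987654321):
    """Bytes one 'recall<TAB>precision' line occupies under a given format."""
    return len((fmt % (recall, precision) + "\n").encode("ascii"))

n_rows = 10**7
full_mb = n_rows * row_bytes("%.17g\t%.17g") / 2**20  # full double precision
trim_mb = n_rows * row_bytes("%.4f\t%.4f") / 2**20    # 4 decimal places
print("full: %.0f MB, trimmed: %.0f MB" % (full_mb, trim_mb))
```

The same reasoning applies whatever the delimiter is: the cost is dominated by how many characters each number is printed with, not by the binary width of the type in memory.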
At first I thought a non-interactive environment such as R or another tool might speed up plot(), but after surveying many plotting tools, each either produced ugly output, required redefining the data input format, or still hit a bottleneck on file reading.
Since a PR curve is mainly compared by eye, the data can be reduced right when it is selected; once it is down to a reasonable file size, plot() no longer costs a worrying amount of computation.
Originally, write() alone on the 180 MB file took 220 s, and read()+plot() about 70 s; reduced to 1/10 the size (about 18 MB), it takes under 20 s (ignoring the time the sampling [2] itself needs). Ideally, though, the sampling step should be added when the PR-curve data is first generated — then none of this time needs to be spent at all.
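A minimal sketch of that reduction step, using uniform stride sampling over the curve's points. Since the curve is judged by eye, keeping every k-th point (here 1 in 10) preserves its shape while cutting the data tenfold; the array names and ratio are illustrative:

```python
import numpy as np

def subsample(recall, precision, keep_ratio=0.1):
    """Keep an evenly spaced keep_ratio fraction of the curve's points."""
    step = max(1, int(round(1.0 / keep_ratio)))
    idx = np.arange(0, len(recall), step)
    return recall[idx], precision[idx]

# Toy stand-in for a real PR curve with 10^6 points.
r = np.linspace(0.0, 1.0, 10**6)
p = 1.0 - r
rs, ps = subsample(r, p, keep_ratio=0.1)
print(len(rs))  # → 100000
```

Stride sampling is enough here because PR-curve points are already ordered by threshold; for unordered data, random sampling (e.g. reservoir sampling, as in the Perl script in reference [2]) would be the safer choice.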

-------------------------------------------------------

Yesterday, while finishing a prediction run, I was thinking about how to plot a large amount of predicted data properly. My predicted results are about 300~500 MB, which counts as medium-size data, yet matplotlib still takes a while to compute and plot them — and even longer when several sets must be drawn at once.

I simulated a scenario [3] with 10⁷ transactions, around 180 MB. Here is a trick: the predicted data sits at its maximum size because it is written out in full double precision, so choosing the digit capacity (number of decimal places) properly already shrinks it considerably.

Then I went looking for alternatives that plot data much faster than matplotlib, such as R and gnuplot. I soon figured out that the goal was unreachable for one reason or another: ugly output figures, the need to define a new data format, or bottlenecks in the other software's own file reading.

Since the PR curve is only used to compare the tendency of different results by eye, we do not really have to worry about the data size: preprocessing can reduce it to an adequate amount, say less than 100 MB or so.
By doing that I saved about 1/3 of the time spent on calculation — and in fact the whole issue disappears if sampling is known to be acceptable for the final result and is applied before the PR-curve data is even generated.
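The cleanest place for the sampling is at generation time, as noted above, but when only the finished file already exists, thinning it while reading also avoids holding the full data in memory. A minimal sketch, assuming a tab-separated `<recall>\t<precision>` file; the function name, path handling, and k are hypothetical:

```python
def read_sampled(path, k=10):
    """Read a 'recall<TAB>precision' file, keeping only every k-th line."""
    recall, precision = [], []
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i % k:  # skip all but every k-th line
                continue
            r, p = line.split("\t")
            recall.append(float(r))
            precision.append(float(p))
    return recall, precision
```

Because the file is streamed line by line, peak memory is proportional to the sampled output (1/k of the file), not to the original 300~500 MB.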


[References]
1. "Large plot: ~20 million samples, gigabytes of data" (Stack Overflow)
2. Perl script to handle large-size data by sampling (GitHub)
3. Scenario for drawing 200 MB of data using matplotlib (GitHub)
