PTT Disaster Events Extraction System

Name

Chia-Feng Chiang (蔣佳峰)

Department

Department of Computer Science and Information Engineering

Thesis Title

PTT Disaster Events Extraction System
(PTT災害事件擷取系統)

Files

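This electronic thesis is approved for immediate open access.
Once released, the electronic full text is licensed to users only for academic research purposes, for personal, non-profit searching, reading, and printing.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast the work without authorization, to avoid violating the law.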

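Abstract (translated from the Chinese)

Taiwan is frequently struck by natural disasters: typhoons in summer and earthquakes at unpredictable times both disrupt daily life severely. When such a disaster reaches alert level, relief agencies must grasp the situation quickly and dispatch resources to the affected area. Reducing post-disaster losses, however, also depends on residents of the affected area actively reporting urgent information to those agencies. During a disaster, these reports are normally made by telephone. Notably, the number of reports tends to surge right after a disaster strikes, and if the agencies lack staff to answer calls, this can become an obstacle to grasping the situation quickly.
With the rapid growth of network communication, consumer electronics (3C products) have become increasingly widespread and exchanging messages online has become more convenient, so disaster information may also circulate on social networks while a disaster is unfolding. We therefore open up another channel for obtaining it: social media.
This task involves information extraction, that is, pulling specific pieces of information out of unstructured text and storing them in a database. In this thesis we build a PTT disaster event extraction system that uses PTT (批踢踢實業坊) as its information source, periodically crawls the posts that users publish, and applies named entity recognition (NER) to extract the disaster name, disaster location, and damage description, from which a disaster event report is assembled.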
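As a rough illustration of how per-token NER output could be turned into such a report, the sketch below decodes BIO tags into spans and groups them by field. The label names (`NAME`, `LOC`, `DESC`), the `DisasterReport` class, and the single merged tag sequence are assumptions made for this example; the record does not describe the thesis's tag scheme or report format, and the thesis itself trains three separate models.

```python
from dataclasses import dataclass, field

# Hypothetical label names; the record does not say how the thesis names its tags.
FIELD_BY_LABEL = {"NAME": "names", "LOC": "locations", "DESC": "descriptions"}


@dataclass
class DisasterReport:
    names: list = field(default_factory=list)
    locations: list = field(default_factory=list)
    descriptions: list = field(default_factory=list)


def decode_bio(tokens, tags):
    """Group tokens into (label, text) spans from B-/I-/O tags."""
    spans = []
    label, chars = None, []

    def flush():
        # Close the span currently being built, if any.
        if label is not None:
            spans.append((label, "".join(chars)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            label, chars = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            chars.append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            flush()
            label, chars = None, []
    flush()
    return spans


def build_report(tokens, tags):
    """Collect decoded spans into the three report fields."""
    report = DisasterReport()
    for label, text in decode_bio(tokens, tags):
        getattr(report, FIELD_BY_LABEL[label]).append(text)
    return report


# Made-up sentence: "a typhoon caused flooding in Tainan", tagged per character.
tokens = list("颱風造成台南淹水")
tags = ["B-NAME", "I-NAME", "O", "O", "B-LOC", "I-LOC", "B-DESC", "I-DESC"]
report = build_report(tokens, tags)
print(report)
```

Tagging per character avoids a separate word-segmentation step for Chinese text, which is a common choice but again an assumption here.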
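The thesis has three parts. The first is preprocessing: a web crawler analyzes the HTML structure of the PTT web interface and periodically fetches and stores large numbers of posts from the regional boards across Taiwan and from the Gossiping board. The second is post classification: classification features are obtained automatically from the training data, a classification model is built with an SVM, and the large volume of posts is filtered down to those that are actually disaster related. The third is named entity extraction: using NER_Tool, provided by the WIDM lab at National Central University, with conditional random fields (CRF) as the learning algorithm, we build three recognition models for disaster name, disaster location, and damage description. In experiments against manually labeled test data, every model achieves an F-measure above 0.7 under exact match and above 0.75 under partial match.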
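To make the two evaluation settings concrete, here is a minimal sketch of span-level precision, recall, and F-measure. It assumes that spans are half-open character offsets, that "exact match" means identical boundaries, and that "partial match" means any overlap; the record does not give the thesis's precise definitions, and the spans below are made up.

```python
def exact(gold_span, pred_span):
    """Spans match only if both boundaries agree."""
    return gold_span == pred_span


def partial(gold_span, pred_span):
    """Spans match if the half-open ranges [start, end) overlap at all."""
    return gold_span[0] < pred_span[1] and pred_span[0] < gold_span[1]


def f_measure(gold, predicted, match):
    """Return (precision, recall, F1) for two lists of (start, end) spans."""
    hit_pred = sum(1 for p in predicted if any(match(g, p) for g in gold))
    hit_gold = sum(1 for g in gold if any(match(g, p) for p in predicted))
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up annotations for one post.
gold = [(0, 2), (4, 6), (10, 14)]
pred = [(0, 2), (4, 7), (20, 22)]

print(f_measure(gold, pred, exact))    # one of three spans has identical boundaries
print(f_measure(gold, pred, partial))  # (4, 7) also counts because it overlaps (4, 6)
```

Under these definitions partial match can never score lower than exact match on the same predictions, which is consistent with the higher threshold reported for it.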

Abstract (English)

Taiwan is frequently affected by natural disasters such as typhoons and earthquakes. When a disaster reaches alert level, disaster relief units must grasp the situation quickly. Reducing losses effectively also depends on active reports from people in the affected areas. During a disaster, this information is generally relayed to the relief units by telephone. Notably, the number of reports surges after a disaster, and relief units that are short of manpower struggle to handle them. This becomes a bottleneck in grasping disaster information.
With the development of the Internet, the penetration of 3C (consumer electronics) products has gradually increased, and exchanging information online has become more convenient. When a disaster occurs, disaster information may also be exchanged this way. We therefore propose another way of obtaining disaster information: collecting it from social networks.
This task involves information extraction, which extracts specific information from unstructured text and stores it in a database. In this thesis we build a PTT disaster event extraction system that uses PTTWeb as its information source, crawls it regularly with a web crawler, and uses named entity recognition to identify disaster information such as "disaster name", "disaster location", and "damage description" in order to produce disaster reports.
The thesis is divided into three parts. The first part is preprocessing, in which a web crawler fetches PTT posts. The second part is post classification, in which an SVM classification model picks out the disaster-related posts. The third part is named entity recognition. The training tool is provided by the NCU WIDM lab, and conditional random fields are used as the training algorithm. We built three models: disaster name, disaster location, and damage description. In experiments, the models achieve an F-measure higher than 0.7 under exact match and higher than 0.75 under partial match.
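For the preprocessing stage, the sketch below shows one way a crawler might pull post links and titles out of a board index page using only Python's standard library. The markup in `SAMPLE` (a `div` with class `title` wrapping a link) is an assumption about the page layout, the URL is made up, and `BoardIndexParser` is a hypothetical name; none of this comes from the thesis, and no network request is made.

```python
from html.parser import HTMLParser


class BoardIndexParser(HTMLParser):
    """Collect (href, title) pairs from links inside <div class="title">."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._in_title_div = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "title":
            self._in_title_div = True
        elif tag == "a" and self._in_title_div:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a captured link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.posts.append((self._href, "".join(self._text).strip()))
            self._href = None
        elif tag == "div":
            self._in_title_div = False


# Assumed markup with a made-up URL; the second entry mimics a deleted post without a link.
SAMPLE = """
<div class="r-ent">
  <div class="title"><a href="/bbs/Tainan/M.0000000001.A.001.html">[問題] 永康淹水了嗎</a></div>
</div>
<div class="r-ent">
  <div class="title">(本文已被刪除)</div>
</div>
"""

parser = BoardIndexParser()
parser.feed(SAMPLE)
print(parser.posts)
```

A real crawler would also need to fetch pages on a schedule, follow each link to the post body, and store the results, which this sketch leaves out.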

Keywords (Chinese)

★ Named Entity Extraction
★ Disaster Events
★ Information Extraction

Keywords (English)

★ NER
★ Disaster Events
★ Information Extraction

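Table of Contents

Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Background
1.3 Chapter Overview
2 Related Work
2.1 Social Media and Disaster Events
2.2 Classification
2.3 Named Entity Recognition
3 System Architecture and Methods
3.1 Data Collection and Processing
3.1.1 Data Acquisition and Warehousing
3.1.2 Data Processing for Disaster Names
3.1.3 Data Processing for Disaster Locations
3.1.4 Data Processing for Damage Descriptions
3.2 Post Classification
3.2.1 Odds Ratio
3.2.2 Chi-Square Test
3.2.3 Information Gain
3.3 Disaster Event Extraction
4 Experiments and System Performance
4.1 Performance Evaluation of Post Classification
4.2 Performance Evaluation of Named Entity Recognition
4.3 Performance Evaluation of the Disaster Name Model
4.4 Performance Evaluation of the Disaster Location Model
4.5 Performance Evaluation of the Damage Description Model
5 Conclusion
6 References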
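Sections 3.2.1 to 3.2.3 name three measures for scoring candidate classification features. The sketch below computes the textbook form of each from a 2x2 table of document counts for a single term. The variable names, the smoothing constant, and the counts are assumptions for illustration; the thesis's exact formulation is not given in this record.

```python
import math

# Counts for one term (hypothetical layout):
#   a = disaster posts containing the term      b = other posts containing it
#   c = disaster posts without the term         d = other posts without it


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def odds_ratio(a, b, c, d, smooth=0.5):
    """Odds ratio; `smooth` is added to each cell to avoid division by zero."""
    return ((a + smooth) * (d + smooth)) / ((b + smooth) * (c + smooth))


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(k / total * math.log2(k / total) for k in counts if k)


def information_gain(a, b, c, d):
    """Reduction in class entropy from knowing whether the term is present."""
    n = a + b + c + d
    prior = entropy([a + c, b + d])
    conditional = (a + b) / n * entropy([a, b]) + (c + d) / n * entropy([c, d])
    return prior - conditional


# Made-up counts: the term appears in 40 of 50 disaster posts and 10 of 50 others.
print(chi_square(40, 10, 10, 40))
print(odds_ratio(40, 10, 10, 40))
print(information_gain(40, 10, 10, 40))
```

Terms would then be ranked by one of these scores and the top-ranked ones kept as classifier features, which is one common way to select features automatically from training data.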

Advisor

Chia-Hui Chang (張嘉惠)

Approval Date

2017-8-24
