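With the proliferation of computers and handheld devices worldwide and the spread of Internet access in emerging markets, the volume of electronic documents inside and outside organizations is growing exponentially. According to a report by IDC (the International Data Corporation), an estimated 40 ZB of data will be generated annually by 2020. IDC further notes that nearly 80% of the data in an organization is textual. The IDC report suggests that mining "unstructured, textual data", that is, text mining, still has considerable room for application and development; the Massachusetts Institute of Technology has even named natural language processing and text mining among the important technologies of the coming decade.

Textual data in organizations is written in natural human language, and its content is diverse, complex, and idiosyncratic. Classifying such data by hand is uneconomical and difficult and, more importantly, there is no generally accepted standard for doing so. This study therefore proposes extracting keywords and term frequencies from already classified documents, computing the cosine similarity between a query document and the text mining model, and using the resulting similarity to help staff classify documents correctly and work more efficiently.

This study addresses the most burdensome step the sales department faces when handling customer inquiries: converting customer requirement specifications into product part numbers. The conversion is currently performed manually, which makes it error-prone and inefficient. To address these problems, text mining and cosine similarity are used to obtain the similarity between customer requirement specifications and product part numbers, and sales staff then use that similarity to complete the conversion quickly. A test data set supplied by the case company was used for testing and validation. Using the system prototype developed in this study, text mining was performed on three groups of customer requirement specifications; specifications were then randomly sampled from the test data set and their cosine similarity was computed, with the part number of highest similarity taken as the conversion result. All sampled specifications were converted to the correct part number. After trying the system and discussing it, users judged it to have high adoption value and confirmed that it can improve both the accuracy of manual classification and the efficiency of handling customer inquiries.

With the growth of information technology and the popularity of smartphones, electronic documents inside and outside companies will continue to increase exponentially. IDC forecasts that 40 ZB of data will be generated per year by 2020. It also states that unstructured information might account for more than 80% of all data in organizations. Text analysis tools have emerged as must-have tools for enterprises seeking insights for informed decision making and other processes. Today, an increasing amount of information is held in unstructured and semi-structured formats, and the information that organizations manage (and the additional information that they would like to include) continues to grow and diversify. The primary problem with managing all of this unstructured and semi-structured text data is that there are no standard rules for writing text so that a computer can understand it. First, this paper extracts keywords and word frequencies from classified documents. Second, it calculates the similarity between sample and model documents using cosine similarity. Finally, it classifies each sample document according to its highest similarity.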
The case studied here is the conversion of customer specifications into our product part numbers, a crucial step in handling customer inquiries. Because the entire process is performed manually, mistakes can occur and the work is time-consuming. To address this problem, I used cosine similarity to compute the similarity between specifications and product part numbers. Salespeople then used the similarity to convert specifications into product part numbers rapidly. I developed a text mining system prototype to derive patterns from three different sets of specifications and then computed cosine similarity on randomly sampled specifications; the part number with the highest similarity was taken as the conversion result, and the accuracy was 100%. Text mining can solve high-value information comparison problems and mitigate heavy workloads and operational risks for the sales team.
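
The matching step described above can be illustrated with a minimal Python sketch. It is not the prototype itself: the tokenizer, the function names (`term_frequencies`, `cosine_similarity`, `best_part_number`), the part numbers, and the sample specifications are all hypothetical, and raw term counts stand in for whatever keyword weighting the prototype actually uses.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphanumeric tokens (a simplistic, assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def best_part_number(spec, models):
    """Return the part number whose model is most similar to the spec."""
    query = term_frequencies(spec)
    scores = {pn: cosine_similarity(query, tf) for pn, tf in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical, already classified specifications grouped by part number.
classified_specs = {
    "PN-A100": ["12V DC fan 80mm ball bearing",
                "80mm fan 12V ball bearing low noise"],
    "PN-B200": ["24V DC fan 120mm sleeve bearing",
                "120mm fan 24V sleeve bearing"],
}

# One term-frequency model per part number, built from its classified documents.
models = {pn: term_frequencies(" ".join(docs))
          for pn, docs in classified_specs.items()}

part_number, scores = best_part_number("fan 80mm 12V ball bearing", models)
print(part_number, scores)
```

In this sketch the highest-scoring part number is returned together with all scores, so that a salesperson could review close candidates instead of accepting the top match blindly.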