Abstract (English)
In today's knowledge economy, the management and classification of patents are crucial for protecting innovation. As the number of patent applications grows, traditional manual classification methods have become inefficient and costly, making accurate and efficient automated patent classification imperative. In recent years, advances in natural language processing, particularly pre-trained language models such as BERT and SBERT, have shown excellent performance on text classification tasks, opening new opportunities for automated patent classification. This study explores how SBERT-based machine learning and deep learning methods can improve the accuracy of patent document classification. We assessed the effectiveness of SBERT in handling the complexity and large volume of patent texts and compared the performance of various pre-trained models on patent classification tasks. To validate the proposed methods, this study used publicly available Taiwanese patent data from 2015 to 2023, totaling 136,013 patent cases: 115,008 formed the training set and the remaining 21,005 the test set. In our experiments, we employed 10 different pre-trained models to extract features from various textual components of patents, such as titles, abstracts, claims, and descriptions. We then used cosine similarity-based classification and machine learning classifiers to predict International Patent Classification (IPC) codes at multiple levels, and comprehensively assessed the models and classification strategies using metrics such as accuracy, recall, and F1 score.
The experimental results show that the SBERT-based DBMC_V1 model, using the complete descriptive text of patents as features and a cosine similarity-based optimistic approach for classification, achieves the best performance on the three-level IPC classification task. The study also found that adopting different model combination strategies for different data can further enhance classification effectiveness. The SBERT-based approach demonstrated clear superiority in patent classification tasks, but some limitations remain, such as imbalanced class distributions and a lack of domain-specific model optimization, which should be further explored and addressed in future work.
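The cosine similarity-based classification described above can be sketched as a nearest-neighbor lookup over patent embeddings. The following is a minimal illustration, not the study's actual pipeline: the toy vectors, the 1-NN decision rule, and the example IPC codes are assumptions for demonstration, whereas in practice the embeddings would come from an SBERT model (e.g. via the `sentence-transformers` library's `encode` method).

```python
import numpy as np

# Hypothetical sketch: classify a patent by the IPC code of its most
# cosine-similar training patent. In the study, embeddings would come from
# a pre-trained SBERT model, e.g.:
#   from sentence_transformers import SentenceTransformer
#   embs = SentenceTransformer(model_name).encode(patent_texts)
# Small toy vectors are used here so the sketch is self-contained.

def cosine_similarities(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one vector and each row of a matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

def predict_ipc(query_emb, train_embs, train_labels):
    """Assign the IPC code of the most similar training patent (1-NN)."""
    sims = cosine_similarities(query_emb, train_embs)
    return train_labels[int(np.argmax(sims))]

# Toy training set: two "computing" patents and one "pharmaceutical" patent.
train_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
train_labels = ["G06F", "G06F", "A61K"]

print(predict_ipc(np.array([0.1, 0.95]), train_embs, train_labels))  # → A61K
```

The same embeddings can equally feed a conventional machine learning classifier (e.g. logistic regression or SVM), which is the second strategy compared in the study.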
References
吳柏成 (2022). BERT-based similarity computation for Chinese documents: Application to the classification and clustering of patent documents.
張晉源、管中徽 (2018). IPC or CPC? A comparison and analysis of U.S. patent classification systems. http://www.maxkuan.tw/lib/exe/fetch.php?media=c13.pdf
戴余修 (2021). A patent retrieval method based on the BERT pre-trained model.
Bekamiri, H., Hain, D., & Jurowetzki, R. (2022). PatentSBERTa: A deep NLP based hybrid model for patent distance and classification using augmented SBERT (SSRN Scholarly Paper 4077952). https://papers.ssrn.com/abstract=4077952
Blokhina, Yu. V., & Ilin, A. S. (2021). Use of Patent Classification in Searching for Biomedical Information. Russian Journal of Bioorganic Chemistry, 47(6), 1225–1230. https://doi.org/10.1134/S1068162021060066
Cui, Y., Che, W., Liu, T., Qin, B., Wang, S., & Hu, G. (2020). Revisiting Pre-Trained Models for Chinese Natural Language Processing. Findings of the Association for Computational Linguistics: EMNLP 2020, 657–668. https://doi.org/10.18653/v1/2020.findings-emnlp.58
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arXiv:1810.04805). arXiv. https://doi.org/10.48550/arXiv.1810.04805
Endres, M., Chikkamath, R., Parmar, V. R., & Otiefy, Y. (2022). Patent classification using BERT-for-patents on USPTO. https://opus.bibliothek.uni-augsburg.de/opus4/frontdoor/index/index/docId/98610
Fall, C. J., Törcsvári, A., Benzineb, K., & Karetka, G. (2003). Automated categorization in the international patent classification. ACM SIGIR Forum, 37(1), 10–25. https://doi.org/10.1145/945546.945547
Haghighian Roudsari, A., Afshar, J., Lee, W., & Lee, S. (2022). PatentNet: Multi-label classification of patent documents using deep learning based language understanding. Scientometrics, 127(1), 207–231. https://doi.org/10.1007/s11192-021-04179-4
Henriques, R., Ferreira, A., & Castelli, M. (2022). A Use Case of Patent Classification Using Deep Learning with Transfer Learning. Journal of Data and Information Science, 7(3), 49–70. https://doi.org/10.2478/jdis-2022-0015
Jiang, S., Hu, J., Magee, C. L., & Luo, J. (2024). Deep Learning for Technical Document Classification. IEEE Transactions on Engineering Management, 71, 1163–1179. https://doi.org/10.1109/TEM.2022.3152216
Joshi, U., Hedaoo, M., Fatnani, P., Bansal, M., & More, V. (2022). Patent classification with intelligent keyword extraction. 2022 6th International Conference on Computing, Communication, Control and Automation (ICCUBEA), 1–7. https://doi.org/10.1109/ICCUBEA54992.2022.10010888
Le, Q. V., & Mikolov, T. (2014). Distributed Representations of Sentences and Documents (arXiv:1405.4053). arXiv. https://doi.org/10.48550/arXiv.1405.4053
Lee, J.-S., & Hsiang, J. (2020). Patent classification by fine-tuning BERT language model. World Patent Information, 61, 101965. https://doi.org/10.1016/j.wpi.2020.101965
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach (arXiv:1907.11692). arXiv. https://doi.org/10.48550/arXiv.1907.11692
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space (arXiv:1301.3781). arXiv. https://doi.org/10.48550/arXiv.1301.3781
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality (arXiv:1310.4546). arXiv. http://arxiv.org/abs/1310.4546
Navrozidis, J., & Jansson, H. (2020). Using natural language processing to identify similar patent documents. Lund University. https://www.lunduniversity.lu.se/lup/publication/9008699
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1162
Ramos, J. (2003). Using TF-IDF to determine word relevance in document queries.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (arXiv:1908.10084). arXiv. https://doi.org/10.48550/arXiv.1908.10084
Song, Y., Shi, S., Li, J., & Zhang, H. (2018). Directional skip-gram: Explicitly distinguishing left and right context for word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 175–180). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2028
Ye, C., Qi, J., & Lv, W. (2011). Research on the development of intellectual property economic under the new economic growth. 2011 International Conference on Business Management and Electronic Information, 5, 24–27. https://doi.org/10.1109/ICBMEI.2011.5914422
Yehe, N. (2020). Automatic Patent Classification. https://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-49594
Yoo, Y., Heo, T.-S., Lim, D., & Seo, D. (2023). Multi label classification of Artificial Intelligence related patents using Modified D2SBERT and Sentence Attention mechanism (arXiv:2303.03165). arXiv. https://doi.org/10.48550/arXiv.2303.03165