Abstract: In today's knowledge economy, the management and classification of patents are crucial for protecting innovation. As the number of patent applications continues to grow, traditional manual classification is inefficient and costly, making accurate and efficient automated patent classification methods imperative. In recent years, advances in natural language processing, particularly pre-trained language models such as BERT and SBERT, have delivered strong performance on text classification tasks, opening new opportunities for automated patent classification. This study explores how SBERT-based machine learning and deep learning methods can improve the accuracy of patent document classification. We assess the effectiveness of SBERT in handling the complexity and volume of patent text and compare the performance of various pre-trained models on patent classification tasks. To validate the proposed methods, the study uses published Taiwanese patent data from 2015 to 2023, totaling 136,013 patent cases, of which 115,008 are used as the training set and the remaining 21,005 as the test set. In the experiments, we employ 10 different pre-trained models to extract features from different combinations of patent text fields, including titles, abstracts, claims, and descriptions. We then apply cosine similarity-based classification and machine learning classifiers to predict International Patent Classification (IPC) codes at multiple levels. The models and classification strategies are evaluated comprehensively using accuracy, recall, F1 score, and other metrics. The experimental results show that the SBERT-based DBMC_V1 model, using the patent description text as features and a cosine similarity-based optimistic approach for classification, achieves the best performance on the three-level IPC classification task. In addition, we find that adopting different model combination strategies for different data can further improve classification performance. The SBERT-based approach demonstrates clear advantages in patent classification, but several limitations remain, such as class imbalance in the data and the lack of task-specific model optimization, which should be addressed in future work.
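The abstract describes classifying patents by comparing SBERT embeddings of patent text with cosine similarity. The following is a minimal sketch of that idea, assuming the sentence-transformers library; the checkpoint name, the toy data, and the simple nearest-neighbour decision rule are illustrative assumptions only and do not correspond to the DBMC_V1 model or the exact "optimistic" rule used in the study.

```python
# Minimal sketch: cosine-similarity-based IPC prediction over SBERT embeddings.
# Assumptions: the checkpoint, toy data, and 1-NN decision rule are illustrative only.
from sentence_transformers import SentenceTransformer, util

# Any sentence-transformers checkpoint can be used; the study compared 10 pre-trained models.
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# Toy training patents: (description text, IPC code). Real data would hold 115,008 cases.
train_patents = [
    ("A rechargeable lithium battery with a solid electrolyte ...", "H01M"),
    ("A convolutional neural network architecture for image recognition ...", "G06N"),
    ("A method for classifying patent documents using text embeddings ...", "G06F"),
]
train_texts = [text for text, _ in train_patents]
train_labels = [ipc for _, ipc in train_patents]

# Encode the training descriptions once; in practice these embeddings would be cached.
train_emb = model.encode(train_texts, convert_to_tensor=True, normalize_embeddings=True)

def predict_ipc(description: str) -> str:
    """Return the IPC code of the most similar training patent (1-NN by cosine similarity)."""
    query_emb = model.encode(description, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(query_emb, train_emb)[0]  # cosine similarity to every training patent
    return train_labels[int(sims.argmax())]

print(predict_ipc("An anode material that improves the cycle life of lithium cells"))
```

In the same spirit, the reported accuracy, recall, and F1 scores could be computed with standard tooling (e.g. scikit-learn's classification_report) over the predicted and true IPC codes of the test set.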