Abstract (English)
In today's knowledge economy, the management and classification of patents are crucial for protecting innovation. As the number of patent applications grows, traditional manual classification methods have become inefficient and costly, making accurate and efficient automated patent classification imperative. In recent years, advances in natural language processing, particularly pre-trained language models such as BERT and SBERT, have shown excellent performance on text classification tasks, opening new opportunities for automated patent classification. This study explores how SBERT-based machine learning and deep learning methods can improve the accuracy of patent document classification. We assessed the effectiveness of SBERT in handling the complexity and large volume of patent texts and compared the performance of various pre-trained models on patent classification tasks. To validate the proposed methods, this study used publicly available Taiwanese patent data from 2015 to 2023, totaling 136,013 patent cases: 115,008 formed the training set and the remaining 21,005 the test set. In our experiments, we employed 10 different pre-trained models to extract features from various textual components of patents, such as titles, abstracts, claims, and descriptions. We then used cosine similarity-based classification and machine learning classifiers to predict International Patent Classification (IPC) codes at multiple levels, and comprehensively assessed the models and classification strategies using metrics such as accuracy, recall, and F1 score.
The experimental results show that the SBERT-based DBMC_V1 model, using the complete descriptive text of patents as features and a cosine similarity-based optimistic approach for classification, achieves the best performance on the three-level IPC classification task. The study also found that adopting different model combination strategies for different data can further enhance classification effectiveness. The SBERT-based approach demonstrated clear superiority in patent classification tasks, but some limitations remain, such as imbalanced class distributions and a lack of domain-specific model optimization, which should be further explored and addressed in future work.
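The cosine similarity-based classification described above can be sketched as a nearest-neighbor lookup over patent embeddings. The following is a minimal illustration, not the study's actual pipeline: the toy vectors, the 1-NN decision rule, and the example IPC codes are assumptions for demonstration, whereas in practice the embeddings would come from an SBERT model (e.g. via the `sentence-transformers` library's `encode` method).

```python
import numpy as np

# Hypothetical sketch: classify a patent by the IPC code of its most
# cosine-similar training patent. In the study, embeddings would come from
# a pre-trained SBERT model, e.g.:
#   from sentence_transformers import SentenceTransformer
#   embs = SentenceTransformer(model_name).encode(patent_texts)
# Small toy vectors are used here so the sketch is self-contained.

def cosine_similarities(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one vector and each row of a matrix."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

def predict_ipc(query_emb, train_embs, train_labels):
    """Assign the IPC code of the most similar training patent (1-NN)."""
    sims = cosine_similarities(query_emb, train_embs)
    return train_labels[int(np.argmax(sims))]

# Toy training set: two "computing" patents and one "pharmaceutical" patent.
train_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
train_labels = ["G06F", "G06F", "A61K"]

print(predict_ipc(np.array([0.1, 0.95]), train_embs, train_labels))  # → A61K
```

The same embeddings can equally feed a conventional machine learning classifier (e.g. logistic regression or SVM), which is the second strategy compared in the study.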
References
吳柏成 (2022). BERT-based similarity computation for Chinese documents: Application to the classification and clustering of patent documents.
張晉源、管中徽 (2018). IPC or CPC? A comparison and analysis of U.S. patent classification systems. http://www.maxkuan.tw/lib/exe/fetch.php?media=c13.pdf
戴余修 (2021). A patent retrieval method based on the BERT pre-trained model.
Bekamiri, H., Hain, D., & Jurowetzki, R. (2022). PatentSBERTa: A deep NLP based hybrid model for patent distance and classification using augmented SBERT (SSRN Scholarly Paper 4077952). https://papers.ssrn.com/abstract=4077952
Blokhina, Yu. V., & Ilin, A. S. (2021). Use of Patent Classification in Searching for Biomedical Information. Russian Journal of Bioorganic Chemistry, 47(6), 1225–1230. https://doi.org/10.1134/S1068162021060066
Cui, Y., Che, W., Liu, T., Qin, B., Wang, S., & Hu, G. (2020). Revisiting Pre-Trained Models for Chinese Natural Language Processing. Findings of the Association for Computational Linguistics: EMNLP 2020, 657–668. https://doi.org/10.18653/v1/2020.findings-emnlp.58
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arXiv:1810.04805). arXiv. https://doi.org/10.48550/arXiv.1810.04805
Endres, M., Chikkamath, R., Parmar, V. R., & Otiefy, Y. (2022). Patent classification using BERT-for-patents on USPTO. https://opus.bibliothek.uni-augsburg.de/opus4/frontdoor/index/index/docId/98610
Fall, C. J., Törcsvári, A., Benzineb, K., & Karetka, G. (2003). Automated categorization in the international patent classification. ACM SIGIR Forum, 37(1), 10–25. https://doi.org/10.1145/945546.945547
Haghighian Roudsari, A., Afshar, J., Lee, W., & Lee, S. (2022). PatentNet: Multi-label classification of patent documents using deep learning based language understanding. Scientometrics, 127(1), 207–231. https://doi.org/10.1007/s11192-021-04179-4
Henriques, R., Ferreira, A., & Castelli, M. (2022). A Use Case of Patent Classification Using Deep Learning with Transfer Learning. Journal of Data and Information Science, 7(3), 49–70. https://doi.org/10.2478/jdis-2022-0015
Jiang, S., Hu, J., Magee, C. L., & Luo, J. (2024). Deep Learning for Technical Document Classification. IEEE Transactions on Engineering Management, 71, 1163–1179. https://doi.org/10.1109/TEM.2022.3152216
Joshi, U., Hedaoo, M., Fatnani, P., Bansal, M., & More, V. (2022). Patent classification with intelligent keyword extraction. 2022 6th International Conference on Computing, Communication, Control and Automation (ICCUBEA), 1–7. https://doi.org/10.1109/ICCUBEA54992.2022.10010888
Le, Q. V., & Mikolov, T. (2014). Distributed Representations of Sentences and Documents (arXiv:1405.4053). arXiv. https://doi.org/10.48550/arXiv.1405.4053
Lee, J.-S., & Hsiang, J. (2020). Patent classification by fine-tuning BERT language model. World Patent Information, 61, 101965. https://doi.org/10.1016/j.wpi.2020.101965
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach (arXiv:1907.11692). arXiv. https://doi.org/10.48550/arXiv.1907.11692
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space (arXiv:1301.3781). arXiv. https://doi.org/10.48550/arXiv.1301.3781
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality (arXiv:1310.4546). arXiv. http://arxiv.org/abs/1310.4546
Navrozidis, J., & Jansson, H. (2020). Using natural language processing to identify similar patent documents. Lund University. https://www.lunduniversity.lu.se/lup/publication/9008699
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543). Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1162
Ramos, J. (2003). Using TF-IDF to determine word relevance in document queries.
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (arXiv:1908.10084). arXiv. https://doi.org/10.48550/arXiv.1908.10084
Song, Y., Shi, S., Li, J., & Zhang, H. (2018). Directional skip-gram: Explicitly distinguishing left and right context for word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 175–180). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2028
Ye, C., Qi, J., & Lv, W. (2011). Research on the development of intellectual property economic under the new economic growth. 2011 International Conference on Business Management and Electronic Information, 5, 24–27. https://doi.org/10.1109/ICBMEI.2011.5914422
Yehe, N. (2020). Automatic Patent Classification. https://urn.kb.se/resolve?urn=urn:nbn:se:hj:diva-49594
Yoo, Y., Heo, T.-S., Lim, D., & Seo, D. (2023). Multi label classification of Artificial Intelligence related patents using Modified D2SBERT and Sentence Attention mechanism (arXiv:2303.03165). arXiv. https://doi.org/10.48550/arXiv.2303.03165