Abstract: With the rapid development of knowledge graphs, related applications have become increasingly abundant, but as data volumes grow, the complexity of graph construction rises as well. If the time cost of building a graph is too high, the real-time performance of downstream services suffers, so efficiently constructing knowledge graphs over large amounts of data is an important issue. This thesis builds a knowledge graph construction system for the RDB-to-RDF scenario on a distributed Hadoop + Spark computing architecture, sets up an experimental environment, and compares the system against Antidot's open-source DB2Triples. The experiments use open library-lending data, totaling nearly 93 million records, to simulate a realistic large-scale data source. The comparison proceeds incrementally, starting from a small amount of data and growing to a large amount, to measure graph-construction performance at different data volumes; the thesis concludes with a comprehensive evaluation based on the experimental results. The results show that with fewer than about 1,000 records, DB2Triples builds the graph faster, but at around 2,000 records the distributed system implemented in this thesis overtakes it; at 10,000 records it is already about six times faster, and the gap widens as the data volume increases.
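The RDB-to-RDF conversion at the core of both systems can be illustrated with a minimal sketch: each relational row is mapped to a set of RDF triples. Plain Python stands in here for what a Spark job would do in parallel (e.g. via a `flatMap` over row partitions); the "loans" table and its column names are hypothetical illustrations, not taken from the thesis.

```python
# Minimal RDB-to-RDF sketch: map each relational row (a dict) to
# N-Triples lines. In the distributed setting, Spark would apply
# row_to_triples across partitions instead of this sequential loop.
# The table name, columns, and base URI below are hypothetical.

BASE = "http://example.org/library/"

def row_to_triples(row):
    """Map one relational row to a list of N-Triples strings."""
    subject = f"<{BASE}loan/{row['loan_id']}>"
    triples = []
    for column, value in row.items():
        if column == "loan_id":  # primary key becomes the subject URI
            continue
        predicate = f"<{BASE}vocab#{column}>"
        triples.append(f'{subject} {predicate} "{value}" .')
    return triples

rows = [
    {"loan_id": 1, "book_title": "Spark in Action", "borrower": "u42"},
    {"loan_id": 2, "book_title": "Semantic Web", "borrower": "u7"},
]

# Sequential here; a Spark RDD/DataFrame would distribute this map.
ntriples = [t for row in rows for t in row_to_triples(row)]
for t in ntriples:
    print(t)
```

The design point is that each row converts independently, which is why the workload parallelizes well and scales with data volume, matching the crossover behavior reported in the experiments.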