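| Abstract: | Language models enable computers to understand and express human language, which has led to applications in many domains. One such domain is information retrieval, where the language models in common use fall into two main classes: traditional language models, such as TF-IDF (Term Frequency–Inverse Document Frequency), and large language models, such as SentenceTransformer/all-mpnet-base-v2 (ST-AMB2). Because the two classes differ in their characteristics, they retrieve different sets of documents. At the same time, users differ from one another, and past research has found that gender is one of the individual differences that affect information-retrieval behavior. However, those studies mostly focus on how users of different genders behave when searching, and seldom examine the influence of gender differences on information retrieval from the perspective of language models.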
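To fill this research gap, this study takes the rapidly growing digital-learning literature as its basis and analyzes the performance differences between traditional and large language models from the perspectives of language models and gender differences. TF-IDF was chosen to represent the former and ST-AMB2 the latter. More specifically, the research has two parts: “Study 1: the model level” and “Study 2: the learner-feedback level.” Study 1 focuses on how TF-IDF and ST-AMB2 differ in retrieving digital-learning literature, analyzing performance evaluation, document distribution, and keyword differences. Study 2 combines the characteristics of TF-IDF and ST-AMB2 to develop a literature retrieval system and, from the learner’s side, examines how learners of different genders differ when interacting with that system, including button-use frequency, task execution order, and preference for keywords from each language model. The results of the two studies are used to answer the following four research questions: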
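1. How do TF-IDF and SentenceTransformer/all-mpnet-base-v2 differ in performance when identifying keywords in digital-learning literature?
2. How do the keywords identified by TF-IDF and by SentenceTransformer/all-mpnet-base-v2 in digital-learning literature differ in their distribution?
3. How do male and female learners differ in the way they interact with a digital-learning literature retrieval system?
4. Given the keywords provided by each type of model, how do male and female learners differ in the associations they show when judging document relevance?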
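Regarding the first research question, TF-IDF performed stably on precision and error rate but fell short on recall and miss rate; ST-AMB2, by contrast, had higher recall and a lower miss rate but lower precision. These results show that the two are clearly complementary: TF-IDF is good at precise extraction at the lexical level, whereas ST-AMB2 is good at broad retrieval at the semantic level. Regarding the second research question, when a keyword occurred frequently in both the title and the abstract, both models extracted it correctly. When a keyword appeared in full only in the title, TF-IDF extracted it accurately, whereas ST-AMB2 missed it because it occurred too few times. Conversely, when a keyword did not appear explicitly in a document but several semantically similar terms did, ST-AMB2 was able to extract corresponding terms, making up for TF-IDF’s limitation in handling incomplete terms. In addition, TF-IDF tended to extract technical terms with a fixed structure, whereas ST-AMB2 was better at identifying terms whose surface forms vary. Keywords that both models rarely extracted were mostly wording from methodology sections or cross-disciplinary terms. Taken together, these results show that traditional and large language models are complementary in digital-learning literature retrieval.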
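Regarding the third research question, male and female learners showed both differences and some similar patterns in button-use frequency. As for differences, male learners who repeatedly clicked “Back” or returned to the home page to search again were less likely to rate documents as “highly relevant” or “highly irrelevant.” Female learners who frequently clicked the “Help” button more often rated retrieval results as “moderately relevant.” In addition, female learners who frequently used the “Next” button to switch pages mostly selected keywords from the ST-AMB2 model and were more inclined to judge documents as “highly irrelevant.” As for similarities, learners of either gender who frequently used the “Abstract” button mostly selected keywords from the TF-IDF model. As for task execution order, both male and female learners followed a fixed retrieval procedure, working step by step from the first task to the last.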
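Regarding the fourth research question, male learners tended to rely on keywords generated by a single language model when searching, whereas female learners preferred a combined strategy. Further analysis of the relationship between keyword-use frequency and document relevance showed that for females, whether they used the TF-IDF, ST-AMB2, or Both (TF-IDF and ST-AMB2 combined) model, the results were concentrated on “moderately relevant” or “highly irrelevant” documents. For males, by contrast, results varied by model: frequent use of the ST-AMB2 model clearly raised the proportion of “highly relevant” and “moderately relevant” documents, whereas the TF-IDF model had a relatively limited effect on relevance. Under the Both model, the two genders showed the same trend, with both tending toward higher proportions of “moderately relevant” and “highly irrelevant” documents.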
These results show that different language models each have their own characteristics in extracting keywords from digital-learning literature, and that learners of different genders respond differently to the two language models. The results can serve as a reference for the design of future literature retrieval systems: developers can draw on behavioral differences between learners of different genders to offer different language-model options and keyword suggestions, so that learners can choose a suitable model according to their own needs, thereby strengthening the personalization of the retrieval system and the relevance of the documents it returns. ;Language models enable computers to understand and produce human language and have been applied in various domains. In information retrieval, two main types of language model are commonly used: traditional models, such as Term Frequency–Inverse Document Frequency (TF-IDF), and large language models, such as SentenceTransformer/all-mpnet-base-v2 (ST-AMB2). Because these two types differ in their underlying representations, they often return different sets of documents. In addition, individual differences among users shape search behavior; prior work identifies gender as one such factor. However, most of these works emphasize search behavior itself, and few studies examine how gender differences relate to users’ reactions to the choice of language model.
To address this gap, this research analyzed the fast-growing digital-learning literature from the dual perspectives of language modeling and gender differences. TF-IDF was chosen to represent traditional models and ST-AMB2 to represent large language models. Two studies were conducted. Study 1 (model aspect) compared TF-IDF and ST-AMB2 for retrieving digital-learning articles, examining performance, document distribution, and keyword characteristics; Study 2 (user feedback aspect) integrated the strengths of both models to build an information retrieval system and investigated the effects of gender differences on users’ interaction patterns, including button-use frequency, task execution order, and keyword preferences by model. In brief, these two studies addressed the following four research questions:
(1) What are the performance differences between TF-IDF and ST-AMB2 in identifying keywords for digital-learning literature?
(2) How do TF-IDF and ST-AMB2 differ in the keyword distributions for digital-learning literature?
(3) How do males and females differ in their interactions with a digital-learning literature retrieval system?
(4) Given model-provided keywords, how do males and females differ in their judgments of document relevance?
Regarding Research Question 1, TF-IDF showed stable precision but a higher false-negative rate (lower recall), whereas ST-AMB2 achieved higher recall at the cost of a higher false-positive rate (lower precision). These patterns suggested complementarity: TF-IDF favored precise lexical matching, while ST-AMB2 excelled at broader semantic coverage. Regarding Research Question 2, when a keyword appeared frequently in both title and abstract, both models extracted it reliably; when it appeared in full only in the title, TF-IDF extracted it accurately whereas ST-AMB2 tended to miss it because of its sparse occurrences. Conversely, when exact terms were absent but multiple semantically related expressions were present, ST-AMB2 retrieved the relevant vocabulary, compensating for TF-IDF’s limitations with incomplete terms. TF-IDF also favored fixed, domain-specific phrases, whereas ST-AMB2 more readily recognized variants with inconsistent surface forms; keywords rarely captured by either model often belonged to methodological jargon or cross-disciplinary terms. Overall, traditional and large language models exhibited a complementary relationship in digital-learning literature retrieval.
Regarding Research Question 3, male and female learners showed both differences and similarities in button-use frequency. As for differences, males who repeatedly clicked “Back” or returned to the home page were less likely to rate documents as “highly relevant” or “highly irrelevant.” Females who frequently pressed “Help” more often rated results as “moderately relevant.” In addition, when females frequently used the “Next” button, they tended to select ST-AMB2-generated keywords and more often judged documents as “highly irrelevant.” As for similarities, frequent use of the “Abstract” button was associated, for both genders, with selecting TF-IDF-generated keywords. Both groups also followed a consistent task sequence, progressing from the first to the last task.
Regarding Research Question 4, males tended to rely on keywords from a single model, whereas females preferred a combined strategy. As for the relationship between keyword-use frequency and perceived relevance, females more often produced “moderately relevant” or “highly irrelevant” results, regardless of whether they used TF-IDF, ST-AMB2, or Both (a TF-IDF + ST-AMB2 hybrid). In contrast, males showed model-sensitive outcomes: frequent use of ST-AMB2 increased the proportions of “highly relevant” and “moderately relevant” documents, whereas TF-IDF had a comparatively limited effect. Under the Both setting, females and males alike converged toward higher proportions of “moderately relevant” and “highly irrelevant” documents.
These results indicated that the two main types of model possessed distinct strengths in keyword identification for digital-learning literature and that male and female learners reacted differently to model-provided keywords. These findings can inform the design of future retrieval systems: developers can incorporate suitable language models and keyword recommendations tailored to user characteristics and observed behaviors, allowing learners to select models that fit their needs and thereby enhance personalization and the relevance of retrieved results. |