Abstract: | Deep learning has become a core force driving the development of artificial intelligence, and speech synthesis systems, as an important bridge for human-computer interaction, play a key role in improving machine understanding and natural interaction. This study explores the application of deep learning to speech synthesis systems, with detailed analysis and experiments on two scenarios with high application potential: nursing and warehousing. We selected and implemented Mini-MB-iSTFT-VITS (Mini Multi-Band Inverse-Short-Time-Fourier-Transform Variational Inference with adversarial learning for end-to-end Text-To-Speech). The model is obtained by reducing the number of hidden channels and layers in MB-iSTFT-VITS (Multi-Band Inverse-Short-Time-Fourier-Transform Variational Inference with adversarial learning for end-to-end Text-To-Speech), which significantly lowers the parameter count and speeds up inference. Experimental results show that the model trained on the self-built Waki corpus scored well on the subjective Mean Opinion Score (MOS) and on two objective measures, the Deep Noise Suppression Mean Opinion Score (DNSMOS) and the UTokyo-SaruLab Mean Opinion Score (UTMOS), indicating natural, high-quality synthesized speech. In the subjective Comparative Mean Opinion Score (CMOS) evaluation, Mini-MB-iSTFT-VITS was rated slightly below real speech but close to MB-iSTFT-VITS, indicating little difference in naturalness between the two models.
In the inference speed tests, Mini-MB-iSTFT-VITS showed a clear advantage regardless of the corpus used. Its Real-Time Factor (RTF) results indicated faster inference than MB-iSTFT-VITS, making it better suited to real-time applications.
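As an illustration of the speed metric, RTF is commonly defined as the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below assumes that definition; the abstract does not describe the measurement procedure actually used, and the names `rtf`, `measure_rtf`, and `dummy_synthesizer` are hypothetical stand-ins rather than part of any model's API.

```python
import time
from typing import Callable, Sequence


def rtf(elapsed_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Real-Time Factor: synthesis time divided by generated audio duration."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return elapsed_seconds / audio_seconds


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
) -> float:
    """Time one call to a synthesizer (hypothetical interface) and return its RTF."""
    start = time.perf_counter()
    waveform = synthesize(text)
    elapsed = time.perf_counter() - start
    return rtf(elapsed, len(waveform), sample_rate)


def dummy_synthesizer(text: str) -> Sequence[float]:
    """Placeholder for a real TTS model: returns one second of silence at 22050 Hz."""
    return [0.0] * 22050


if __name__ == "__main__":
    # 0.5 s of compute for 2 s of audio at 22050 Hz gives an RTF of 0.25.
    print(rtf(0.5, 44100, 22050))
    print(measure_rtf(dummy_synthesizer, "hello", 22050))
```

In practice the measurement would typically be averaged over many utterances, and comparisons between models are only meaningful on the same hardware.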
In the nursing scenario, speech synthesis can significantly improve the efficiency of nursing staff, reduce operational errors, and help ensure that patients receive timely and accurate care. In this paper, we developed a nursing system that lets nurses set up voice reminders for nursing tasks, so that they can devote more attention to the task at hand. Because much of a nurse's time is spent recording completed nursing actions, the system also writes the reminder time and the nursing action to be performed to the nursing records in our nursing database whenever a reminder is set. This can substantially reduce nurses' working hours and further lighten their workload.
In the warehousing scenario, speech synthesis can improve workflow efficiency, help employees complete daily operations more quickly, and reduce errors caused by human factors. In this paper, we developed a warehousing system that generates speech after a QR code is scanned: generating three segments of speech takes 1.59 seconds on average, and generating one segment takes 0.51 seconds on average. This indicates that the system can generate speech in real time to help pickers avoid errors caused by human factors. The system also sends information to the warehousing database when a picker starts and finishes picking, which allows warehouse managers to improve overall operational efficiency. |