dc.description.abstract | Deep learning has advanced rapidly in the field of speech processing. Voice conversion is an innovative application of speech processing techniques: it transforms the acoustic features of one speaker's voice into those of another while leaving the linguistic content unchanged, yielding synthesized speech that is more natural, vivid, and diverse. This technology plays a crucial role in applications such as automated speech systems and virtual characters, and it also serves emotional expression: by altering vocal characteristics it can convey different emotions, enhancing the user experience.
Take music as an example: it is a universally present art form with powerful emotional impact. Music not only provides entertainment but also fosters communication and reflects societal values and culture. At the same time, music carries a wealth of information, and post-production plays a vital role in ensuring its quality. This paper therefore centers on music conversion technology, conducting experiments and technical exploration of its various applications in music processing.
Music conversion technology, however, faces several challenges, including synthesizing high-quality speech and accurately extracting and converting vocal characteristics. While large-scale language models are effective at feature extraction, deploying them on resource-constrained devices requires lightweight model design.
Hence, this study delves deeply into "Post-Production in Music and Low-Latency Singing Conversion," using modern technology to overcome the constraints of traditional music production. Through disentangled representation learning, acoustic information is analyzed in detail to generate the converted audio. To meet low-latency requirements, we replace traditional recurrent neural networks with temporal convolutional networks, achieving fast conversion at low computational cost. Experimental results show that this architecture outperforms traditional recurrent neural networks in both conversion quality and processing speed.
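The abstract does not specify the network internals; as a minimal sketch of the mechanism behind a temporal convolutional network, the following illustrates a causal 1-D convolution, whose output at frame t depends only on frames up to t, so each frame can be emitted as soon as its inputs arrive (all function names, kernel values, and dilation settings here are illustrative assumptions, not the thesis's implementation).

```python
import numpy as np

def causal_conv1d(x, kernel, dilation=1):
    """Causal 1-D convolution: output[t] uses only x[t], x[t-d], x[t-2d], ...
    Left-pads with zeros so the output has the same length as the input."""
    k = len(kernel)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([
        sum(kernel[i] * xp[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

# A TCN stacks such layers with growing dilation (1, 2, 4, ...), so the
# receptive field grows exponentially with depth while inference stays
# feed-forward, unlike an RNN's sequential state updates.
x = np.arange(8, dtype=float)
y = causal_conv1d(x, kernel=[0.5, 0.5], dilation=1)  # moving average of x[t], x[t-1]
```

Because every output frame is a fixed-depth feed-forward computation rather than a recurrence, such layers parallelize across time during training and keep per-frame latency bounded at inference, which is the property the low-latency claim above relies on.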
By introducing such low-latency vocal conversion into music production, we aim to correct singer intonation issues, improve music quality and production efficiency, save studio professionals time and resources, and make music production more flexible. Moreover, the continued advancement of music technology deepens our understanding of its potential and limitations, offering fresh inspiration for future music production tools and methods. | en_US |