References
[1] Philipos C Loizou. Speech Enhancement: Theory and Practice. CRC Press, 2007.
[2] Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition.
Prentice-Hall, Inc., 1993.
[3] Sebastian Ewert et al. “Score-informed source separation for musical audio
recordings: An overview”. In: IEEE Signal Processing Magazine 31.3 (2014),
pp. 116–124.
[4] David Murray, Lina Stankovic, and Vladimir Stankovic. “An electrical load
measurements dataset of United Kingdom households from a two-year longitudinal study". In: Scientific Data 4.1 (2017), pp. 1–12.
[5] Yahaya Isah Shehu et al. "Sokoto Coventry Fingerprint Dataset". In: arXiv preprint
arXiv:1807.10609 (2018).
[6] Christopher J Shallue and Andrew Vanderburg. “Identifying exoplanets with
deep learning: A five-planet resonant chain around Kepler-80 and an eighth planet around Kepler-90". In: The Astronomical Journal 155.2 (2018), p. 94.
[7] Ichrak Toumi, Stefano Caldarelli, and Bruno Torrésani. “A review of blind
source separation in NMR spectroscopy". In: Progress in Nuclear Magnetic Resonance Spectroscopy 81 (2014), pp. 37–64.
[8] Tuomas Virtanen. “Speech recognition using factorial hidden Markov models
for separation in the feature space”. In: ICSLP. 2006.
[9] Simon Arberet et al. "Blind spectral-GMM estimation for underdetermined instantaneous audio source separation". In: International Conference on Independent Component Analysis and Signal Separation. Springer. 2009, pp. 751–758.
[10] Seungjin Choi et al. "Blind source separation and independent component analysis: A review". In: Neural Information Processing – Letters and Reviews 6.1 (2005), pp. 1–57.
[11] Guoning Hu and DeLiang Wang. “A tandem algorithm for pitch estimation and
voiced speech segregation”. In: TASLP 18.8 (2010), pp. 2067–2079.
[12] Ke Hu and DeLiang Wang. “An unsupervised approach to cochannel speech
separation”. In: IEEE Transactions on audio, speech, and language processing 21.1
(2012), pp. 122–131.
[13] Joseph Keshet and Samy Bengio. "Spectral Clustering for Speech Separation". 2009.
[14] Guoning Hu and DeLiang Wang. "Monaural speech segregation based on pitch
tracking and amplitude modulation". In: IEEE Transactions on Neural Networks
15.5 (2004), pp. 1135–1150.
[15] Daniel D Lee and H Sebastian Seung. "Algorithms for non-negative matrix factorization". In: Advances in Neural Information Processing Systems. 2001, pp. 556–562.
[16] Tuomas Virtanen. “Monaural sound source separation by nonnegative matrix
factorization with temporal continuity and sparseness criteria". In: IEEE Transactions on Audio, Speech, and Language Processing 15.3 (2007), pp. 1066–1074.
[17] Paris Smaragdis. “Convolutive speech bases and their application to supervised
speech separation”. In: IEEE Transactions on Audio, Speech, and Language Processing 15.1 (2006), pp. 1–12.
[18] Umut Şimşekli, Jonathan Le Roux, and John R Hershey. "Non-negative source-filter dynamical system for speech enhancement". In: ICASSP. IEEE. 2014, pp. 6206–6210.
[19] Tomas Kounovsky and Jiri Malek. “Single channel speech enhancement using
convolutional neural network”. In: ECMSM. IEEE. 2017, pp. 1–5.
[20] Geoffrey E Hinton and Ruslan R Salakhutdinov. “Reducing the dimensionality
of data with neural networks". In: Science 313.5786 (2006), pp. 504–507.
[21] Shahla Parveen and Phil Green. "Speech enhancement with missing data techniques using recurrent neural networks". In: ICASSP. Vol. 1. IEEE. 2004, pp. I–733.
[22] Xugang Lu et al. “Speech enhancement based on deep denoising autoencoder.”
In: Interspeech. 2013, pp. 436–440.
[23] Yong Xu et al. “A regression approach to speech enhancement based on deep
neural networks”. In: IEEE/ACM TASLP 23.1 (2014), pp. 7–19.
[24] John R Hershey et al. “Deep clustering: Discriminative embeddings for segmentation and separation”. In: ICASSP. IEEE. 2016, pp. 31–35.
[25] Zhong-Qiu Wang, Jonathan Le Roux, and John R Hershey. “Alternative objective functions for deep clustering”. In: ICASSP. IEEE. 2018, pp. 686–690.
[26] Yusuf Isik et al. “Single-channel multi-speaker separation using deep clustering”. In: arXiv preprint arXiv:1607.02173 (2016).
[27] Zhong-Qiu Wang et al. “End-to-end speech separation with unfolded iterative
phase reconstruction”. In: arXiv preprint arXiv:1804.10204 (2018).
[28] Yanliang Jin et al. “Multi-Head Self-Attention-Based Deep Clustering for
Single-Channel Speech Separation”. In: IEEE Access 8 (2020), pp. 100013–100021.
[29] Zhuo Chen, Yi Luo, and Nima Mesgarani. "Deep attractor network for single-microphone speaker separation". In: ICASSP. IEEE. 2017, pp. 246–250.
[30] Dong Yu et al. "Permutation invariant training of deep models for speaker-independent multi-talker speech separation". In: ICASSP. IEEE. 2017, pp. 241–245.
[31] Morten Kolbæk et al. "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.10 (2017), pp. 1901–1913.
[32] Hakan Erdogan et al. "Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks". In: ICASSP. IEEE. 2015, pp. 708–712.
[33] Donald S Williamson, Yuxuan Wang, and DeLiang Wang. "Complex ratio masking for monaural speech separation". In: IEEE/ACM TASLP 24.3 (2015), pp. 483–492.
[34] Yusuf Isik et al. "Single-Channel Multi-Speaker Separation Using Deep Clustering".
In: Interspeech. 2016.
[35] Yi Luo et al. "Deep clustering and conventional networks for music separation:
Stronger together”. In: ICASSP. IEEE. 2017, pp. 61–65.
[36] Yi Luo and Nima Mesgarani. "TasNet: Time-domain audio separation network for real-time, single-channel speech separation". In: ICASSP. IEEE. 2018, pp. 696–700.
[37] Yi Luo and Nima Mesgarani. "Conv-TasNet: Surpassing ideal time–frequency
magnitude masking for speech separation”. In: IEEE/ACM TASLP 27.8 (2019),
pp. 1256–1266.
[38] J. Wang et al. “Tune-In: Training Under Negative Environments with Interference for Attention Networks Simulating Cocktail Party Effect”. In: AAAI. 2021.
[39] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling". In: arXiv preprint arXiv:1803.01271 (2018).
[40] Max W. Y. Lam et al. "Mixup-breakdown: A consistency training method for improving generalization of speech separation models". In: ICASSP. IEEE. 2020.
[41] Neil Zeghidour and David Grangier. “Wavesplit: End-to-end speech separation
by speaker clustering”. In: arXiv preprint arXiv:2002.08933 (2020).
[42] Ziqiang Shi et al. "FurcaNet: An end-to-end deep gated convolutional, long short-term memory, deep neural networks for single channel speech separation". In: arXiv preprint arXiv:1902.00651 (2019).
[43] Yi Luo, Zhuo Chen, and Takuya Yoshioka. "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation". In: ICASSP. IEEE. 2020, pp. 46–50.
[44] Jingjing Chen, Qirong Mao, and Dong Liu. "Dual-path transformer network: Direct context-aware modeling for end-to-end monaural speech separation". In: arXiv preprint arXiv:2007.13975 (2020).
[45] Cem Subakan et al. “Attention is all you need in speech separation”. In: ICASSP.
IEEE. 2021, pp. 21–25.
[46] Yi Luo, Cong Han, and Nima Mesgarani. "Ultra-lightweight speech separation via group communication". In: ICASSP. IEEE. 2021.
[47] Yi Luo, Cong Han, and Nima Mesgarani. “Group Communication With Context Codec for Lightweight Source Separation”. In: IEEE/ACM TASLP 29 (2021),
pp. 1752–1761.
[48] Max W. Y. Lam et al. "Sandglasset: A Light Multi-Granularity Self-Attentive Network for Time-Domain Speech Separation". In: ICASSP. IEEE. 2021, pp. 5759–5763.
[49] Alexandre Défossez et al. “Music source separation in the waveform domain”.
In: arXiv preprint arXiv:1911.13254 (2019).
[50] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. "U-Net: Convolutional networks for biomedical image segmentation". In: MICCAI. Springer. 2015, pp. 234–241.
[51] Laurent Sifre and Stéphane Mallat. "Rigid-motion scattering for image classification". In: arXiv preprint arXiv:1403.1687 (2014).
[52] Andrew G. Howard et al. "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications". In: arXiv preprint arXiv:1704.04861 (2017).
[53] Mark Sandler et al. "MobileNetV2: Inverted residuals and linear bottlenecks".
In: CVPR. 2018, pp. 4510–4520.
[54] Andrew Howard et al. "Searching for MobileNetV3". In: ICCV. 2019, pp. 1314–1324.
[55] Liang-Chieh Chen et al. "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs". In: TPAMI
40.4 (2017), pp. 834–848.
[56] Liang-Chieh Chen et al. “Rethinking atrous convolution for semantic image
segmentation”. In: arXiv preprint arXiv:1706.05587 (2017).
[57] Liang-Chieh Chen et al. “Encoder-decoder with atrous separable convolution
for semantic image segmentation”. In: ECCV. 2018, pp. 801–818.
[58] ITU-T Recommendation. “Perceptual evaluation of speech quality (PESQ): An
objective method for end-to-end speech quality assessment of narrow-band
telephone networks and speech codecs". In: Rec. ITU-T P.862 (2001).
[59] Cees H Taal et al. “An algorithm for intelligibility prediction of time–frequency
weighted noisy speech”. In: IEEE Transactions on Audio, Speech, and Language
Processing 19.7 (2011), pp. 2125–2136.
[60] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte. "Performance measurement in blind audio source separation". In: IEEE Transactions on Audio, Speech, and Language Processing 14.4 (2006), pp. 1462–1469.
[61] Trausti Kristjansson, Hagai Attias, and John Hershey. “Single microphone
source separation using high resolution signal reconstruction". In: ICASSP. Vol. 2. IEEE. 2004, pp. ii–817.
[62] Aarthi M Reddy and Bhiksha Raj. "A minimum mean squared error estimator for single channel speaker separation". In: ICSLP. 2004.
[63] Ameya N Deoras and Mark Hasegawa-Johnson. "A factorial HMM approach to
simultaneous recognition of isolated digits spoken by multiple talkers on one
audio channel”. In: ICASSP. Vol. 1. IEEE. 2004, pp. I–861.
[64] Stanley Smith Stevens, John Volkmann, and Edwin Broomell Newman. “A scale
for the measurement of the psychological magnitude pitch". In: The Journal of the Acoustical Society of America 8.3 (1937), pp. 185–190.
[65] Brian R Glasberg and Brian CJ Moore. “Derivation of auditory filter shapes from
notched-noise data". In: Hearing Research 47.1-2 (1990), pp. 103–138.
[66] Yuxuan Wang, Arun Narayanan, and DeLiang Wang. “On training targets for
supervised speech separation". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 22.12 (2014), pp. 1849–1858.
[67] Arun Narayanan and DeLiang Wang. “Ideal ratio mask estimation using deep
neural networks for robust speech recognition". In: ICASSP. IEEE. 2013, pp. 7092–7096.
[68] Mikkel N Schmidt and Rasmus K Olsson. “Single-channel speech separation
using sparse non-negative matrix factorization". In: ICSLP. 2006.
[69] Emad M Grais, Mehmet Umut Sen, and Hakan Erdogan. "Deep neural networks for single channel source separation". In: ICASSP. IEEE. 2014, pp. 3734–3738.
[70] Emad M Grais et al. “Single-channel audio source separation using deep neural
network ensembles”. In: Audio Engineering Society Convention 140. Audio Engineering Society. 2016.
[71] Po-Sen Huang et al. “Joint optimization of masks and deep recurrent neural
networks for monaural source separation". In: IEEE/ACM TASLP 23.12 (2015),
pp. 2136–2147.
[72] Meng Li et al. “Multi-layer Attention Mechanism Based Speech Separation
Model”. In: IEEE 19th ICCT. 2019, pp. 506–509.
[73] Sam Roweis. "One microphone source separation". In: Advances in Neural Information Processing Systems 13 (2000).
[74] Ozgur Yilmaz and Scott Rickard. “Blind separation of speech mixtures via
time-frequency masking". In: IEEE Transactions on Signal Processing 52.7 (2004),
pp. 1830–1847.
[75] Li Li and Hirokazu Kameoka. "Deep clustering with gated convolutional networks". In: ICASSP. IEEE. 2018, pp. 16–20.
[76] Yuzhou Liu and DeLiang Wang. "Causal Deep CASA for Monaural Talker-Independent Speaker Separation". In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), pp. 2109–2118.
[77] Yann N Dauphin et al. "Language modeling with gated convolutional networks". In: International Conference on Machine Learning. 2017, pp. 933–941.
[78] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. "Dynamic routing between capsules". In: Advances in Neural Information Processing Systems. 2017, pp. 3856–3866.
[79] Peter Kabal. “TSP speech database”. In: McGill University, Database Version 1.0
(2002), pp. 09–02.
[80] John S Garofolo. "TIMIT acoustic-phonetic continuous speech corpus". In: Linguistic Data Consortium (1993).
[81] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014).
[82] Shrikant Venkataramani, Cem Subakan, and Paris Smaragdis. "Neural network alternatives to convolutive audio models for source separation". In: MLSP. IEEE. 2017, pp. 1–6.
[83] A. Gang, P. Biyani, and A. Soni. “Towards Automated Single Channel Source
Separation using Neural Networks”. In: arXiv preprint arXiv:1806.08086 (2018).
[84] Po-Sen Huang et al. "Deep learning for monaural speech separation". In: ICASSP.
IEEE. 2014, pp. 1562–1566.
[85] Y. Sun et al. “Monaural source separation based on adaptive discriminative criterion in neural networks”. In: DSP. IEEE. 2017, pp. 1–5.
[86] S. Qin et al. “Graph Convolution-Based Deep Clustering for Speech Separation”. In: IEEE Access 8 (2020), pp. 82571–82580.
[87] Yannan Wang et al. “A gender mixture detection approach to unsupervised single-channel speech separation based on deep neural networks”. In:
IEEE/ACM Transactions on Audio, Speech, and Language Processing 25.7 (2017),
pp. 1535–1546.
[88] Yuzhou Liu and DeLiang Wang. "A CASA approach to deep learning based speaker-independent co-channel speech separation". In: ICASSP. IEEE. 2018, pp. 5399–5403.
[89] Ali Feizollah et al. “Comparative study of k-means and mini batch k-means
clustering algorithms in Android malware detection using network traffic analysis". In: 2014 International Symposium on Biometrics and Security Technologies (ISBAST). IEEE. 2014, pp. 193–197.
[90] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. Tech. rep. Stanford, 2006.
[91] Christian Buchta et al. "Spherical k-means clustering". In: Journal of Statistical Software 50.10 (2012), pp. 1–22.
[92] Sujan Kumar Roy, Aaron Nicolson, and Kuldip K Paliwal. "DeepLPC-MHANet: Multi-Head Self-Attention for Augmented Kalman Filter-Based Speech Enhancement". In: IEEE Access 9 (2021), pp. 70516–70530.
[93] Kuldip Paliwal, Kamil Wójcicki, and Belinda Schwerin. “Single-channel speech
enhancement using spectral subtraction in the short-time modulation domain”.
In: Speech Communication 52.5 (2010), pp. 450–475.
[94] Jimmy Ba and Rich Caruana. "Do Deep Nets Really Need to be Deep?" In: NIPS.
2014.
[95] Geoffrey E Hinton, Oriol Vinyals, and Jeff Dean. "Distilling the Knowledge in a Neural Network". In: arXiv preprint arXiv:1503.02531 (2015).
[96] Emilio Parisotto, Jimmy Ba, and Ruslan Salakhutdinov. "Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning". In: arXiv preprint arXiv:1511.06342 (2016).
[97] Ying Zhang et al. “Deep mutual learning”. In: CVPR. 2018, pp. 4320–4328.
[98] Yifang Yin et al. "Enhanced Audio Tagging via Multi- to Single-Modal Teacher-Student Mutual Learning". In: AAAI. Vol. 35. 12. 2021, pp. 10709–10717.
[99] Ryo Masumura et al. "End-to-End Automatic Speech Recognition with Deep Mutual Learning". In: APSIPA. IEEE. 2020, pp. 632–637.
[100] Jonathan Le Roux et al. "SDR – half-baked or well done?" In: ICASSP. IEEE. 2019,
pp. 626–630.
[101] Ryo Aihara et al. “Teacher-student deep clustering for low-delay single channel
speech separation”. In: ICASSP. IEEE. 2019, pp. 690–694.
[102] Jianping Gou et al. “Knowledge distillation: A survey”. In: IJCV 129.6 (2021),
pp. 1789–1819.
[103] Duc-Quang Vu, Ngan Le, and Jia-Ching Wang. "Teaching Yourself: A Self-Knowledge Distillation Approach to Action Recognition". In: IEEE Access 9
(2021), pp. 105711–105723.
[104] Duc-Quang Vu, Jia-Ching Wang, et al. "A Novel Self-Knowledge Distillation
Approach with Siamese Representation Learning for Action Recognition”. In:
VCIP. IEEE. 2021, pp. 1–5.
[105] Yi Luo, Zhuo Chen, and Nima Mesgarani. "Speaker-independent speech separation with deep attractor network". In: IEEE/ACM TASLP 26.4 (2018), pp. 787–796.
[106] Liwen Zhang et al. "FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks". In: MMM. Springer. 2020, pp. 653–665.
[107] Shrikant Venkataramani, Jonah Casebeer, and Paris Smaragdis. "End-to-end source separation with adaptive front-ends". In: ACSSC. IEEE. 2018, pp. 684–688.
[108] Efthymios Tzinis, Zhepei Wang, and Paris Smaragdis. "Sudo rm -rf: Efficient
networks for universal audio source separation”. In: MLSP. IEEE. 2020, pp. 1–6.
[109] Kristen Grauman and Trevor Darrell. “The pyramid match kernel: Discriminative classification with sets of image features”. In: ICCV. Vol. 2. IEEE. 2005,
pp. 1458–1465.
[110] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. "Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories”. In: CVPR. Vol. 2. IEEE. 2006,
pp. 2169–2178.
[111] John Garofolo et al. Continuous Speech Recognition (CSR-I) Wall Street Journal (WSJ0) news, complete. Linguistic Data Consortium, Philadelphia (1993).
[112] Efthymios Tzinis et al. “Two-step sound source separation: Training on learned
latent targets". In: ICASSP. IEEE. 2020, pp. 31–35.