References
[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by
back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.
[2] M. Jaderberg, W. M. Czarnecki, S. Osindero, et al., “Decoupled neural interfaces
using synthetic gradients,” in International Conference on Machine Learning, PMLR,
2017, pp. 1627–1635.
[3] W. M. Czarnecki, G. Świrszcz, M. Jaderberg, S. Osindero, O. Vinyals, and K.
Kavukcuoglu, “Understanding synthetic gradients and decoupled neural interfaces,”
in International Conference on Machine Learning, PMLR, 2017, pp. 904–912.
[4] D.-H. Lee, S. Zhang, A. Fischer, and Y. Bengio, “Difference target propagation,” in
Machine Learning and Knowledge Discovery in Databases: European Conference,
ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part I
15, Springer, 2015, pp. 498–515.
[5] D. Y. Wu, D. Lin, V. Chen, and H.-H. Chen, “Associated learning: An alternative to
end-to-end backpropagation that works on CNN, RNN, and Transformer,” in
International Conference on Learning Representations, 2021.
[6] Y.-W. Kao and H.-H. Chen, “Associated learning: Decomposing end-to-end backpropagation based on autoencoders and target propagation,” Neural Computation,
vol. 33, no. 1, pp. 174–193, 2021.
[7] C.-Y. Chuang, J. Robinson, Y.-C. Lin, A. Torralba, and S. Jegelka, “Debiased
contrastive learning,” Advances in Neural Information Processing Systems, vol. 33,
pp. 8765–8775, 2020.
[8] C.-K. Wang, “利用 SCPL 分解端到端倒傳遞演算法 [Decomposing end-to-end
backpropagation with SCPL],” M.S. thesis, National Central University, 2022.
[9] C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl,
“Measuring the effects of data parallelism on neural network training,” arXiv preprint
arXiv:1811.03600, 2018.
[10] T. Vogels, S. P. Karimireddy, and M. Jaggi, “PowerSGD: Practical low-rank gradient
compression for distributed optimization,” Advances in Neural Information
Processing Systems, vol. 32, 2019.
[11] Y. Huang, Y. Cheng, A. Bapna, et al., “GPipe: Efficient training of giant neural
networks using pipeline parallelism,” Advances in Neural Information Processing
Systems, vol. 32, 2019.
[12] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale
image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[13] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2016, pp. 770–778.
[14] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE
Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[15] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation,
vol. 9, no. 8, pp. 1735–1780, 1997.
[16] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” Advances
in Neural Information Processing Systems, vol. 30, 2017.
[17] D. Narayanan, A. Harlap, A. Phanishayee, et al., “PipeDream: Generalized pipeline
parallelism for DNN training,” in Proceedings of the 27th ACM Symposium on
Operating Systems Principles, 2019, pp. 1–15.
[18] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro,
“Megatron-LM: Training multi-billion parameter language models using model
parallelism,” arXiv preprint arXiv:1909.08053, 2019.