References
[1] Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. ETC: Encoding long and structured inputs in transformers. arXiv preprint arXiv:2004.08483, 2020.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[3] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
[4] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[5] R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
[8] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
[9] Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. Adaptively sparse transformers. arXiv preprint arXiv:1909.00015, 2019.
[10] T. Dao, A. Gu, M. Eichhorn, A. Rudra, and C. Ré. Learning fast algorithms for linear transforms using butterfly factorizations. In International Conference on Machine Learning, pages 1517–1527. PMLR, 2019.
[11] G. H. Golub and C. Reinsch. Singular value decomposition and least squares solutions. Numerische Mathematik, 14(5):403–420, 1970.
[12] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[13] Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. FlowFormer: A transformer architecture for optical flow. In European Conference on Computer Vision, pages 668–685. Springer Nature Switzerland, 2022.
[14] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
[16] R. Khalitov, T. Yu, L. Cheng, and Z. Yang. Sparse factorization of square matrices with application to neural attention modeling. Neural Networks, 152:160–168, 2022.
[17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] N. Kitaev, L. Kaiser, and A. Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
[19] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[20] Daniel Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, volume 13, pages 556–562, 2001.
[21] J. Lee-Thorp, J. Ainslie, I. Eckstein, and S. Ontanon. FNet: Mixing tokens with Fourier transforms. arXiv preprint arXiv:2105.03824, 2021.
[22] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, et al. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Advances in Neural Information Processing Systems, volume 32, 2019.
[23] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Interactive image segmentation with latent diversity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6879–6888, 2018.
[24] D. Linsley, J. Kim, V. Veerabadran, C. Windolf, and T. Serre. Learning long-range spatial dependencies with horizontal gated recurrent units. In Advances in Neural Information Processing Systems, volume 31, 2018.
[25] H. Liu, Z. Li, D. Hall, P. Liang, and T. Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint arXiv:2305.14342, 2023.
[26] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
[28] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, 2011. Association for Computational Linguistics.
[29] Michael W. Mahoney and Petros Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
[30] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, 2014.
[31] N. Nangia and S. R. Bowman. ListOps: A diagnostic dataset for latent tree learning. arXiv preprint arXiv:1804.06028, 2018.
[32] A. Prabhu, A. Farhadi, M. Rastegari, et al. Butterfly transform: An efficient FFT-based neural architecture design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12024–12033, 2020.
[33] Rui Qian, Robby T. Tan, Wenhan Yang, Jiajun Su, and Jiaying Liu. Attentive generative adversarial network for raindrop removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[34] Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosFormer: Rethinking softmax in attention. arXiv preprint arXiv:2202.08791, 2022.
[36] Suman Sapkota and Binod Bhattarai. Dimension mixer: A generalized method for structured sparsity in deep neural networks. arXiv preprint arXiv:2311.18735, 2023.
[37] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
[38] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3531–3539, 2021.
[39] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM computer communication review, 31(4):149–160, 2001.
[47] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
[48] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.
[49] Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. arXiv preprint arXiv:2102.03902, 2021.
[50] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[51] T. Yu, R. Khalitov, L. Cheng, and Z. Yang. Paramixer: Parameterizing mixing links in sparse factors works better than dot-product self-attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 691–700, 2022.
[52] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
[53] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International Conference on Machine Learning, pages 7354–7363, 2019.