References
[1] Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. ETC: Encoding long and structured inputs in transformers. arXiv preprint arXiv:2004.08483, 2020.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[3] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
[4] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[5] R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
[8] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
[9] Gonçalo M. Correia, Vlad Niculae, and André F. T. Martins. Adaptively sparse transformers. arXiv preprint arXiv:1909.00015, 2019.
[10] T. Dao, A. Gu, M. Eichhorn, A. Rudra, and C. Ré. Learning fast algorithms for linear transforms using butterfly factorizations. In International Conference on Machine Learning, pages 1517–1527. PMLR, 2019.
[11] G. H. Golub and C. Reinsch. Singular value decomposition and least squares solutions. Numerische Mathematik, 14(5):403–420, 1970.
[12] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[13] Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. FlowFormer: A transformer architecture for optical flow. In European Conference on Computer Vision, pages 668–685. Springer Nature Switzerland, 2022.
[14] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020.
[16] R. Khalitov, T. Yu, L. Cheng, and Z. Yang. Sparse factorization of square matrices with application to neural attention modeling. Neural Networks, 152:160–168, 2022.
[17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] N. Kitaev, L. Kaiser, and A. Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
[19] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[20] Daniel Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, volume 13, pages 556–562, 2001.
[21] J. Lee-Thorp, J. Ainslie, I. Eckstein, and S. Ontanon. FNet: Mixing tokens with Fourier transforms. arXiv preprint arXiv:2105.03824, 2021.
[22] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, et al. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Advances in Neural Information Processing Systems, volume 32, 2019.
[23] Zhuwen Li, Qifeng Chen, and Vladlen Koltun. Interactive image segmentation with latent diversity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6879–6888, 2018.
[24] D. Linsley, J. Kim, V. Veerabadran, C. Windolf, and T. Serre. Learning long-range spatial dependencies with horizontal gated recurrent units. In Advances in Neural Information Processing Systems, volume 31, 2018.
[25] H. Liu, Z. Li, D. Hall, P. Liang, and T. Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint arXiv:2305.14342, 2023.
[26] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
[28] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, 2011. Association for Computational Linguistics.
[29] Michael W. Mahoney and Petros Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
[30] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, 2014.
[31] N. Nangia and S. R. Bowman. ListOps: A diagnostic dataset for latent tree learning. arXiv preprint arXiv:1804.06028, 2018.
[32] A. Prabhu, A. Farhadi, M. Rastegari, et al. Butterfly transform: An efficient FFT-based neural architecture design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12024–12033, 2020.
[33] Rui Qian, Robby T. Tan, Wenhan Yang, Jiajun Su, and Jiaying Liu. Attentive generative adversarial network for raindrop removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[34] Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosFormer: Rethinking softmax in attention. arXiv preprint arXiv:2202.08791, 2022.
[36] Suman Sapkota and Binod Bhattarai. Dimension mixer: A generalized method for structured sparsity in deep neural networks. arXiv preprint arXiv:2311.18735, 2023.
[37] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
[38] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3531–3539, 2021.
[39] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM computer communication review, 31(4):149–160, 2001.
[47] S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
[48] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, pages 682–688, 2001.
[49] Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. Nyströmformer: A Nyström-based algorithm for approximating self-attention. arXiv preprint arXiv:2102.03902, 2021.
[50] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[51] T. Yu, R. Khalitov, L. Cheng, and Z. Yang. Paramixer: Parameterizing mixing links in sparse factors works better than dot-product self-attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 691–700, 2022.
[52] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
[53] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International Conference on Machine Learning, pages 7354–7363, 2019.