Abstract (English)
Winning or losing in baseball is a complex and dynamic problem influenced by many factors, such as player performance, team strength, and the playing field. Past analyses of this problem did not use time series models, so this study attempts to apply this type of model to the data.
The data used in this study were obtained from the Baseball Reference website and comprise pitcher and batter statistics for each team from 2011 to 2022. After preprocessing, the study focused on the 2013 to 2022 seasons, excluding 2020, and the data were segmented by individual game. The main objective was to use historical game data to predict the outcomes of future games. The study then presents the test results and analyzes and discusses the factors that influence the prediction outcomes.
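As a rough illustration of the game-by-game segmentation described above, the sketch below builds sliding windows of six past games per team and pairs each window with the outcome of the following game. The file name, column names, and window construction are assumptions made for this example, not the study's actual schema.

```python
import numpy as np
import pandas as pd

# Hypothetical per-game table: one row per team-game with aggregated pitcher
# and batter statistics plus a binary win/loss label. The file name and column
# names are illustrative only, not the study's actual schema.
games = pd.read_csv("team_games_2013_2022.csv")
games = games[games["season"] != 2020]            # 2020 season excluded
games = games.sort_values(["team", "date"])       # chronological order per team

WINDOW = 6  # predict one game from the previous six games

def make_windows(team_df, feature_cols, window=WINDOW):
    """Turn one team's chronologically ordered games into (X, y) samples."""
    feats = team_df[feature_cols].to_numpy(dtype="float32")
    labels = team_df["win"].to_numpy(dtype="float32")
    X, y = [], []
    for i in range(window, len(team_df)):
        X.append(feats[i - window:i])   # statistics of the six preceding games
        y.append(labels[i])             # outcome of the current game
    return np.array(X), np.array(y)

feature_cols = [c for c in games.columns if c not in ("team", "date", "season", "win")]
X_parts, y_parts = zip(*(make_windows(g, feature_cols) for _, g in games.groupby("team")))
X, y = np.concatenate(X_parts), np.concatenate(y_parts)
print(X.shape, y.shape)  # (num_samples, 6, num_features), (num_samples,)
```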
In this study, three time series models, namely the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU), were trained and their results evaluated.
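The following is a minimal sketch of how the three architectures could be instantiated with TensorFlow/Keras for binary win/loss prediction; the layer sizes, optimizer, and metrics shown here are illustrative assumptions rather than the study's actual configuration.

```python
import tensorflow as tf

def build_model(cell="lstm", window=6, num_features=32, units=64):
    """Binary win/loss classifier over a window of six past games.
    Unit count, optimizer, and metrics are illustrative assumptions."""
    layer = {"rnn": tf.keras.layers.SimpleRNN,
             "lstm": tf.keras.layers.LSTM,
             "gru": tf.keras.layers.GRU}[cell]
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window, num_features)),
        layer(units),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
    return model

models = {name: build_model(name) for name in ("rnn", "lstm", "gru")}
```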
The final results were compared with respect to the presence or absence of feature selection, different model architectures, and data formats. The best-performing configuration used the LSTM architecture without feature selection, with the model predicting the outcome of one game from the previous six games. In this setting, the accuracy was around 57% and the area under the ROC curve was around 52%.
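A hedged sketch of how such a comparison could be evaluated on held-out games, reusing the windowed data (X, y) and the build_model helper from the sketches above; the split ratio and training settings are assumptions for illustration only.

```python
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# X, y come from the windowing sketch; build_model from the architecture sketch.
# shuffle=False keeps the time order, so later games form the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

for name in ("rnn", "lstm", "gru"):
    model = build_model(name, num_features=X_train.shape[-1])
    model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)
    probs = model.predict(X_test, verbose=0).ravel()
    preds = (probs >= 0.5).astype(int)
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f}, "
          f"ROC AUC={roc_auc_score(y_test, probs):.3f}")
```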