Abstract (English) |
With the extensive use of virtualization technology, a single physical machine is often virtualized into several virtual machines that host multiple servers. However, virtualization introduces a single point of failure: when a physical machine fails, every virtual machine running on it is affected. A general-purpose computer has limited means of error detection because it is connected only through the network. In contrast, ATCA (Advanced Telecommunications Computing Architecture) physical machines provide high hardware availability and support IPMI (Intelligent Platform Management Interface), which can quickly detect hardware status. Building on these two properties, we aim to construct a complete High Availability (HA) cluster that serves as a fault-tolerant system. Such clusters use high-availability software to harness redundant computers in groups that continue to provide service when system components fail. The hardware-assisted error detection of ATCA servers quickly detects errors and triggers the corresponding response mechanisms, reducing service downtime. To make the system ready for online use, we apply software engineering methodology. In the analysis phase, we add the parts the system was missing. In the design phase, we use structured system design. In the implementation phase, we follow code review practices to keep the code readable. In the testing phase, we not only write unit tests for important functions but also assist the test team with automated testing. Together, these steps aim at a complete, quality-assured, fault-tolerant HA cluster system. |