dc.description.abstract | Coronavirus disease 2019 (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus. COVID-19 originated in Wuhan, China, in early December 2019, and the epidemic quickly spread worldwide. It spreads through human-to-human transmission. SARS-CoV-2 spreads rapidly and can cause severe symptoms after infection, and it has had a great impact on the world. In the absence of an adequate vaccine, significant medical resources and policies that limit human movement and contact, such as restrictions on gatherings, are needed to mitigate the epidemic. Policies to reduce the spread of SARS-CoV-2 include border controls, mandatory or voluntary lockdowns, quarantines, social distancing, mask-wearing, and vaccination. These measures are effective because they restrict human movement and contact; however, they also seriously impact the economy. We focus on exploring the optimal balance between policy stringency and the economy using Reinforcement Learning (RL), combining Asynchronous Advantage Actor-Critic (A3C) with Proximal Policy Optimization (PPO). We use the compartmental SEIR model to train the agent and adjust the parameters of each compartment: susceptible, exposed, infected, and removed. The parameters of these four compartments are set so that the basic reproduction number of the SEIR model corresponds with the basic reproduction number of COVID-19. In the experiment, we focus on four prefectures in Japan – Hokkaido, Okinawa, Osaka, and Tokyo – and use data on tested positive cases from January 2020 to October 2021. There are five infection peaks in the data. It is difficult for the compartmental SEIR model to directly simulate the whole real situation in one pass. Hence, we create five environments to simulate these peaks, then use an optimally trained agent to interact with these environments to reach the goal. We train the agent on an Intel Core i9-10980XE CPU (18 cores, 36 threads) and an NVIDIA RTX 3090 GPU with 24 GB of memory.
With 18 A3C worker threads during training, the average reward rises over the course of training and plateaus after 500 episodes. The results show that the optimal agent can effectively suppress the increase in active cases. We also find that the agent implements strict policies when the number of infected cases increases, continues increasing for several days, or remains unchanged. On average, these strict policies are implemented in high-risk areas. Finally, population-weighted density represents the population density of an area better than traditional population density, so it is more accurate to use population-weighted density for pandemic infectivity studies. We also modify the SEIR model by adding a Quarantined (Q) compartment to form the SEIQR model. The experiment shows that we can simulate various situations and various epidemic diseases by modifying the traditional SEIR model. However, whether our trained agent can be used generally across different epidemic diseases depends on the states provided by the environment. If we can generalize these states across different epidemiological environments to identify the crucial information that is sufficient for the agent to judge whether to implement strict policies, we can construct a general epidemiological reward function from this information and train an agent that applies to different epidemic diseases. | en_US |
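The SEIR environment described in the abstract can be sketched as a minimal discrete-time simulation. This is an illustrative assumption, not the thesis's actual environment: the rate parameters (beta for transmission, sigma for incubation, gamma for removal) and the initial conditions below are placeholder values, chosen only to show how the compartment parameters fix the basic reproduction number R0 = beta / gamma, which the abstract says is matched to that of COVID-19.

```python
def simulate_seir(beta, sigma, gamma, S0, E0, I0, R0_comp, days):
    """Forward-Euler SEIR simulation with daily time steps.

    beta:  transmission rate
    sigma: incubation rate (1 / latent period)
    gamma: removal rate (1 / infectious period)
    Basic reproduction number: R0 = beta / gamma.
    Returns a list of (S, E, I, R) tuples, one per day.
    """
    N = S0 + E0 + I0 + R0_comp  # total population, conserved
    S, E, I, R = float(S0), float(E0), float(I0), float(R0_comp)
    history = []
    for _ in range(days):
        new_exposed = beta * S * I / N     # S -> E transitions
        new_infectious = sigma * E         # E -> I transitions
        new_removed = gamma * I            # I -> R transitions
        S -= new_exposed
        E += new_exposed - new_infectious
        I += new_infectious - new_removed
        R += new_removed
        history.append((S, E, I, R))
    return history


# Illustrative parameters (NOT the thesis's fitted values):
# infectious period ~7 days, latent period ~5 days, target R0 = 2.5.
gamma = 1 / 7
sigma = 1 / 5
beta = 2.5 * gamma  # chosen so that R0 = beta / gamma = 2.5

traj = simulate_seir(beta, sigma, gamma,
                     S0=1_000_000, E0=10, I0=1, R0_comp=0, days=365)
peak_infected = max(state[2] for state in traj)
```

In the RL setting sketched by the abstract, an agent's policy action (e.g. a lockdown level) would scale `beta` down at each step, trading reduced infections against economic cost in the reward; the SEIQR variant adds one more compartment and transition in the same pattern.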