In recent years, with the advancement of GPUs and the advent of the big-data era, deep learning has displaced earlier algorithms, bringing revolutionary progress to fields such as face detection, face recognition, image segmentation, and speech recognition. However, the power consumption and cost of GPUs make it difficult to execute computationally intensive neural networks on edge devices. This has prompted much recent research on light-weighting neural networks and on hardware acceleration with digital circuits. Convolution layers are the most computationally expensive part of a convolutional neural network (CNN) at the inference stage, and many architectures have been designed to handle them efficiently. Among these designs, systolic arrays with highly pipelined structures can accelerate general matrix-matrix multiplication (GEMM) efficiently. However, to process convolution as a GEMM, each layer requires Image-to-Column (IM2COL) preprocessing, which demands larger internal memory and repeated memory accesses. This paper proposes an extended architecture that combines the proposed IM2COL circuits with systolic arrays to maximize data reuse. The goal is to solve the problem of high memory access in GEMM-based CNN accelerators through a memory-reduction method and a corresponding hardware architecture. We design a data-reorganization transformation unit (TU) for the GEMM unit to reduce the memory accesses caused by the redundant data generated by IM2COL. In addition, we introduce a stripe-based dataflow to reduce the memory requirement of the proposed TU. With proper data reuse, the TU saves around 87% of the memory accesses. A prototype of the proposed accelerator comprising 1024 multiply-accumulate (MAC) units achieves 512 GOPS. The degree of parallelization can be configured according to the available hardware resources and performance constraints.
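To make the IM2COL redundancy concrete, the following is a minimal NumPy sketch, not the paper's hardware circuit or dataflow; the function names `im2col` and `conv2d_gemm` are illustrative. It shows how unfolding the input turns convolution into a single GEMM, and how much the unfolded matrix inflates the input, which is the memory-access overhead the TU is designed to remove.

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unfold an input feature map (C, H, W) into columns so that
    convolution becomes one matrix-matrix multiplication (GEMM).
    Each column holds one receptive field; with stride 1, every
    interior input pixel is replicated up to kh*kw times, which is
    the redundancy the paper's transformation unit (TU) targets."""
    c, h, w = x.shape
    oh = (h - kh) // stride + 1
    ow = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[:, i * ow + j] = patch.ravel()
    return cols

def conv2d_gemm(x, weights, stride=1):
    """Convolution as GEMM: (M, C*kh*kw) @ (C*kh*kw, oh*ow)."""
    m, c, kh, kw = weights.shape
    cols = im2col(x, kh, kw, stride)
    oh = (x.shape[1] - kh) // stride + 1
    ow = (x.shape[2] - kw) // stride + 1
    return (weights.reshape(m, -1) @ cols).reshape(m, oh, ow)

# Example: a 3x3 kernel over a 64-channel 64x64 map (illustrative
# sizes, not from the paper).
x = np.random.rand(64, 64, 64).astype(np.float32)
w = np.random.rand(128, 64, 3, 3).astype(np.float32)
cols = im2col(x, 3, 3)
print(cols.nbytes / x.nbytes)   # ~8.4x data expansion from IM2COL
y = conv2d_gemm(x, w)
print(y.shape)                  # (128, 62, 62)
```

As the printed ratio suggests, naive IM2COL with a 3x3 kernel and stride 1 re-reads each interior pixel up to 9 times, so fetching each pixel once and reusing it across all overlapping windows can eliminate up to 8/9 (about 89%) of the accesses, the same order of magnitude as the roughly 87% saving reported above.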