dc.description.abstract | With the rise of deep learning, its application to object detection and recognition has matured in recent years, and object detection has gradually extended to 3D applications such as self-driving cars, virtual reality, augmented reality, and robotic arms. Unlike 2D images, 3D images carry depth information, and this depth data makes 3D object detection more difficult: depth-image features must be extracted effectively, complex high-dimensional data must be handled, objects occlude one another, and scenes are cluttered. In this research, we propose a convolutional neural network (CNN) that directly estimates the position and size of 3D objects. The model extracts features from input RGB and depth images and outputs 3D bounding boxes.
Our model is adapted from the well-known 2D detection network YOLOv3, with two improvements. First, we modify the input to take both RGB and depth images and use channel attention to strengthen feature extraction; the resulting features are used for multi-scale detection and recognition. Second, we estimate the 3D translation by localizing the object center in the image and estimating the object's distance from the camera, and we add a quaternion term to the loss function so that the model can estimate the 3D rotation. The model predicts a 3D bounding box that contains the object class, 3D coordinates, position, and size.
In the experiments, we modified YOLOv3 into 6DoF YOLO, which predicts 3D bounding boxes. The Falling Things dataset contains 20854 images, 90% of which are used for training and the rest for testing. 6DoF YOLO achieves 89.33% mAP. After experimental analysis, we adopt the 6DoF SE-YOLO architecture as the final model. This architecture increases the number of parameters and the amount of computation by factors of 1.014 and 1.002, respectively. Our model reaches 93.59% mAP, and its average execution speed on 416×416 images is 35 frames per second. | en_US |
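The translation step described in the abstract (localize the object center in the image, then estimate its distance from the camera) can be illustrated with a pinhole-camera back-projection. This is only a minimal sketch under stated assumptions, not the implementation used in this work: the function name `backproject_center`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the reading of "distance" as depth along the optical axis are all hypothetical choices made for illustration.

```python
from typing import Tuple


def backproject_center(
    u: float, v: float, z: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float, float]:
    """Recover a 3D translation (x, y, z) in the camera frame.

    Assumptions (not taken from the abstract):
      - (u, v) is the predicted object center in pixels,
      - z is the predicted depth along the optical axis, in metres,
      - fx, fy, cx, cy are pinhole intrinsics of a hypothetical camera.
    """
    if z <= 0:
        raise ValueError("depth must be positive")
    # Invert the pinhole projection u = fx * x / z + cx, v = fy * y / z + cy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


if __name__ == "__main__":
    # Hypothetical intrinsics for a 416x416 image; values are illustrative only.
    print(backproject_center(308.0, 208.0, 2.0, 500.0, 500.0, 208.0, 208.0))
```

Under these assumptions, a center predicted at the principal point maps to a translation directly along the optical axis, and any pixel offset scales linearly with the estimated depth.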