This study aims to develop a robotic arm system capable of six-degree-of-freedom (6-DoF) grasping for object recognition and autonomous manipulation in cluttered, stacked scenes. The proposed system integrates YOLOv11 for instance segmentation and uses RGB-D data captured by a depth camera; after depth-image preprocessing, texture mapping is applied to generate feature-rich point clouds. To address the occlusion and depth holes caused by single-view limitations, the system incorporates multi-view scanning and point cloud fusion. Through hand-eye calibration and the Iterative Closest Point (ICP) algorithm, it establishes the three-dimensional relationship between the robot arm and its working environment.
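To make the multi-view fusion step concrete, the sketch below shows how per-view point clouds can be built from preprocessed RGB-D frames and registered with ICP. It is a minimal illustration, not the thesis implementation: Open3D is assumed as the point cloud library, the parameter values are placeholders, and in the actual system the initial ICP transform would come from the hand-eye calibration rather than the identity matrix used here.

```python
# Minimal multi-view fusion sketch (assumptions: Open3D, placeholder parameters).
import numpy as np
import open3d as o3d

def rgbd_to_cloud(color_img, depth_img, intrinsic):
    """Convert one preprocessed RGB-D frame into a textured point cloud."""
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        o3d.geometry.Image(color_img), o3d.geometry.Image(depth_img),
        depth_scale=1000.0, convert_rgb_to_intensity=False)
    return o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsic)

def fuse_views(clouds, voxel=0.005, max_corr_dist=0.02):
    """Register every view onto the first one with point-to-plane ICP and merge."""
    merged = clouds[0].voxel_down_sample(voxel)
    merged.estimate_normals()
    for cloud in clouds[1:]:
        src = cloud.voxel_down_sample(voxel)
        src.estimate_normals()
        # In the real system the initial guess would come from hand-eye calibration.
        result = o3d.pipelines.registration.registration_icp(
            src, merged, max_corr_dist, np.eye(4),
            o3d.pipelines.registration.TransformationEstimationPointToPlane())
        src.transform(result.transformation)
        merged += src
    return merged.voxel_down_sample(voxel)
```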
For grasp pose generation, the system adopts the Grasp Pose Detection (GPD) method to sample and evaluate candidate grasp poses on the preprocessed point cloud. Feasible grasps are then selected by combining segmentation results and semantic object classification. The entire system is integrated using the Robot Operating System (ROS) framework, enabling real-time perception, decision-making, and motion control. Experiments are designed for scenarios including stacked object classification and unknown clutter handling, evaluating recognition accuracy, grasp success rate, and task stability under various conditions. Results show that the system achieves an 89.33% success rate in automatic grasping and sorting of stacked objects, and 77.46% in cluttered scenes containing unknown objects. The system demonstrates not only the ability to selectively pick target items, but also strong generalization in handling occlusion, mixed categories, and previously unseen objects.
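As an illustration of how GPD output can be combined with the instance segmentation result, the following sketch filters candidate grasps so that only those lying on the point cloud segment of the requested class are kept, and returns the highest-scoring one. All names, data layouts, and thresholds are assumptions made for illustration; the thesis's actual data structures are not specified here.

```python
# Sketch of combining GPD-style candidates with segmentation (illustrative only).
import numpy as np

def select_grasp(candidates, segments, target_class, dist_thresh=0.01):
    """candidates: iterable of (position[3], rotation[3x3], score) from a
    GPD-style detector; segments: dict of class name -> (N, 3) segmented points."""
    target_points = segments.get(target_class)
    if target_points is None or len(target_points) == 0:
        return None  # requested class was not detected in the scene
    best = None
    for position, rotation, score in candidates:
        # Keep a candidate only if its grasp centre lies on the target segment.
        nearest = np.min(np.linalg.norm(target_points - np.asarray(position), axis=1))
        if nearest > dist_thresh:
            continue
        if best is None or score > best[2]:
            best = (np.asarray(position), np.asarray(rotation), score)
    return best
```

In a ROS-based pipeline such as the one described, the selected pose would then typically be converted to a geometry_msgs/PoseStamped message and published to the motion-control node.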