Facial and Action Analysis using bag-of-word Features

Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/77789

Title:	使用bag-of-word特徵進行人臉與行為分析;Facial and Action Analysis using bag-of-word Features
Authors:	徐勝斌;Hsu, Sheng-Bin
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Facial Expression Recognition;Action Recognition;Fall Detection;Facial Analysis
Date:	2018-08-23
Issue Date:	2018-08-31 14:56:35 (UTC+8)
Publisher:	National Central University
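Abstract:	This thesis addresses two topics, facial analysis and action analysis, which together cover facial information extraction, facial expression recognition, fall detection, and action recognition. For facial information extraction, image processing techniques are used to extract facial features and present them visually. Five facial features are detected, extracted, stored, and retrieved: the face contour, cheek skin color, nasolabial folds (smile lines), hairline, and melanocytes. This information can be used for facial care or aesthetic medicine, and users can record it over the long term. Facial expression is one form of nonverbal communication between people; for example, a change in a patient's expression caused by pain conveys important information to medical staff. Expression recognition can likewise be applied to infant care: parents can keep a long-term record of a baby's mood, which may be affected by factors such as diaper comfort and physical condition. Recently, a growing number of researchers have developed algorithms for automatic expression recognition in still images or image sequences. However, most current methods rely on off-the-shelf feature extraction steps to obtain expression descriptors. In addition, many techniques for human behavior and action recognition have been proposed over recent decades. Because the input is an image sequence, features that carry both spatial and temporal information are usually extracted. Feature representations fall broadly into two classes, global representation and local representation. Global representations capture the shape or silhouette of the human body; because they retain more information about the action, they perform well, but they require foreground detection with precise localization and tracking of the body, and they are affected by viewpoint changes, noise, and occlusion. Action recognition also has many everyday applications for the elderly and children, such as monitoring the daily amount of specific activities to support healthy living. Moreover, homes contain a lot of furniture, around which children may engage in dangerous behavior such as running and jumping, so real-time detection and recognition can effectively alert guardians. For expression and action recognition, to learn automatically from the image data how to describe dynamic variation (including temporal and spatial correlation), we propose an unsupervised single-layer network for local feature extraction. It captures the local shape and dynamics of an expression and thereby describes the overall change in facial expression. To avoid problems such as precise localization and foreground detection, local representations detect space-time interest points (STIPs) in the video and, assuming these points mark the most informative regions of the action, extract dynamic or static features there to characterize the action. Compared with global descriptors, local representations are more invariant to rotation, translation, and scaling, and they effectively reduce the influence of cluttered backgrounds, the completeness of the body shape, and the camera. The method resembles the bag-of-features approach, comprising local feature extraction, vocabulary generation, feature vector representation, and pooling, followed by an SVM for recognition. Falling is another common life-threatening event. Whether a fall results from careless walking or from illness, the critical window for rescue may be missed if it is not noticed in time. With advances in technology, surveillance cameras are now ubiquitous, so they can be used to monitor pedestrians for abnormal states and avoid missing that window. This study proposes a multi-view manifold learning approach that learns the normal walking patterns of pedestrians separately for several viewing angles, so that abnormal events can be identified for pedestrians moving in different directions. In training, Locality Preserving Projections (LPP) are used to build a pedestrian model for each view, and the maximum Hausdorff distance is used to determine whether a fall has occurred. Finally, the experiments evaluate accuracy on these topics using several datasets.
This study addresses two topics: facial analysis and action analysis. A visual face feature extraction scheme based on image processing techniques, with a visualized report, is proposed. Five visual face features (face contours, face colors, smile lines, hairlines, and melanocytes) are detected, extracted, stored, and retrieved for aesthetic medicine. The facial analysis results can be recorded over the long term for later retrieval by the user.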
Facial expression of emotion is a form of nonverbal communication between people. For example, the face of a patient in pain reveals that pain, and an alarm message can be sent to medical staff when the system detects a pain expression. Emotion recognition can also be applied to recording a baby's mood: the application can record daily mood changes over the long term and report them to the parents. Many factors can affect a baby's emotion, such as diaper comfort, physical condition, and room temperature. Researchers are therefore increasingly interested in developing algorithms for automatic recognition of facial emotion in still images and videos. However, most existing methods for facial emotion recognition use off-the-shelf feature extraction methods for classification. Recently, video-based human motion analysis and recognition has attracted a great deal of attention due to its potential applications and wide usage in areas such as video surveillance, human-computer interaction, video indexing, sports event analysis, customer attributes, and shopping behavior analysis. Most published methods use either global or local visual features for human action analysis. Generally, an action is considered as a volume of video data in both the space and time domains. Global features offer a global representation and discriminative power. However, they are sensitive to intra-class variation and to deformation caused by cluttered backgrounds and partial occlusion in action sequences, and background distortion lowers the accuracy rate. The local visual representation of actions, in contrast, extracts local features from interest points in the spatial and temporal domains. In addition, the likelihood of falling is relatively high among the elderly, and a fall can be regarded as a life-threatening event.
Behavioral analysis can be used to record the daily activity of an elderly person or child in a particular area. Some actions, such as running or jumping, are very dangerous for children in certain places. In addition to monitoring a child's dangerous behavior, the service can also record normal behavior, including waving, bending, and walking, to ensure that the elderly or children get enough daily activity for healthy living. To learn better spatiotemporal features for emotion and action representation, the proposed unsupervised single-layer networks are applied to automatically learn local features, which explicitly depict the appearance and dynamic variations caused by facial emotion and human action. The approach combines the properties of the local visual representation with a learning-based model that extracts features automatically. The local visual representation is robust to intra-class variability caused by scale, pose changes, and occlusion, while the learning-based model avoids handcrafted features computed from a local cuboid around interest points. This method is also similar to the bag-of-features approach, including local feature extraction, vocabulary generation, feature vector representation, and pooling. Finally, we use a non-linear SVM with an RBF kernel to recognize the facial emotion or human action. In addition, falling can cause severe harm to senior citizens. The ideal time for rescue is immediately after the fall. However, falls are not always detected immediately, so real-time detection using video surveillance systems could save lives. Nowadays, digital cameras are installed everywhere, and human activity is monitored using cameras connected to intelligent programs. An alarm can be sent to the administrator when an abnormal event occurs. In this paper, a multi-view manifold learning algorithm is proposed for detecting falling events.
This algorithm is able to detect people falling in any direction. First, walking patterns at a normal speed are modeled by a locality preserving projection (LPP). Since the duration of a fall cannot easily be segmented from a video, partial temporal windows are matched with the normal walking patterns. The Hausdorff distances are calculated for comparison. In the experiments, falls were effectively detected using the proposed method.
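As a rough illustration of the window-matching step described in the abstract, the sketch below slides a short temporal window over a sequence of frames and compares each window with a set of normal walking points using a Hausdorff distance. It is only a sketch under several assumptions: the frames are taken to be already embedded in a low-dimensional space (for example by LPP, which is not implemented here); the function names `directed_hausdorff` and `flag_windows`, the window length, the threshold, and the toy data are all hypothetical; and the directed (one-sided) form of the distance is an assumption, chosen because a partial window covers only part of the walking cycle. The abstract does not state which variant the thesis uses.

```python
import numpy as np


def directed_hausdorff(a, b):
    """Directed Hausdorff distance from point set a to point set b.

    Rows are points. For each point in a, take the distance to its nearest
    point in b, then return the largest of those nearest distances.
    """
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return float(pairwise.min(axis=1).max())


def flag_windows(frames, normal, window=4, threshold=0.5):
    """Flag each temporal window whose distance to the normal pattern
    exceeds the threshold (window and threshold are illustrative values)."""
    flags = []
    for start in range(len(frames) - window + 1):
        dist = directed_hausdorff(frames[start:start + window], normal)
        flags.append(bool(dist > threshold))
    return flags


# Toy data (assumption): normal walking embeds as points on a unit circle.
angles = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
normal = np.column_stack([np.cos(angles), np.sin(angles)])

# Eight frames of normal walking, then four frames drifting off the pattern.
walk = normal[:8]
fall = walk[-1] + np.outer(np.arange(1, 5), [0.0, 2.0])
frames = np.vstack([walk, fall])

flags = flag_windows(frames, normal)
print(flags)
```

In this toy sequence, windows made up only of walking frames lie on the normal pattern and are not flagged, while every window that contains at least one drifting frame is flagged. A multi-view version would keep one `normal` set per viewing angle, which the sketch leaves out.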
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation
