Master's/Doctoral Thesis 109322081: Detailed Record




Name: 許又升 (Yu-Sheng Hsu)    Department: Department of Civil Engineering
Thesis Title: Multi-Target Multi-Camera Tracking and Reidentification with Artificial Neural Networks and Spatial-Temporal Information
Related Theses
★ An Interoperability Solution for IoT Actuation Functions
★ GeoWeb Crawler: An Extensible and Scalable Crawling Framework for Geospatial Web Resources
★ Improvement of a TDR Monitoring Platform and Establishment of Sensor Observation Services
★ Deriving Nearshore Bathymetry from High-Resolution Satellite Stereo Pairs
★ Integrating the oneM2M and OGC SensorThings API Standards to Build an Open IoT Architecture
★ A Multi-Attribute Indexing Framework for Massive IoT Data
★ An Efficient System for Identifying Representations of Heterogeneous Time-Series Data
★ A TOA-reflectance-based Spatial-temporal Image Fusion Method for Aerosol Optical Depth Retrieval
★ An Automatic Embedded Device Registration Procedure for the OGC SensorThings API
★ A Personalized GeoWeb Search Engine Based on Ontology and User Interests
★ Integrating City Models and IoT Open Standards with Ontologies for Smart City Applications
★ Concrete Bridge Crack Detection Using UAVs and Image Registration
★ GeoRank: A Geospatial Web Ranking Algorithm for a GeoWeb Search Engine
★ Monitoring Sea-Water Coverage by Fusing High Spatiotemporal-Resolution Remote Sensing Images
★ LoRaWAN Positioning based on Time Difference of Arrival and Differential Correction
★ Reverse-Engineering Artificial Neural Networks to Understand Remote Sensing Information: A Case Study of Landsat 8 Vegetation Classification
Files: Full text viewable in the system after 2027-01-01.
Abstract (Chinese) Surveillance cameras play an important role in traffic monitoring, commercial and home security, and criminal investigation. However, after the cameras continuously capture video, the information in it, such as object identities, scene semantics, or object locations, still has to be interpreted manually, which is inefficient and costly. This research aims to design an automated, low-cost method for tracking objects across surveillance cameras.
The workflow of this research consists of three parts: detection, tracking, and reidentification. Detection finds foreground objects in surveillance video. This research adopts the Mixture of Gaussians (MOG) method to build a background model for each camera view and obtain the foreground, and then applies morphological operations to remove noise and separate the individual foreground objects in a frame.
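As a minimal sketch of this detection step, the snippet below uses OpenCV's MOG2 background subtractor (Zivkovic's adaptive Mixture-of-Gaussians variant [29]) followed by morphological opening and closing; the file name, model parameters, and area threshold are illustrative assumptions, not values from the thesis.

import cv2

cap = cv2.VideoCapture("camera01.mp4")  # hypothetical per-camera video
# Adaptive Mixture-of-Gaussians background model (MOG2)
mog = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = mog.apply(frame)                                    # raw foreground mask
    fg = cv2.threshold(fg, 200, 255, cv2.THRESH_BINARY)[1]   # drop shadow pixels (value 127)
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, kernel)        # remove speckle noise
    fg = cv2.morphologyEx(fg, cv2.MORPH_CLOSE, kernel)       # fill holes in objects
    # One bounding box per sufficiently large connected foreground region
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 500]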
Tracking marks the same object across consecutive frames. To withstand frame jitter, noise, and occlusion, this research uses the Re3 neural-network tracker, whose Long Short-Term Memory (LSTM) model stabilizes the tracker, to obtain the bounding boxes of a given object within a single camera's video.
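Re3 itself is a research tracker (Gordon et al. [9]) and is not reproduced here; as a rough stand-in that shows the same init-once, update-per-frame interface the framework relies on, the sketch below uses OpenCV's bundled CSRT tracker (requires opencv-contrib-python; CSRT is not the tracker used in the thesis).

import cv2

cap = cv2.VideoCapture("camera01.mp4")
ok, frame = cap.read()
tracker = cv2.TrackerCSRT_create()        # stand-in for Re3: same init/update pattern
tracker.init(frame, (100, 150, 40, 80))   # initial box, e.g. from the detection step

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, box = tracker.update(frame)    # bounding box in the current frame
    if found:
        x, y, w, h = (int(v) for v in box)
        foot = (x + w / 2.0, y + h)       # bottom-center pixel: the "foot location"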
Reidentification decides whether objects seen in different videos are the same object. This research uses a convolutional neural network to extract object appearance features, together with each object's spatial-temporal information, as soft biometrics. By reducing the high-dimensional image data to a one-dimensional feature vector, the similarity between two object images can be compared to achieve reidentification.
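A minimal sketch of this embedding-and-matching step, assuming the Keras VGG16 [24] with ImageNet weights and global average pooling as the feature extractor (the thesis names VGG16, but the pooling choice and the similarity threshold below are assumptions):

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# VGG16 as a fixed feature extractor: each image crop -> one 512-d vector
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")

def embed(crop_bgr):
    # Resize an object crop (H, W, 3, BGR) and map it to a 1-D embedding
    img = tf.image.resize(crop_bgr[..., ::-1], (224, 224)).numpy()  # BGR -> RGB
    return extractor.predict(preprocess_input(img[np.newaxis]), verbose=0)[0]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query_emb, class_embeddings, threshold=0.7):  # threshold is illustrative
    # class_embeddings: {class_id: stored embedding}; None means no known class fits
    cid, score = max(((c, cosine_similarity(query_emb, e))
                      for c, e in class_embeddings.items()), key=lambda kv: kv[1])
    return cid if score >= threshold else None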
For the spatial-temporal features, this research manually selects control points in each camera view and projects the image coordinates of the different cameras onto a unified coordinate system. Objects that appear at the same time point and whose distance in the unified coordinate system stays consistently small are treated as the same class.
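A minimal sketch of the projection onto the unified coordinate system, assuming a planar ground and a per-camera homography estimated from the manually selected control points (the coordinates and the 1 m linking gate below are illustrative; the thesis's exact transform is not specified in the abstract):

import numpy as np
import cv2

# >= 4 manually selected control points: pixel coords and unified-frame coords
img_pts = np.array([[102, 380], [510, 362], [620, 700], [60, 715]], dtype=np.float32)
world_pts = np.array([[0.0, 0.0], [8.0, 0.0], [8.0, 6.0], [0.0, 6.0]], dtype=np.float32)
H, _ = cv2.findHomography(img_pts, world_pts)  # ground-plane homography, one per camera

def to_world(foot_xy):
    # Project an image-space foot location into the unified frame (metres)
    p = np.array([[foot_xy]], dtype=np.float32)  # shape (1, 1, 2)
    return cv2.perspectiveTransform(p, H)[0, 0]

def same_object(foot_a, foot_b, max_dist=1.0):
    # Same-time linking: same class if the unified-frame distance stays small
    return np.linalg.norm(to_world(foot_a) - to_world(foot_b)) < max_dist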
For objects that appear at different time points, this research designs a spatial-temporal rationality function that considers the time difference, the distance, and the moving speed of the two objects to compute how plausible it is that they belong to the same class, and uses this plausibility to screen candidates for appearance matching.
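The abstract does not give the exact form of the rationality function; the sketch below is a hypothetical illustration of the idea: a match between a new detection and a known class is plausible only if the speed implied by covering the distance in the elapsed time is close to the class's observed moving speed.

import numpy as np

def st_rationality(last_pos, last_time, cand_pos, cand_time, speed, tol=2.0):
    # Hypothetical spatial-temporal rationality score in [0, 1].
    # last_pos/last_time: last known unified-frame location and time of a class;
    # cand_pos/cand_time: location and time of the new detection;
    # speed: the class's observed moving speed (m/s); tol widens the acceptance band.
    dt = cand_time - last_time
    if dt <= 0:
        return 0.0
    implied = np.linalg.norm(np.asarray(cand_pos) - np.asarray(last_pos)) / dt
    # Plausibility decays as the implied speed departs from the observed speed
    return float(np.exp(-abs(implied - speed) / tol))

# Classes scoring below a gate (e.g. 0.5) are dropped before appearance matching.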
In addition, by comparing the camera's viewing direction with an object's moving direction, the aspect of the object as captured by that camera can be derived and used as a further candidate condition for appearance feature matching.
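A small hypothetical sketch of this aspect estimate: it buckets the angle between the camera's viewing direction and the object's moving direction (both in the unified frame) into the side of the object the camera sees, so that a query crop is only compared against stored crops with the same label.

import numpy as np

def object_aspect(cam_dir, move_dir):
    # Which side of the object the camera captures, from two 2-D direction vectors
    c = np.asarray(cam_dir, dtype=float); c /= np.linalg.norm(c)
    m = np.asarray(move_dir, dtype=float); m /= np.linalg.norm(m)
    # Signed angle from the object's heading to the camera's viewing direction
    ang = np.degrees(np.arctan2(m[0] * c[1] - m[1] * c[0], np.dot(m, c)))
    if abs(ang) < 45:
        return "back"    # camera looks along the motion: it sees the object's back
    if abs(ang) > 135:
        return "front"   # camera looks against the motion: it sees the front
    return "left" if ang > 0 else "right"  # left/right depends on axis convention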
To validate the proposed method, this research uses the Unity game engine to build a virtual office scene containing 7 fixed cameras and two sets of 6 objects each, which move through the scene along fixed routes at fixed speeds, as the dataset. Path errors are computed for the tracking results and class consistency for the reidentification results. The single-camera object tracking error is about 1 m, or 0.8 m when averaged with weights given by each object's frame count, while the multi-camera tracking error derived from the classification results is about 2-3 m. The class consistency of reidentification reaches 80%, meaning that 80% of the samples within a class belong to the same object. The proposed method links separate surveillance cameras for cross-camera object tracking and is expected to find effective applications in security, disaster prevention, unmanned stores, smart cities, and related fields.
Abstract (English) Closed-circuit television (CCTV) has been widely used in applications such as security control, traffic monitoring, locating missing people, and unmanned stores. CCTV systems provide real-time video feeds that usually require human interpretation to extract information, which is expensive and inefficient. This research aims to design a framework that automatically extracts the locations of moving targets from CCTV systems. The framework includes three main steps: Detection, Tracking, and Reidentification. For Detection, we use the Mixture of Gaussians (MOG) method and morphological enhancement to separate the foreground from the background. Afterward, we initialize an Re3 (Real-Time Recurrent Regression) tracker to track each stable object detected in the MOG foreground. The tracker continuously outputs bounding boxes of an object, which provide two key pieces of information: object image crops and object foot locations. To classify the identity of objects (i.e., Reidentification), we first apply Geo-Matching, which compares the object foot locations detected by different cameras to link the objects in these cameras together. In the meantime, we use VGG16 to extract a feature embedding from each object image crop, which is matched against known classes via cosine similarity. In addition, to improve feature matching performance and avoid wrong matches, we use an object's foot locations, moving velocity, and the last locations of known classes to estimate the spatial-temporal rationality of a correct match for each class. Furthermore, the moving direction of an object helps estimate the captured object's aspect in the image crops, which serves as a constraint for selecting suitable candidate class images with similar aspects to improve feature matching accuracy. For the testing dataset, we simulate a relatively ideal environment, an office with 2 sets of 6 moving objects and 7 cameras, in Unity, where high-definition videos are obtained without noise. As a result, the proposed solution reaches a single-camera object tracking error of about 1 m, a multi-camera multi-target tracking error of 2-3 m, and over 80% classification consistency. Building on this research, applications can be further developed in public surveillance, disaster prevention, unmanned stores, and smart cities.
Keywords (Chinese) ★ Object Tracking
★ Object Reidentification
★ Artificial Neural Network
★ Surveillance Camera (CCTV)
Keywords (English) ★ Object Tracking
★ Reidentification
★ Artificial Neural Network
★ CCTV
Table of Contents
Abstract (Chinese)
Abstract
List of Figures
List of Tables
Chapter 1 Introduction
    1-1 Motivation
    1-2 Literature Review
Chapter 2 Methodology
    2-1 Detection
    2-2 Tracking
    2-3 Reidentification
Chapter 3 Results & Evaluations
    3-1 Experiments
    3-2 Results
    3-3 Evaluations
    3-4 Discussion
Chapter 4 Conclusions
Chapter 5 Future Work
References
References
[1] Albawi, S., Mohammed, T. A., & Al-Zawi, S. (2017, August). Understanding of a convolutional neural network. In 2017 International Conference on Engineering and Technology (ICET) (pp. 1-6). IEEE.
[2] Danelljan, M., Häger, G., Khan, F., & Felsberg, M. (2014). Accurate scale estimation for robust visual tracking. In British Machine Vision Conference, Nottingham, September 1-5, 2014. BMVA Press.
[3] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248-255). IEEE.
[4] Denman, S., Bialkowski, A., Fookes, C., & Sridharan, S. (2011). Determining operational measures from multi-camera surveillance systems using soft biometrics. In 2011 8th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 462-467). IEEE.
[5] Deza, M. M., & Deza, E. (2006). Dictionary of distances. Elsevier.
[6] Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1(1), 269-271.
[7] Bradski, G. R. (1998). Real time face and object tracking as a component of a perceptual user interface. In Proceedings Fourth IEEE Workshop on Applications of Computer Vision (WACV'98) (pp. 214-219). IEEE. doi:10.1109/ACV.1998.732882
[8] Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 580-587).
[9] Gordon, D., Farhadi, A., & Fox, D. (2018). Re3: Real-time recurrent regression networks for visual tracking of generic objects. IEEE Robotics and Automation Letters, 3(2), 788-795.
[10] Granstrom, K., Baum, M., & Reuter, S. (2016). Extended object tracking: Introduction, overview and applications. arXiv preprint arXiv:1604.00970.
[11] Guttman, A. (1984, June). R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (pp. 47-57).
[12] Hashemi, N. S., Aghdam, R. B., Ghiasi, A. S. B., & Fatemi, P. (2016). Template matching advances and applications in image analysis. arXiv preprint arXiv:1610.07231.
[13] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
[14] Held, D., Thrun, S., & Savarese, S. (2016). Learning to track at 100 FPS with deep regression networks. In European Conference on Computer Vision (pp. 749-765). Springer, Cham.
[15] Huang, W., Hu, R., Liang, C., Yu, Y., Wang, Z., Zhong, X., & Zhang, C. (2016, January). Camera network based person re-identification by leveraging spatial-temporal constraint and multiple cameras relations. In International Conference on Multimedia Modeling (pp. 174-186). Springer, Cham.
[16] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016, October). SSD: Single shot multibox detector. In European Conference on Computer Vision (pp. 21-37). Springer, Cham.
[17] Lowe, D. G. (1999, September). Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision (Vol. 2, pp. 1150-1157). IEEE.
[18] Mittal, A., & Paragios, N. (2004, June). Motion-based background subtraction using adaptive kernel density estimation. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004) (Vol. 2, pp. II-II). IEEE.
[19] Nam, H., & Han, B. (2016). Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4293-4302).
[20] Norris, C., McCahill, M., & Wood, D. (2004). The growth of CCTV: A global perspective on the international diffusion of video surveillance in publicly accessible space. Surveillance & Society, 2(2/3).
[21] Plantinga, A. (1961). Things and persons. The Review of Metaphysics, 493-519.
[22] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 779-788).
[23] Ristani, E., & Tomasi, C. (2018). Features for multi-target multi-camera tracking and re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6036-6046).
[24] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[25] Tang, Z., Naphade, M., Liu, M. Y., Yang, X., Birchfield, S., Wang, S., ... & Hwang, J. N. (2019). CityFlow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8797-8806).
[26] Wang, H., Su, H., Zheng, K., Sadiq, S., & Zhou, X. (2013, January). An effectiveness study on trajectory similarity measures. In Proceedings of the Twenty-Fourth Australasian Database Conference (Vol. 137, pp. 13-22).
[27] Yilmaz, A., Javed, O., & Shah, M. (2006). Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4), 13-es.
[28] Zhao, Z. Q., Zheng, P., Xu, S. T., & Wu, X. (2019). Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11), 3212-3232.
[29] Zivkovic, Z. (2004, August). Improved adaptive Gaussian mixture model for background subtraction. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004) (Vol. 2, pp. 28-31). IEEE.
[30] Zivkovic, Z., & Van Der Heijden, F. (2006). Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognition Letters, 27(7), 773-780.
Advisor: 黃智遠    Date of Approval: 2022-01-25
