Minimizing Write Amplification of Data Deduplication System with ML based Fingerprint Store


Please use this identifier to cite or link to this item: https://ir.lib.ncu.edu.tw/handle/987654321/98581

Title:	Minimizing Write Amplification of Data Deduplication System with ML based Fingerprint Store
Authors:	蔡仁豪;Tsai, Ren-Hao
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Solid-State Drive (SSD);Data Deduplication;Machine Learning
Date:	2025-08-18
Issue Date:	2025-10-17 12:57:06 (UTC+8)
Publisher:	National Central University (國立中央大學)
Abstract:	Solid-state drives (SSDs) have become the primary storage medium in modern computing because of their high throughput and low latency. However, SSDs face significant endurance challenges, driven primarily by write amplification, a phenomenon in which the device writes more data internally than the host requests, accelerating NAND wear and shortening device lifespan. Prior research has proposed various techniques to mitigate write amplification, including garbage collection optimizations, write coalescing, over-provisioning, and content-aware approaches such as data deduplication. While deduplication effectively eliminates redundant data and reduces write volume, it introduces substantial management overhead, especially when large-scale fingerprint indices must be maintained within limited memory. Recent advances in machine learning (ML) offer new opportunities for predictive caching and adaptive eviction, enabling systems to better identify and retain high-utility data. In this work, we propose an ML-enhanced caching strategy for deduplication-aware SSDs. The approach predicts the reuse potential of fingerprint entries, selectively retaining those with high expected utility in memory while aggressively evicting less relevant entries. This raises the deduplication cache hit rate under memory constraints and thereby indirectly mitigates write amplification. Experimental results show that the ML-based strategy achieves higher cache efficiency and prolongs SSD endurance compared with traditional heuristic-based methods.
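The abstract does not specify the model, its features, or the cache layout, so the following is only a minimal sketch of the general idea of score-based eviction in a bounded fingerprint cache. Every name here (`FingerprintCache`, `replay`, `frequency_over_age`) and the feature set (hit count, age) are assumptions made for illustration; `frequency_over_age` is a hand-written heuristic standing in for a trained predictor, not the thesis's method. The sketch counts only chunks written to flash and does not model garbage collection, so it is not a write amplification measurement.

```python
import hashlib
from typing import Callable, Dict, Iterable, Tuple


class FingerprintCache:
    """Bounded fingerprint -> address map that evicts the lowest-scored entry.

    Hypothetical structure for illustration; not taken from the thesis.
    """

    def __init__(self, capacity: int, predict_reuse: Callable[[dict], float]):
        self.capacity = capacity
        self.predict_reuse = predict_reuse  # stand-in for a trained model
        self.entries: Dict[bytes, dict] = {}
        self.clock = 0  # logical time, advanced once per write

    def lookup_or_insert(self, chunk: bytes, next_addr: int) -> Tuple[int, bool]:
        """Return (address, was_duplicate) for one incoming chunk."""
        self.clock += 1
        fp = hashlib.sha256(chunk).digest()
        entry = self.entries.get(fp)
        if entry is not None:
            entry["hits"] += 1
            entry["last_seen"] = self.clock
            return entry["addr"], True
        if len(self.entries) >= self.capacity:
            self._evict()
        self.entries[fp] = {"addr": next_addr, "hits": 0, "last_seen": self.clock}
        return next_addr, False

    def _features(self, fp: bytes) -> dict:
        entry = self.entries[fp]
        return {"hits": entry["hits"], "age": self.clock - entry["last_seen"]}

    def _evict(self) -> None:
        # Drop the entry whose predicted reuse score is lowest.
        victim = min(self.entries, key=lambda fp: self.predict_reuse(self._features(fp)))
        del self.entries[victim]


def frequency_over_age(features: dict) -> float:
    """Assumed placeholder score: frequently hit, recently seen entries rank higher."""
    return features["hits"] / (1 + features["age"])


def replay(chunks: Iterable[bytes], capacity: int,
           predict_reuse: Callable[[dict], float]) -> int:
    """Replay a write stream and return the number of chunks written to flash."""
    cache = FingerprintCache(capacity, predict_reuse)
    flash_writes = 0
    for chunk in chunks:
        _, duplicate = cache.lookup_or_insert(chunk, next_addr=flash_writes)
        if not duplicate:
            flash_writes += 1  # a cache miss forces a physical write
    return flash_writes


if __name__ == "__main__":
    stream = [b"A", b"A", b"B", b"C", b"A"]
    print(replay(stream, 2, frequency_over_age))      # score-based eviction
    print(replay(stream, 2, lambda f: -f["age"]))     # recency-only eviction
```

On this toy stream with a two-entry cache, the score-based policy keeps the fingerprint that has already been reused, while a recency-only policy evicts it and rewrites the same chunk later. That only illustrates the mechanism; it says nothing about the results reported in the thesis.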
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

