This study addresses the challenges posed by cloud occlusion in optical satellite imagery, which obscures surface details and degrades the accuracy of multi-temporal analyses and quantitative retrievals. To overcome this, we propose a conditional diffusion-based method for single-image cloud removal. A binary cloud mask is first generated by applying pixel-level differencing and K-means clustering to multi-spectral optical (Bands 2–4) and auxiliary infrared (Bands 9–11) images; the images are then geometrically corrected and co-registered using GDAL and Rasterio. The aligned data are partitioned into 128×128-pixel patches, which are oversampled and filtered based on cloud-coverage ratio and invalid-pixel thresholds, then augmented with flips and brightness adjustments to form a balanced and diverse training dataset. The model architecture comprises three modules: (1) a time-embedding unit employing sinusoidal encoding and an MLP, (2) a conditional encoder extracting multi-scale cloud representations through stacked convolutions and Time-Condition Fusion Blocks, and (3) a denoising autoencoder built upon a U-Net backbone integrated with Time-Condition Fusion Blocks.
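The time-embedding unit above can be illustrated with the standard transformer-style sinusoidal encoding. This is a minimal NumPy sketch, not the thesis code: the function name and the frequency base of 10000 are assumptions, and the subsequent MLP projection is omitted.

```python
import numpy as np

def sinusoidal_embedding(t, dim=128):
    """Map timesteps t to a dim-dimensional sinusoidal encoding.

    Each of the dim/2 frequency channels gets a sin and a cos component,
    with frequencies spaced geometrically from 1 down to 1/10000.
    """
    half = dim // 2
    # Geometric frequency ladder: exp(-ln(10000) * k / half) for k = 0..half-1.
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    # Broadcast timesteps against frequencies: shape (len(t), half).
    args = np.asarray(t, dtype=np.float64)[..., None] * freqs
    # Concatenate sin and cos parts into the final (len(t), dim) embedding.
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)
```

In the full model, this fixed encoding would typically be passed through a small MLP before being fused with image features in the Time-Condition Fusion Blocks.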
During training, we adopt a sigmoid β schedule and a curriculum-t sampling strategy, optimizing a dynamically weighted hybrid loss that combines an ε-loss, a ŷ₀-loss, and a weighted MS-SSIM term. Automatic mixed precision (AMP), exponential moving average (EMA), and gradient accumulation are employed to balance denoising performance, detail preservation, and structural fidelity. For inference, a simplified DDIM sampler (5–10 steps) is used, followed by overlap-averaging reconstruction and cloud-mask fusion to produce the final cloud-free output. The resulting images are quantitatively evaluated using the PSNR and SSIM metrics.
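The sigmoid β schedule mentioned above can be sketched as follows. This is an illustrative NumPy version under assumed parameters: the logit range of [-6, 6] and the β endpoints (1e-4, 2e-2) are common defaults, not values taken from the thesis.

```python
import numpy as np

def sigmoid_beta_schedule(T, beta_start=1e-4, beta_end=2e-2):
    """Sigmoid-shaped diffusion noise schedule over T steps.

    Betas rise slowly at the start of the trajectory, steeply in the
    middle, and flatten again near step T, unlike a plain linear ramp.
    """
    # Evaluate the logistic sigmoid on an assumed logit range [-6, 6].
    x = np.linspace(-6.0, 6.0, T)
    sig = 1.0 / (1.0 + np.exp(-x))
    # Rescale from (0, 1) into the [beta_start, beta_end] interval.
    return beta_start + (beta_end - beta_start) * sig
```

From these betas, a sampler would derive the cumulative ᾱ_t = ∏(1 − β_s) products that both the forward noising process and the DDIM update rule consume.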