Hands-On Python Fuzzy Search: Using difflib to Rescue Clumsy Typists

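Have you ever remembered only the rough shape of a file name, or mistyped a search term (say, "Mack" instead of "Mac"), only to have your program report "file not found"?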

This article walks you step by step through building a smart **fuzzy file search system** with `difflib`, a module in Python's standard library.

## Why Fuzzy Search?

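Standard string matching is usually all-or-nothing:

filename = "nic mac test.docx"
if "NIC MACK Test" in filename:
    print("found")  # never runs: one differing letter (or a different case) makes the test False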

But human memory is fuzzy. What we actually want is:

> "Find me the files that look like 'NIC MACK Test'."

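## The Core Tool: difflib

Python's `difflib` module is built for comparing sequences. We mainly use two of its features:

1. **`SequenceMatcher.ratio()`**: computes a similarity score between two strings, from 0.0 to 1.0.

2. **`get_close_matches()`**: picks the closest few candidates out of a list of strings.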
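
To make the two calls concrete, here is a minimal sketch. The sample strings are made up for this demo and are not taken from a real file listing; both functions come straight from the standard library.

```python
import difflib

# Hypothetical sample strings, chosen only for illustration.
typed = "nic mack test"
stored = "nic mac test"

# ratio() returns 2*M/T, where M is the number of matched characters
# and T is the combined length of both strings.
score = difflib.SequenceMatcher(None, typed, stored).ratio()
print(f"{score:.2f}")

# get_close_matches() returns up to n candidates that score at least cutoff,
# best match first. The defaults are n=3 and cutoff=0.6.
candidates = ["nic mac test", "diagnostic test tools", "nic test"]
close = difflib.get_close_matches(typed, candidates)
print(close)
```

Note that neither call lowercases its input, so any case normalization has to happen before the strings are passed in.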

---

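## Step-by-Step Walkthrough

### 1. Data Cleaning

The first step in any search is making the data easy to read.

Suppose the file names are long and include an extension: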

`[IEC FBT Open Test]…__COMMON__nic mac test.docx`

We can use `pathlib` to handle the paths, `.stem` to drop the `.docx` extension, and `split` to cut the name into keywords.

from pathlib import Path

# Suppose this is our list of files
files = [
    Path("C:/.../COMMON__nic mac test.docx"),
    Path("C:/.../COMMON__diagnostic test tools.docx"),
]

# Tip: p.stem drops .docx, so the last keyword comes out clean
# Result: {'COMMON__nic mac test.docx': ['COMMON', 'nic mac test'], ...}
dic_meta_list = {p.name: p.stem.split("__") for p in files}
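
The same cleaning step can be wrapped in a small helper so it is easy to try on its own. This is only a sketch: `split_segments`, the `sep` parameter, and the sample paths are hypothetical names introduced for illustration, and `PureWindowsPath` is used so that Windows-style paths parse the same way on any operating system without touching the disk.

```python
from pathlib import PureWindowsPath


def split_segments(path_str: str, sep: str = "__") -> list[str]:
    """Drop the extension, then split the remaining file name on sep."""
    return PureWindowsPath(path_str).stem.split(sep)


# Hypothetical paths, for illustration only.
samples = [
    "C:/docs/COMMON__nic mac test.docx",
    "C:/docs/COMMON__diagnostic test tools.docx",
]

segments_by_name = {PureWindowsPath(s).name: split_segments(s) for s in samples}
for name, segments in segments_by_name.items():
    print(name, "->", segments)
```

Keep in mind that `.stem` removes only the final suffix, so a name such as `archive.tar.gz` would keep its `.tar` part.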

### 2. Cutting the Noise: Setting a Cutoff

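This is the most important step. `get_close_matches` defaults to `cutoff=0.6`; it is tempting to start experimenting at a looser value such as 0.4, but a cutoff that low pulls in a pile of unrelated noise.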

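**Case study:**

* Query: `"NIC MACK Test"` (lowercased to `"nic mack test"` before comparison)

* Target: `"nic mac test"` (similarity ~ 0.96)

* Noise: `"diagnostic test tools"` (similarity ~ 0.47)

With `cutoff=0.4`, the noise gets through.

**Fix**: raise `cutoff` to **0.6**, which filters out the noise string that scores only about 0.47.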

import difflib

query = "NIC MACK Test"
candidates = ["nic mac test", "diagnostic test tools"]

# cutoff=0.6 (also the default) is a good rule of thumb
matches = difflib.get_close_matches(query.lower(), candidates, n=1, cutoff=0.6)
# Result: ['nic mac test'] (the noise is filtered out)
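
To see why the cutoff matters, it can help to score every candidate once and then check which ones survive each threshold. The sketch below reuses the two candidates from the case study; the `survivors` helper is a hypothetical name added for this demo.

```python
import difflib

query = "nic mack test"
candidates = ["nic mac test", "diagnostic test tools"]

# Score every candidate once so the effect of each cutoff is visible.
scores = {c: difflib.SequenceMatcher(None, query, c).ratio() for c in candidates}


def survivors(cutoff: float) -> list[str]:
    """Return the candidates whose score is at least cutoff."""
    return [c for c in candidates if scores[c] >= cutoff]


for cutoff in (0.4, 0.6):
    print(f"cutoff={cutoff}: {survivors(cutoff)}")
```

Printing the raw scores like this is a cheap way to pick a cutoff for your own file names before committing to a value.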


### 3. Sorting and Polishing the Output

Once we have results, and especially when there are many candidates, the closest match should come first.

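# Sort the scored results
# results layout: [(score, file name, matched segment), ...]
# (assumes results has already been filled in by the matching loop)
results.sort(key=lambda x: x[0], reverse=True)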
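
One detail worth knowing: `list.sort` is stable, so rows with equal scores keep their original order. If you would rather have ties ordered by file name, a compound key is one option. The rows below are hypothetical sample data in the same `(score, file name, matched segment)` layout.

```python
# Hypothetical rows, for illustration only.
results = [
    (0.80, "b_report.docx", "report"),
    (0.96, "nic_test.docx", "nic mac test"),
    (0.80, "a_report.docx", "report"),
]

# Negating the score gives descending order; the file name breaks ties ascending.
ranked = sorted(results, key=lambda row: (-row[0], row[1]))

for score, name, segment in ranked:
    print(f"{score:.2f}  {name}  ({segment})")
```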

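## Complete Code Example

This is our final, tuned version. It includes **extension stripping**, **case-insensitive matching**, **cutoff filtering**, and **result sorting**: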

from pathlib import Path
import difflib

# 1. Prepare the data
dirname = Path(r"C:\path\to\your\folder")  # placeholder: point this at your own folder
paths = dirname.glob("*.docx")

# Use p.stem to drop the extension before splitting
dic_meta_list = {p.name: p.stem.split("__") for p in paths}

user_query = "NIC MACK Test"  # deliberate typo for testing
query_lower = user_query.lower()  # lowercase once for case-insensitive matching

# 2. Match each file's segments against the query
results = []

for meta, lis in dic_meta_list.items():
    # Lowercase the segments as well
    lis_lower = [seg.lower() for seg in lis]

    # Strategy A: substring hit (a segment contains the whole query)
    if any(query_lower in seg for seg in lis_lower):
        results.append((1.0, meta, "contains the full keyword"))
        continue

    # Strategy B: fuzzy match
    # cutoff=0.6 filters out noise
    match_lower = difflib.get_close_matches(query_lower, lis_lower, n=1, cutoff=0.6)

    if match_lower:
        best_match = match_lower[0]
        # Compute the actual score so the results can be sorted
        score = difflib.SequenceMatcher(None, query_lower, best_match).ratio()
        results.append((score, meta, best_match))

# 3. Sort and print
# Highest score first
results.sort(key=lambda x: x[0], reverse=True)

print(f"\n{'='*15} Search results ({len(results)} found) {'='*15}")
for idx, (score, meta, match_info) in enumerate(results, 1):
    print(f"{idx:02d}. [similarity: {score:.2f}] {meta}")
    print(f"    └─ matched segment: {match_info}")
    print(f"{'-'*60}")
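
If you want to try the matching logic without a real folder, one option is to wrap it in a function that takes plain file names. The sketch below is a rearrangement under stated assumptions rather than a drop-in replacement: `fuzzy_find`, its `sep` parameter, and the sample names are hypothetical, introduced only for this demo.

```python
import difflib
from pathlib import PurePath


def fuzzy_find(query, file_names, cutoff=0.6, sep="__"):
    """Return (score, file_name, matched_segment) tuples, best match first."""
    needle = query.lower()
    results = []
    for name in file_names:
        # Drop the extension, split into segments, and lowercase them.
        segments = [seg.lower() for seg in PurePath(name).stem.split(sep)]

        # Substring hit: treat it as a perfect score.
        if any(needle in seg for seg in segments):
            results.append((1.0, name, needle))
            continue

        # Otherwise keep the single closest segment, if it clears the cutoff.
        close = difflib.get_close_matches(needle, segments, n=1, cutoff=cutoff)
        if close:
            score = difflib.SequenceMatcher(None, needle, close[0]).ratio()
            results.append((score, name, close[0]))

    results.sort(key=lambda row: row[0], reverse=True)
    return results


# Hypothetical file names, for illustration only.
names = ["COMMON__nic mac test.docx", "COMMON__diagnostic test tools.docx"]
for score, name, segment in fuzzy_find("NIC MACK Test", names):
    print(f"[{score:.2f}] {name} <- {segment}")
```

Because the function never touches the disk, it is easy to test; to search a real folder you could pass it `[p.name for p in dirname.glob("*.docx")]`.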