Python 打造高容錯搜尋引擎：BM25、Bigram 與difflib自動糾錯實戰

在企業級的文檔處理中，檔案名稱往往冗長且包含各種編號（如 IECFBT SPEC__…__NIC MAC TEST__replaced.docx）。使用傳統的 in 關鍵字搜尋往往只能「全有全無」，無法處理錯字或優先權排序。

本文將展示如何結合 BM25（權重排序）、Bigram（連詞強關聯）與 difflib（拼字檢查），打造一個能讀懂「使用者意圖」的智能檢索系統。

核心概念三劍客
在開始寫程式碼之前，我們先理解為什麼需要這三個元件：

Bigram (N-gram 語言模型)

問題：搜尋 “Scan Barcode” 時，如果只斷詞成 [‘scan’, ‘barcode’]，可能會搜到很多只有 “Scan” 但無關的檔案。
解法：Bigram 會多產生一個 scan_barcode 的特徵。這就像是把「Scan」和「Barcode」用強力膠黏在一起，搜尋時能大幅提高精準度。

BM25 (Best Matching 25)

問題：搜尋結果有 100 個，哪一個最重要？
解法：BM25 是各種搜尋引擎（如 Elasticsearch）的演算法基石。它不只看關鍵字出現幾次，還會看這個字「稀不稀有」。稀有字的權重更高。

difflib (模糊比對)

問題：使用者輸入 “NIC MACK Test” (多打一個 K)，導致搜尋結果為 0。
解法：利用 difflib 找出 MACK 與詞庫中的 MAC 很像，自動建議或修正，讓搜尋容錯率大增。

實戰環境準備
這是一個真實場景的模擬，我們針對指定的測試資料夾進行索引建立。

步驟一：路徑設定與資料載入
首先，我們需要鎖定目標資料夾，並進行基礎的檔名清洗。

from pathlib import Path

# 真實路徑設定
BASE_DIR = Path(r"C:\Users\OneDrive\RCCA")

# 模擬讀取 docx 檔案
if not BASE_DIR.exists():
    print("⚠️ 路徑不存在，請確認 VPN 或 OneDrive 狀態")
else:
    file_list = list(BASE_DIR.glob("*.docx"))
    print(f"📂 成功載入 {len(file_list)} 個文件。")

步驟二：斷詞與 Bigram 實作 (Tokenization)
這是引擎的「消化系統」。我們要將原本的字串切碎，並生成連詞。

import jieba

STOPWORDS = {"docx", "replaced", "v1", "spec"}

def tokenize(text, use_bigram=True):
    """
    輸入: "NIC MAC Test"
    輸出: ['nic', 'mac', 'test', 'nic_mac', 'mac_test']
    """
    # 1. 基礎斷詞 (這裡簡化示範，實戰可加入 OpenCC 繁簡轉換)
    words = jieba.lcut_for_search(text.lower())
    
    # 2. 過濾雜訊
    tokens = [w for w in words if w.strip() and w not in STOPWORDS]
    
    # 3. Bigram (核心關鍵！)
    # 將相鄰的詞兩兩合併，強化「詞組」的搜尋權重
    if use_bigram and len(tokens) > 1:
        bigrams = [f"{tokens[i]}_{tokens[i+1]}" for i in range(len(tokens)-1)]
        tokens.extend(bigrams)
        
    return tokens

# 測試效果
print(tokenize("Scan Barcode Test"))
# 輸出: ['scan', 'barcode', 'test', 'scan_barcode', 'barcode_test']

步驟三：拼字檢查 (difflib)

這是引擎的「容錯機制」。

import difflib

# 假設這是我們從所有檔名中建立好的全域詞庫
VOCAB_SET = {'nic', 'mac', 'test', 'cpu', 'stress', 'scan', 'barcode'}

def smart_correction(word):
    # 1. 如果字原本就是對的，直接回傳
    if word in VOCAB_SET:
        return word
    
    # 2. 複合詞優化 (例如 cpu_stress)
    # 邏輯：有底線 且 切開後的每個部分都是對的 -> 視為合法
    if "_" in word and all(sub in VOCAB_SET for sub in word.split("_")):
        return word

    # 3. 模糊比對
    # 找最像的一個字，相似度門檻設為 0.6
    matches = difflib.get_close_matches(word, VOCAB_SET, n=1, cutoff=0.6)
    
    if matches:
        print(f"   💡 修正建議: '{word}' -> '{matches[0]}'")
        return matches[0]
        
    return word

步驟四：整合 BM25 搜尋引擎

最後，將所有零件組裝起來。

from rank_bm25 import BM25Okapi

# 1. 建立索引 (Index)
# 假設我們已經把所有檔名都斷好詞 stored 在 corpus 裡
corpus = [
    tokenize("NIC MAC Test Report"),
    tokenize("CPU Stress Test Log"),
    tokenize("Scan Barcode Module")
]

bm25 = BM25Okapi(corpus)

# 2. 使用者查詢 (Query)
user_query = "NIC MACK Test"  # 使用者手誤打錯了 MAC

# A. 前處理：先做拼字檢查
query_tokens = tokenize(user_query, use_bigram=False) # Query 通常不做 Bigram 以免過度發散
corrected_tokens = [smart_correction(t) for t in query_tokens]
# 結果: ['nic', 'mack', 'test'] -> ['nic', 'mac', 'test']

# B. 搜尋：BM25 算分
scores = bm25.get_scores(corrected_tokens)

# 3. 結果展示
print(f"搜尋關鍵字: {corrected_tokens}")
print(f"文件相關性得分: {scores}")