Advanced Python Regex: (?i…) Inline Modifiers vs. the re.I Global Flag, and Building a Ruthless "Aggressive Truncation" Lawnmower! How `(?i)`, `flags=re.I`, and `(?i:…)` Differ

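This tutorial covers the ignore-case options in Python's `re` module,
using `IloConnectProcessStep` as the running example.

The key points up front:

1. `flags=re.I` makes the entire regex case-insensitive.

2. `(?i)` is also a global ignore-case flag, but it must sit at the very start of the pattern.

3. `(?i:…)` is a scoped ignore-case group that affects only the part inside the parentheses.

4. `re.search(r'IloConnect(?i)…')` raises an error on Python 3.11 and later,
because a bare `(?i)` cannot appear in the middle of a pattern.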

## 1. The simplest example: cutting `IloConnectProcessStep` down to `iloconnect`

import re

text = "IloConnectProcessStep"
pattern = r'(?i)(process|step).*$'

result = re.sub(pattern, '', text).strip().lower()

print("Original:", text)
print("Pattern :", pattern)
print("Result  :", result)

Output:

Original: IloConnectProcessStep
Pattern : (?i)(process|step).*$
Result  : iloconnect

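### Explanation

`(?i)(process|step).*$` breaks down into three parts:

1. `(?i)`: ignore case.

2. `(process|step)`: match either keyword, `process` or `step`.

3. `.*$`: consume everything from that keyword through the end of the string.

So in `IloConnectProcessStep`,
the regex matches `ProcessStep`,
`re.sub(…, '', text)` deletes that whole span,
leaving `IloConnect`, which `.lower()` then turns into `iloconnect`.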
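
To see what the trailing `.*$` contributes, the sketch below runs the same substitution with and without it. The sample string is one of the OCR-style strings used later in this post; the variable names are only illustrative.

```python
import re

text = "IloConnectProcessteXYZ"

# Without `.*$`, only the keyword itself is removed and the tail survives.
keyword_only = re.sub(r'(?i)(process|step)', '', text)

# With `.*$`, everything from the keyword to the end of the string is removed.
keyword_to_end = re.sub(r'(?i)(process|step).*$', '', text)

print(keyword_only)    # IloConnectteXYZ
print(keyword_to_end)  # IloConnect
```

The leftover `teXYZ` in the first result is exactly the kind of OCR debris that the `.*$` suffix is there to discard.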

---

## 2. Seeing what the regex actually matched

`re.sub` alone does not show which span the regex matched.
The snippet below uses `re.search` to print the match directly:

import re

text = "IloConnectProcessStep"
pattern = r'(?i)(process|step).*$'

m = re.search(pattern, text)

print("Original:", text)
print("Pattern :", pattern)

if m:
    print("group(0):", m.group(0))
    print("group(1):", m.group(1))
    print("span()  :", m.span())
else:
    print("No match")

### Output

Original: IloConnectProcessStep
Pattern : (?i)(process|step).*$
group(0): ProcessStep
group(1): Process
span()  : (10, 21)

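### Explanation

1. `group(0)` is the entire matched text, here `ProcessStep`.

2. `group(1)` is the content of the first capture group `(process|step)`, here `Process`, the keyword that was hit first.

3. `span()` gives the start and end positions of the match: it starts at index 10 and ends at index 21 (the end index is exclusive).

This also confirms that the regex does not stop at the 7 letters of `Process`; it takes the whole `ProcessStep` tail.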

---

## 3. More cases: what different OCR strings get cut down to

This snippet is a good way to observe what "aggressive truncation" does in practice:

import re

pattern = r'(?i)(process|step).*$'

samples = [
    "IloConnectProcessStep",
    "IloConnectPROCESSSTEP",
    "IloConnectProcessteXYZ",
    "PswitchPcieCheckProcessSte",
    "PowerGoodCheck",
    "SysfanCheckProeces",
    "FirstStepInitializationProcess",
]

for s in samples:
    m = re.search(pattern, s)
    cut = re.sub(pattern, '', s).strip().lower()

    # If nothing is left after the cut, fall back to the lowercased original.
    if not cut:
        cut = s.lower()

    print("-" * 60)
    print("Original  :", s)
    print("match     :", m.group(0) if m else None)
    print("group(1)  :", m.group(1) if m else None)
    print("span      :", m.span() if m else None)
    print("Truncated :", cut)

### Key observations

1. `IloConnectProcessStep` -> `iloconnect`

2. `IloConnectPROCESSSTEP` -> `iloconnect`

3. `IloConnectProcessteXYZ` -> `iloconnect`

4. `PswitchPcieCheckProcessSte` -> `pswitchpciecheck`

5. `PowerGoodCheck` -> no match, so it stays `powergoodcheck`

6. `SysfanCheckProeces` -> `Proeces` is misspelled and does not contain a real `process`, so there is no match and it stays `sysfancheckproeces`

7. `FirstStepInitializationProcess` -> truncation starts at the first `Step`, so only `first` is left

This is why it is called "aggressive truncation": as soon as `process` or `step` appears, everything after it is cut off.
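
One edge case the samples above do not exercise is a keyword at the very start of the string, where the cut leaves nothing and the `if not cut` fallback takes over. A minimal sketch, with `cut_with_fallback` as a hypothetical helper name and made-up sample strings:

```python
import re

PATTERN = r'(?i)(process|step).*$'

def cut_with_fallback(name: str) -> str:
    """Truncate at the first keyword; keep the whole name if nothing is left."""
    cut = re.sub(PATTERN, '', name).strip().lower()
    # A keyword at index 0 wipes out the entire string, so fall back.
    return cut if cut else name.lower()

leading = cut_with_fallback("StepCheck")   # keyword at index 0
trailing = cut_with_fallback("CheckStep")  # keyword after a prefix

print(leading)   # stepcheck
print(trailing)  # check
```

Without the fallback, `StepCheck` would collapse to an empty string and there would be no prefix left to compare against anything.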

---

## 4. The basic equivalence of `flags=re.I` and `(?i)`

If you need the whole regex to ignore case, the two forms below produce the same result:

import re

text = "IloConnectPrOcEsSStep"

result1 = re.sub(r'(?i)(process|step).*$', '', text).strip().lower()
result2 = re.sub(r'(process|step).*$', '', text, flags=re.I).strip().lower()

print("Original   :", text)
print("(?i) form  :", result1)
print("flags=re.I :", result2)
print("Same result:", result1 == result2)

### Output

Original   : IloConnectPrOcEsSStep
(?i) form  : iloconnect
flags=re.I : iloconnect
Same result: True

### Conclusion

When the whole regex should ignore case:

- `(?i)` at the very start of the pattern

- `flags=re.I`

are generally equivalent.
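
One way to check this equivalence without running a match is to compile both forms and inspect the `flags` attribute of the resulting pattern objects, which reflects inline flags as well as flags passed to `re.compile`. A small sketch (the variable names are illustrative):

```python
import re

inline = re.compile(r'(?i)(process|step).*$')
external = re.compile(r'(process|step).*$', flags=re.I)

# Both compiled patterns should report IGNORECASE in their flags.
inline_ignores_case = bool(inline.flags & re.IGNORECASE)
external_ignores_case = bool(external.flags & re.IGNORECASE)

print(inline_ignores_case)    # True
print(external_ignores_case)  # True
```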

---

## 5. The real difference: scope

The difference is not whether case can be ignored, but which part of the pattern ignores it.

### 5-1 `flags=re.I`: the whole regex is relaxed

import re

sample = "iloconnectPRoCeSsStep"

m_flag = re.search(r'IloConnect(process|step).*$', sample, flags=re.I)

print("sample:", sample)
print("flags=re.I matched:", bool(m_flag))

### Output

sample: iloconnectPRoCeSsStep
flags=re.I matched: True

It matches because `flags=re.I` makes both the leading `IloConnect` and the trailing `process|step` case-insensitive.

---

### 5-2 `(?i:…)`: ignore case only locally

If what you want is:

1. The leading `IloConnect` stays case-sensitive

2. Only the trailing `process|step` ignores case

then use a scoped inline flag:

import re

sample1 = "IloConnectPRoCeSsStep"
sample2 = "iloconnectPRoCeSsStep"

pattern = r'IloConnect(?i:(process|step).*)$'

m1 = re.search(pattern, sample1)
m2 = re.search(pattern, sample2)

print("pattern :", pattern)
print("sample1 :", sample1, "->", bool(m1))
print("sample2 :", sample2, "->", bool(m2))

### Output

pattern : IloConnect(?i:(process|step).*)$
sample1 : IloConnectPRoCeSsStep -> True
sample2 : iloconnectPRoCeSsStep -> False

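### Explanation

`(?i:…)` affects only the part inside the parentheses, so:

- `IloConnect` is still case-sensitive

- only `(process|step).*` is case-insensitive

This is scoped control that `flags=re.I` alone cannot provide.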
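
The scoped syntax also works in the opposite direction: `(?-i:…)` switches ignore-case off for one group while the rest of the pattern keeps the global flag. This variant is not part of the original walkthrough; it is a sketch of an alternative way to get the same behavior as the pattern above.

```python
import re

# Ignore-case is on globally, but the prefix opts out via (?-i:...).
pattern = r'(?-i:IloConnect)(process|step).*$'

m1 = re.search(pattern, "IloConnectPRoCeSsStep", flags=re.I)
m2 = re.search(pattern, "iloconnectPRoCeSsStep", flags=re.I)

print(bool(m1))  # True
print(bool(m2))  # False
```

Which form reads better depends on which part of the pattern is the exception: the case-insensitive tail or the case-sensitive prefix.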

---

## 6. Why does `re.search(r'IloConnect(?i)…')` raise an error?

The following line raises an error on Python 3.11 and later:

import re

sample = "iloconnectPRoCeSsStep"

# Do not run this line; it raises an error
m_inline = re.search(r'IloConnect(?i)(process|step).*$', sample)

### Error message

error: global flags not at the start of the expression at position 10

### Cause

Python's `re` treats a bare `(?i)` as a global flag.

The rules for global flags are:

1. They must sit at the very start of the pattern

2. They cannot be inserted in the middle of the pattern

So:

### Valid

re.search(r'(?i)IloConnect(process|step).*$', sample)

### Invalid

re.search(r'IloConnect(?i)(process|step).*$', sample)

To turn on ignore-case locally in the middle of a pattern, the correct form is not `(?i)` but `(?i:…)`.
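
To observe the failure safely, the pattern can be compiled inside a `try` block. The sketch below catches `re.error` and then confirms that the scoped form compiles and matches. On Python 3.6 through 3.10 the mid-pattern `(?i)` only triggered a `DeprecationWarning`, so `bad_error` would stay `None` there.

```python
import re

bad = r'IloConnect(?i)(process|step).*$'    # global flag in the middle
good = r'IloConnect(?i:(process|step).*)$'  # scoped flag, allowed anywhere
sample = "IloConnectPRoCeSsStep"

try:
    re.compile(bad)
    bad_error = None
except re.error as exc:
    # Raised on Python 3.11 and later.
    bad_error = str(exc)

good_match = re.search(good, sample)

print("bad pattern error :", bad_error)
print("good pattern match:", bool(good_match))
```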

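## 7. A demo function that stays closest to `get_class_in_jsons_12.py`

First, the Level 0 logic in `get_class_in_jsons_12.py` in plain language:

1. Level 0 runs only when Level 1, Level 2, and Level 3 all fail to find a `correct_name`.

2. Its purpose is not to "auto-correct to the canonical answer" but to act as a final whitelist check that decides whether this OCR string is a new word that can safely be kept.

3. It first runs `re.sub(r'(?i)(process|step).*$', '', original_name)` on the OCR string.

4. That step removes everything from the first `process` or `step` through the end of the string, keeping only the distinguishing prefix.

5. The prefix is then lowercased and compared against the concatenation of the `highlights_tokens` parsed from the filename.

6. If the prefix appears in that title-highlight string, it is probably not OCR garbage but a legitimate class name in this document's context, so Level 0 lets it through and no further overwrite happens.

7. If even this check fails, Level 0 does not force it through, and the record stays in the "not corrected" state.

In other words, the core of Level 0 is not "correction" but:

> When every dictionary lookup fails, crudely chop off the tail, keep only the distinguishing prefix, and use that prefix for one last identity check against the document title.

The snippet below isolates the truncation step of this Level 0 logic so it is easy to test: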

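import re

def aggressive_cut(original_name: str) -> str:
    # Drop everything from the first "process" or "step" to the end of the string.
    ocr_root_aggressive = re.sub(
        r'(?i)(process|step).*$',
        '',
        original_name
    ).strip().lower()

    # If nothing is left, fall back to the lowercased original.
    if not ocr_root_aggressive:
        ocr_root_aggressive = original_name.lower()

    return ocr_root_aggressive


samples = [
    "IloConnectProcessStep",
    "IloConnectProcessteXYZ",
    "PswitchPcieCheckProcessSte",
    "SysfanCheckProeces",
]

for s in samples:
    print(f"{s:30} -> {aggressive_cut(s)}")

### Expected output

IloConnectProcessStep          -> iloconnect
IloConnectProcessteXYZ         -> iloconnect
PswitchPcieCheckProcessSte     -> pswitchpciecheck
SysfanCheckProeces             -> sysfancheckproeces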
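
The whitelist comparison described in steps 5 to 7 can be sketched as follows. This is not the code from `get_class_in_jsons_12.py`: the function name `level0_allows`, the way the tokens are joined (lowercased, no separator), and the sample token list are all assumptions made for illustration.

```python
import re

def level0_allows(original_name: str, highlights_tokens: list[str]) -> bool:
    """Hypothetical sketch: is the truncated prefix present in the title tokens?"""
    root = re.sub(r'(?i)(process|step).*$', '', original_name).strip().lower()
    if not root:
        root = original_name.lower()

    # Assumption: tokens are lowercased and concatenated without separators.
    title_key = "".join(token.lower() for token in highlights_tokens)
    return root in title_key

tokens = ["Ilo", "Connect", "Test"]  # made-up filename tokens

kept = level0_allows("IloConnectProcessteXYZ", tokens)
rejected = level0_allows("SysfanCheckProeces", tokens)

print(kept)      # True
print(rejected)  # False
```

A plain substring test is the simplest reading of "the prefix appears in the title-highlight string"; the real script may normalize or tokenize differently.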

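## 8. A practical way to remember it

The shortest version comes down to three things:

1. `flags=re.I`: the whole regex ignores case.

2. `(?i)`: also ignores case for the whole regex, but must sit at the very start of the pattern.

3. `(?i:…)`: only the part inside the parentheses ignores case.

Now look back at this line:

re.sub(r'(?i)(process|step).*$', '', original_name)

It is valid for a simple reason: `(?i)` sits at the very start of the pattern.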
