Python tutorial: tracing the parent chain of tree-structured headings with _collect_parent_chain to get the hierarchical path

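Why a parent chain is needed
A deeply nested heading in a large technical document or specification (for example "1.2.3.4 Error Handling") often loses the context of its enclosing chapters when it is extracted on its own. A parent chain restores the sequence of headings from the top level down to the current node, which provides:

Retrieval tags (to enrich a vector database or full-text index)
A leading context line for file names or chunk bodies (chain header)
Upper-level semantic context for RAG (retrieval-augmented generation)

Core function review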

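from typing import List

def _collect_parent_chain(headings: List[dict], h: dict) -> List[str]:
    chain = []
    curr = h
    while curr is not None:
        # Record this level's raw heading text (numbering kept).
        chain.append(curr.get('raw', '').strip())
        # Move to the parent; a parent_index of None marks the root.
        pidx = curr.get('parent_index')
        curr = headings[pidx] if pidx is not None else None
    # Collected bottom-up, so reverse into top-to-bottom order.
    return list(reversed(chain))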

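How the logic works

Start at the current node h and walk upward one level at a time by following parent_index.
At each level, append the raw heading text raw (numbering kept) to chain.
Stop when parent_index is None, which means the root has been reached.
Finally, reversed flips the list into top-to-bottom order.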

A simplified headings data structure

Suppose six heading nodes have already been parsed from the document (using a stack-based algorithm):

headings = [
    {'raw': 'Document Overview',  'level': 1, 'parent_index': None},  # index 0
    {'raw': '1. System Basics',   'level': 2, 'parent_index': 0},     # index 1
    {'raw': '1.1 Architecture',   'level': 3, 'parent_index': 1},     # index 2
    {'raw': '1.1.1 Components',   'level': 4, 'parent_index': 2},     # index 3
    {'raw': '1.2 Data Flow',      'level': 3, 'parent_index': 1},     # index 4
    {'raw': '2. Advanced Topics', 'level': 2, 'parent_index': 0},     # index 5
]
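
The parent_index values in this list can be derived from the heading levels alone. The following is a minimal sketch of one stack-based way to do that, assuming the headings arrive in document order and each already carries an integer 'level'. The helper name assign_parent_indices is hypothetical and is not part of the original code; the actual parser may work differently.

```python
from typing import List, Optional


def assign_parent_indices(headings: List[dict]) -> List[dict]:
    """Fill in 'parent_index' for each heading based on its 'level'.

    Hypothetical helper: assumes headings are in document order and each
    dict has an integer 'level' (1 = top level).
    """
    stack: List[int] = []  # indices of the currently open ancestors
    for i, node in enumerate(headings):
        # Close every open heading that is at the same or a deeper level.
        while stack and headings[stack[-1]]['level'] >= node['level']:
            stack.pop()
        parent: Optional[int] = stack[-1] if stack else None
        node['parent_index'] = parent
        stack.append(i)
    return headings


sample = [
    {'raw': 'Document Overview', 'level': 1},
    {'raw': '1. System Basics', 'level': 2},
    {'raw': '1.1 Architecture', 'level': 3},
    {'raw': '1.1.1 Components', 'level': 4},
    {'raw': '1.2 Data Flow', 'level': 3},
    {'raw': '2. Advanced Topics', 'level': 2},
]
assign_parent_indices(sample)
print([n['parent_index'] for n in sample])
# [None, 0, 1, 2, 1, 0]
```

The stack only ever holds the ancestors of the heading being processed, so each heading is pushed and popped at most once.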

Getting the parent chain of a deeply nested heading

target = headings[3]  # '1.1.1 Components'
chain = _collect_parent_chain(headings, target)
print(chain)

Output:

['Document Overview', '1. System Basics', '1.1 Architecture', '1.1.1 Components']

To assemble a chain header string:
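
One possible way is to join the chain with a separator. This is a minimal sketch that assumes ' > ' is an acceptable separator; the original does not specify one, so treat it as a placeholder.

```python
# Parent chain as returned for '1.1.1 Components' in the sample data.
chain = ['Document Overview', '1. System Basics',
         '1.1 Architecture', '1.1.1 Components']

# ' > ' is an assumed separator; pick one that cannot appear in a title.
chain_header = ' > '.join(chain)
print(chain_header)
# Document Overview > 1. System Basics > 1.1 Architecture > 1.1.1 Components
```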

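Wrapper example: adding cleanup and a number-stripped variant

Sometimes you want both the raw parent chain and a semantic parent chain with the leading numbering removed: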

import re

def collect_parent_chain(headings, h, strip_numbers=False):
    chain = []
    curr = h
    while curr is not None:
        raw = curr.get('raw', '').strip()
        if strip_numbers:
            # Strip leading numbering (supports multi-part forms such as 1.2.3)
            raw = re.sub(r'^(\d+(?:\.\d+)*)[\s.、]+', '', raw)
        chain.append(raw)
        pidx = curr.get('parent_index')
        curr = headings[pidx] if pidx is not None else None
    return list(reversed(chain))

# Usage
raw_chain  = collect_parent_chain(headings, headings[3], strip_numbers=False)
clean_chain = collect_parent_chain(headings, headings[3], strip_numbers=True)
print(raw_chain)
print(clean_chain)

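Output:

['Document Overview', '1. System Basics', '1.1 Architecture', '1.1.1 Components']
['Document Overview', 'System Basics', 'Architecture', 'Components']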

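Common needs and variants

Errors and safeguards

parent_index out of range: check that 0 <= pidx < len(headings).
Empty raw: substitute 'untitled'.
Cyclic chain (should not happen in theory): add a counter or a set to prevent an infinite loop.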

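seen = set()
while curr is not None:
    # Assumes each node stores its own position under 'heading_index';
    # the simplified sample above omits that field, so add it or use id(curr).
    idx = curr.get('heading_index')
    if idx in seen:
        break  # already visited: the chain loops back on itself
    seen.add(idx)
    # ... append the heading text and move to the parent as before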

File names that grow too long because of many nesting levels: truncate each segment before joining the slugs.
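
The safeguards listed above can be combined into one function. The following is a sketch only: the name collect_parent_chain_safe, the max_segment_len parameter and its default of 40 are assumptions for illustration, and it detects cycles with id() so it does not depend on a 'heading_index' field.

```python
from typing import List


def collect_parent_chain_safe(headings: List[dict], h: dict,
                              max_segment_len: int = 40) -> List[str]:
    """Parent chain with the guards listed above (hypothetical helper)."""
    chain: List[str] = []
    seen = set()  # id() of every node already visited
    curr = h
    while curr is not None:
        if id(curr) in seen:
            break  # cycle detected: stop instead of looping forever
        seen.add(id(curr))

        # Fall back to 'untitled' when raw is missing or blank.
        text = (curr.get('raw') or '').strip() or 'untitled'
        chain.append(text[:max_segment_len])  # cap each segment's length

        pidx = curr.get('parent_index')
        if pidx is None or not (0 <= pidx < len(headings)):
            break  # root reached, or parent_index out of range
        curr = headings[pidx]
    return list(reversed(chain))


broken = [
    {'raw': 'A', 'parent_index': 1},     # A and B point at each other
    {'raw': 'B', 'parent_index': 0},
    {'raw': '   ', 'parent_index': 99},  # blank title, invalid parent
]
print(collect_parent_chain_safe(broken, broken[0]))  # ['B', 'A']
print(collect_parent_chain_safe(broken, broken[2]))  # ['untitled']
```

Treating an out-of-range parent_index as the end of the chain is one design choice; raising an exception instead may be preferable when malformed input should be reported rather than tolerated.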

Small exercise: write a parent-chain function by hand

Try writing a version that does not use reversed:

def collect_parent_chain_manual(headings, h):
    stack = []
    curr = h
    while curr is not None:
        stack.insert(0, curr.get('raw', '').strip())  # insert at the front directly
        pidx = curr.get('parent_index')
        curr = headings[pidx] if pidx is not None else None
    return stack

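Strategies for integrating with RAG and file chunking

The parent chain text can be placed in the first paragraph of each sub-document to serve as a context preamble.
You can store all of the following together:
- chain_raw: the list of original headings
- chain_clean: the headings after numbering is stripped and text is cleaned
- chain_slug: a form suited to indexes and file names
Putting chain_slug into the metadata JSON can speed up later queries (for example, finding every chunk that belongs to the Architecture block).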
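
As an illustration of the three fields, here is a minimal sketch that builds them from a raw chain and serializes the result. The helpers slugify and build_chunk_metadata, the '/' separator, and the slug rule itself are assumptions made for this example; the slug rule only handles ASCII titles.

```python
import json
import re
from typing import List


def slugify(text: str) -> str:
    """Hypothetical slug rule: lowercase, non-alphanumerics become '-'."""
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')


def build_chunk_metadata(chain_raw: List[str]) -> dict:
    # Strip leading numbering such as '1.' or '1.2.3' from each title.
    chain_clean = [re.sub(r'^\d+(?:\.\d+)*[\s.、]+', '', t) for t in chain_raw]
    # '/' is an assumed separator between slug segments.
    chain_slug = '/'.join(slugify(t) for t in chain_clean)
    return {
        'chain_raw': chain_raw,
        'chain_clean': chain_clean,
        'chain_slug': chain_slug,
    }


chain_raw = ['Document Overview', '1. System Basics',
             '1.1 Architecture', '1.1.1 Components']
meta = build_chunk_metadata(chain_raw)
print(json.dumps(meta, indent=2))

# Example query: does this chunk belong to the Architecture block?
print('architecture' in meta['chain_slug'].split('/'))  # True
```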

Compact reusable version

import re

def build_parent_chain(headings, node, *, clean=False, keep_number=True):
    chain = []
    cur = node
    while cur is not None:
        text = cur.get('raw', '').strip()
        if clean:
            # Basic cleanup: collapse internal whitespace runs to one space.
            text = re.sub(r'\s+', ' ', text)
        if not keep_number:
            # Strip leading numbering such as '1.' or '1.2.3'.
            text = re.sub(r'^(\d+(?:\.\d+)*)[\s.、]+', '', text)
        chain.append(text)
        pidx = cur.get('parent_index')
        cur = headings[pidx] if pidx is not None else None
    return list(reversed(chain))