Hands-on pandas df.assign tutorial in Python: build a PDF, extract text and tables with iterators, then add columns

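1. First, generate a demo PDF from Jupyter code.

2. The PDF has at least 2 pages, and each page has at least 2 tables.

3. Files are written to `D:\Temp`.

4. Next, use iterators to extract "text" and "all tables" as two separate streams.

5. Finally, use `df.assign(…)` to demonstrate metadata and derived columns.

---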

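## 0. Installing packages

pip install PyMuPDF pandas openpyxl

> The install name is `PyMuPDF`, but the import name is `fitz`. Recent PyMuPDF releases also accept `import pymupdf`; this tutorial uses `fitz` throughout. `openpyxl` is required by the Excel export steps, which use `engine="openpyxl"`.

---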

## 1. Build the demo PDF first (2 pages, 2 tables per page)

The following can be pasted into a single Jupyter cell and run as is:

import os
import fitz

output_pdf = r"D:\Temp\fitz_df_assign_demo.pdf"
os.makedirs(os.path.dirname(output_pdf), exist_ok=True)


def draw_table(page: fitz.Page, left: float, top: float, headers: list[str], data: list[list[str]], row_h: float = 24):
    """Draw a simple ruled table, fill in the text, and return its bbox."""
    col_count = len(headers)
    col_w = [120] * col_count

    # List of x coordinates; e.g. 3 columns give 4 vertical-line positions
    xs = [left]
    for w in col_w:
        xs.append(xs[-1] + w)

    # List of y coordinates (header + data rows)
    row_count = 1 + len(data)
    ys = [top + i * row_h for i in range(row_count + 1)]

    # Draw the vertical lines
    for x in xs:
        page.draw_line((x, ys[0]), (x, ys[-1]), width=1)

    # Draw the horizontal lines
    for y in ys:
        page.draw_line((xs[0], y), (xs[-1], y), width=1)

    # Write the header row
    for c, header in enumerate(headers):
        page.insert_text((xs[c] + 6, ys[0] + 16), header, fontsize=10, fontname="Helvetica-Bold")

    # Write the body rows
    for r, row in enumerate(data, start=1):
        for c, val in enumerate(row):
            page.insert_text((xs[c] + 6, ys[r] + 16), str(val), fontsize=10, fontname="Helvetica")

    return (xs[0], ys[0], xs[-1], ys[-1])


doc = fitz.open()

for page_no in [1, 2]:
    page = doc.new_page(width=595, height=842)

    # Page header text
    page.insert_text((40, 40), f"Demo Report - Page {page_no}", fontsize=16, fontname="Helvetica-Bold")
    page.insert_text((40, 64), "This page includes multiple tables for iterator demo.", fontsize=11, fontname="Helvetica")

    # First table
    draw_table(
        page=page,
        left=40,
        top=100,
        headers=["Item", "Class", "Capacity"],
        data=[
            [f"P{page_no}-A1", "Module", "16GB"],
            [f"P{page_no}-A2", "RDIMM", "32GB"],
            [f"P{page_no}-A3", "UDIMM", "64GB"],
        ],
    )

    # Second table
    draw_table(
        page=page,
        left=40,
        top=280,
        headers=["TestID", "Status", "Duration"],
        data=[
            [f"P{page_no}-T1", "PASS", "12.3s"],
            [f"P{page_no}-T2", "FAIL", "9.8s"],
            [f"P{page_no}-T3", "PASS", "11.1s"],
        ],
    )

    # Plain text placed here on purpose, so the later text stream and table stream are visibly separate
    page.insert_text((40, 520), f"Footer note on page {page_no}: mixed content extraction demo.", fontsize=10, fontname="Times-Roman")

doc.save(output_pdf)
doc.close()

print(output_pdf)

After running it you will have:

- `D:\Temp\fitz_df_assign_demo.pdf`

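## 2. Splitting streams with iterators: extract text + extract all tables

Two iterators are used here:

- `iter_text_lines_from_page(…)`: yields the text lines of a single page

- `iter_tables_from_page(…)`: yields all tables of a single page

Note: these iterators are designed to start at the `page` level, not at the `doc` level.

In other words, `doc` is only responsible for the outer page-by-page iteration (orchestration); the actual extraction logic lives entirely inside the per-page iterators.

Why this design:

- `text` and `tables` are both naturally page-level data, so using the page as the boundary is the least likely to mix content across pages.

- Metadata such as `page_no` and `table_id` can be added in one place in the outer loop, which avoids passing it through several functions.

- The streaming nature of iterators is preserved, so the whole `doc` never has to be expanded into memory up front.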
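The page-first design can be sketched without PyMuPDF. In the minimal sketch below, `fake_doc`, `iter_lines_from_page`, and `iter_records` are hypothetical stand-ins (plain lists of strings play the role of pages). The point is only that the inner generator knows nothing about page numbers, while the outer generator attaches them in one place.

```python
from typing import Any, Iterator

# Hypothetical stand-in for a document: each "page" is just a list of lines.
fake_doc = [
    ["Demo Report - Page 1", "Footer note on page 1"],
    ["Demo Report - Page 2", "Footer note on page 2"],
]


def iter_lines_from_page(page: list[str]) -> Iterator[dict[str, Any]]:
    """Page-level generator: yields one record per line, with no page number."""
    for line_no, text in enumerate(page):
        yield {"line_no": line_no, "text": text}


def iter_records(doc: list[list[str]]) -> Iterator[dict[str, Any]]:
    """Outer orchestration: walks the pages and attaches page metadata."""
    for page_no, page in enumerate(doc, start=1):
        for record in iter_lines_from_page(page):
            yield {"page": page_no, **record}


records = list(iter_records(fake_doc))
print(records[0])
```

Because both layers are generators, nothing is read from a page until the consumer asks for the next record.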

Before writing the iterators, take a clear look at the `text_dict` structure:

text_dict = page.get_text("dict")  # type(page) is pymupdf.Page
    -> blocks :list[dict]
        -> block :dict #dict_keys(['number', 'type', 'bbox', 'lines'])

            -> lines :list[dict]
                -> line :dict #dict_keys(['spans', 'wmode', 'dir', 'bbox'])

                    -> spans :list[dict]
                        -> span :dict #dict_keys(['size', 'flags', 'bidi', 'char_flags', 'font', 'color', 'alpha', 'ascender', 'descender', 'text', 'origin', 'bbox'])

For comparison: the `text` iterator walks the nested structure above, while the `tables` iterator uses the result of `page.find_tables()` directly and does not follow the `text_dict -> blocks -> lines -> spans` path.
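To make the nesting concrete, here is a small sketch that walks a hand-written dictionary shaped like the output of `page.get_text("dict")`. The sample values and the helper name `lines_of` are made up for illustration, and real output carries many more keys per block, line, and span.

```python
from typing import Any

# Hand-written sample that imitates the shape of page.get_text("dict").
# The values are invented for illustration only.
sample_text_dict: dict[str, Any] = {
    "blocks": [
        {
            "type": 0,  # 0 = text block
            "lines": [
                {
                    "spans": [
                        {"text": "Demo ", "font": "Helvetica-Bold"},
                        {"text": "Report", "font": "Helvetica"},
                    ]
                },
            ],
        },
        {"type": 1},  # 1 = image block, which has no "lines"
    ]
}


def lines_of(text_dict: dict[str, Any]) -> list[str]:
    """Join the span texts of every line in every text block."""
    out: list[str] = []
    for block in text_dict.get("blocks", []):
        if block.get("type") != 0:
            continue  # skip non-text blocks
        for line in block.get("lines", []):
            spans = line.get("spans", [])
            out.append("".join(span.get("text", "") for span in spans))
    return out


lines = lines_of(sample_text_dict)
print(lines)
```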

import fitz
import pandas as pd
from typing import Iterator, Any

pdf_path = r"D:\Temp\fitz_df_assign_demo.pdf"


def iter_text_lines_from_page(page: fitz.Page) -> Iterator[dict[str, Any]]:
    """Process a single page only and yield line-level text records."""
    text_dict = page.get_text("dict")

    for block_no, block in enumerate(text_dict.get("blocks", [])):
        if block.get("type") != 0:
            continue

        for line_no, line in enumerate(block.get("lines", [])):
            spans = line.get("spans", [])
            if not spans:
                continue

            text_parts: list[str] = []
            main_span: dict[str, Any] | None = None
            main_len = -1

            # One loop both concatenates the text and picks main_span (the span with the longest text)
            for sp in spans:
                sp_text = sp.get("text") or ""
                text_parts.append(sp_text)

                sp_len = len(sp_text)
                if sp_len > main_len:
                    main_len = sp_len
                    main_span = sp

            text = "".join(text_parts)
            if not text.strip() or main_span is None:
                continue

            yield {
                "block_no": block_no,
                "line_no": line_no,
                "text": text,
                "font_main": main_span.get("font") or "",
                "flags_main": main_span.get("flags", 0),
            }


def iter_tables_from_page(page: fitz.Page) -> Iterator[dict[str, Any]]:
    """Process a single page only and yield basic info plus a DataFrame for each table on it."""
    tables = page.find_tables().tables
    for table_no, table in enumerate(tables, start=1):
        table_cell_matrix: list[list[str]] = table.extract()
        if not table_cell_matrix:
            continue

        # Convert the 2D list straight into a DataFrame: column labels default to 0, 1, 2, 3, ...
        table_df = pd.DataFrame(table_cell_matrix)

        yield {
            "table_no": table_no,
            "bbox": table.bbox,
            "table_df": table_df,
        }


doc = fitz.open(pdf_path)

# Text stream and table stream (processed page by page, without expanding the whole document first)
text_records = []
all_table_frames = []
tables_index_records = []

# The outer loop only orchestrates pages: it drives both per-page iterators (tables / text)
for page_no, page in enumerate(doc, start=1):
    # First handle every table on this page
    for table_record in iter_tables_from_page(page):
        table_df = table_record["table_df"]
        table_no = table_record["table_no"]
        table_id = f"p{page_no}_t{table_no}"

        # Capture the extracted shape before metadata columns are added,
        # so col_count describes the table itself
        row_count, col_count = table_df.shape

        # # assign variant: returns a new DataFrame that carries the metadata
        # table_df = table_df.assign(
        #     page=page_no,
        #     table_no=table_no,
        #     table_id=table_id,
        # )

        # Direct assignment adds the metadata in place; for constant columns like these the result matches assign
        table_df["page"] = page_no
        table_df["table_no"] = table_no
        table_df["table_id"] = table_id

        all_table_frames.append(table_df)

        # One index record per table, so table-level info is not repeated on every data row
        tables_index_records.append({
            "page": page_no,
            "table_no": table_no,
            "table_id": table_id,
            "bbox": table_record["bbox"],
            "row_count": row_count,
            "col_count": col_count,
        })

    # Then handle the text lines on this page
    for line_record in iter_text_lines_from_page(page):
        # These are still dicts, not a DataFrame, so a dict merge adds the page number.
        # They are converted to raw_text_df in one step after collection finishes.
        text_records.append({"page": page_no, **line_record})

doc.close()

# After all pages are processed, convert list[dict] / list[df] into DataFrames
raw_text_df = pd.DataFrame(text_records)

# Combine all tables.
# A vertical pandas concat does not raise on DataFrames with different column counts:
# it aligns on the union of columns and fills the missing positions with NaN.
all_tables_df = pd.concat(all_table_frames, ignore_index=True) if all_table_frames else pd.DataFrame()
tables_index_df = pd.DataFrame(tables_index_records)

# Export to Excel: one sheet for text, one sheet for tables (plus a table index)
excel_output_path = r"D:\Temp\fitz_df_assign_demo_output.xlsx"
with pd.ExcelWriter(excel_output_path, engine="openpyxl") as writer:
    # Line-level text data
    raw_text_df.to_excel(writer, sheet_name="text", index=False)
    # Data rows of every table (already carrying page/table_no/table_id)
    all_tables_df.to_excel(writer, sheet_name="tables", index=False)
    # Index/audit info, one row per table
    tables_index_df.to_excel(writer, sheet_name="tables_index", index=False)

print("raw_text_df shape:", raw_text_df.shape)
print("tables_index_df shape:", tables_index_df.shape)
print("all_tables_df shape:", all_tables_df.shape)
print("excel_output_path:", excel_output_path)

raw_text_df.head(10), tables_index_df, all_tables_df.head(10)
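The `pd.concat` step in the cell above relies on column-union alignment. A minimal sketch with two toy frames (`narrow` and `wide` are hypothetical names, not part of the tutorial data) shows what happens when the stacked tables have different column counts:

```python
import pandas as pd

# Two toy frames with different column counts (default integer column labels).
narrow = pd.DataFrame([["a", "b"]])        # columns 0, 1
wide = pd.DataFrame([["c", "d", "e"]])     # columns 0, 1, 2

# Vertical concat: columns are aligned on their union, gaps become NaN.
stacked = pd.concat([narrow, wide], ignore_index=True)

print(stacked.shape)
print(stacked)
```

The row that came from `narrow` has no value for column `2`, so that cell is filled with NaN rather than raising an error.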

[Screenshot: `fitz_df_assign_demo_output.xlsx`, sheet `tables` (`all_tables_df`)]

[Screenshot: `fitz_df_assign_demo_output.xlsx`, sheet `text`]

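## 3. df.assign: the key points

The key ideas behind `assign`:

- It returns a new DataFrame and does not modify the original in place.

- Within a single `assign` call, later columns can reference columns created earlier in the same call.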
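Both points can be checked on a toy frame. This is a minimal sketch: `base` and its `duration` column are hypothetical sample data, unrelated to the tables extracted earlier.

```python
import pandas as pd

base = pd.DataFrame({"duration": ["12.3s", "9.8s"]})

result = base.assign(
    # Strip the trailing "s" and convert the remainder to a number.
    duration_sec=lambda x: pd.to_numeric(x["duration"].str.rstrip("s")),
    # This lambda receives a frame that already contains duration_sec.
    duration_ms=lambda x: x["duration_sec"] * 1000,
)

print(list(base.columns))    # the original frame keeps only its own column
print(list(result.columns))  # the new frame carries both derived columns
```

Keyword arguments are evaluated in order, which is why `duration_ms` can be computed from `duration_sec` inside the same call.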

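### 3.1 Constant metadata columns

The two styles are placed side by side here for comparison. When you are simply adding a constant column, `assign` and direct assignment give the same result; the only difference is coding style.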

demo_df = all_tables_df.assign(source="fitz.find_tables")
demo_df2 = all_tables_df.copy()
demo_df2["source"] = "fitz.find_tables"

# Compare the two approaches; print so the result shows even though it is not the last expression
print(demo_df.equals(demo_df2))
demo_df.head(5)

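### 3.2 Building dependent columns in a single assign call

# Use the third column (column label 2) to demonstrate string-to-number conversion and derived columns.
# Rows whose value is not a duration (header rows, capacities such as "16GB") become NaN via errors="coerce".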
enriched_df = all_tables_df.assign(
    duration_sec=lambda x: pd.to_numeric(x[2].str.replace("s", "", regex=False), errors="coerce"),
    duration_ms=lambda x: x["duration_sec"] * 1000,
    row_key=lambda x: x["table_id"].astype(str) + "_r" + x.index.astype(str),
)

enriched_df.head(10)

all_tables_df.columns

Index([0, 1, 2, 'page', 'table_no', 'table_id'], dtype='object')

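You can observe that:

- `enriched_df` has three more columns than `all_tables_df`: `duration_sec`, `duration_ms`, and `row_key`.

- `duration_ms` can directly reference `duration_sec`, which was created earlier in the same `assign` call.

- `row_key` combines `table_id` with the row index into a traceable key.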
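One detail worth noting: after `pd.concat(..., ignore_index=True)` the index runs continuously across all tables, so the `_r` suffix in `row_key` does not restart at 0 for each table. If a per-table counter is preferred, one possible variant (an assumption, not part of the workflow above) is `groupby(...).cumcount()`. The toy frame and the `row_in_table` column below are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "table_id": ["p1_t1", "p1_t1", "p1_t2", "p1_t2"],
    "value": ["a", "b", "c", "d"],
})

keyed = toy.assign(
    # Key built from the running index, as in the section above.
    row_key=lambda x: x["table_id"] + "_r" + x.index.astype(str),
    # Hypothetical variant: a counter that restarts at 0 inside each table_id.
    row_in_table=lambda x: x.groupby("table_id").cumcount(),
)

print(keyed)
```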

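## 4. A more practical way to organize the output

In practice you would usually write a single xlsx file that contains three sheets:

1. `raw_text_df`: the main text stream (line level).

2. `tables_index_df`: the table index (one row per table, including `bbox` and `table_id`).

3. `all_tables_df` or `enriched_df`: all table row data (with metadata).

This keeps the text stream and the table stream separate while still letting them be related through `table_id` and `page`, and it maps naturally onto one Excel workbook with three sheets.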
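Relating the sheets through `table_id` can be sketched as a merge. The frames `index_df` and `rows_df` below are toy stand-ins for `tables_index_df` and `all_tables_df`, with made-up values.

```python
import pandas as pd

# Toy stand-in for tables_index_df: one row per table.
index_df = pd.DataFrame({
    "table_id": ["p1_t1", "p1_t2"],
    "page": [1, 1],
    "row_count": [2, 1],
})

# Toy stand-in for all_tables_df: one row per table row.
rows_df = pd.DataFrame({
    "table_id": ["p1_t1", "p1_t1", "p1_t2"],
    0: ["Item", "P1-A1", "TestID"],
})

# Attach table-level facts to each data row via the shared key.
# validate="many_to_one" raises if table_id is not unique in index_df.
joined = rows_df.merge(index_df, on="table_id", how="left", validate="many_to_one")

print(joined)
```

Keeping table-level facts in the index sheet and joining on demand avoids repeating them on every data row.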

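## 5. Human-friendly version: vertical concat + separator rows (no gray shading)

If you want different `table_id` blocks to be easier to read within the same vertically stacked data, insert a `separator` row before each table.

The code below treats `all_tables_view` (written to the `tables_view` sheet) as the single source:

- For people: keep the separator rows

- For programs: filter with `row_type == "data"`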

# %%
view_parts = []

for table_df in all_table_frames:
    if table_df.empty:
        continue

    d = table_df.copy()
    # Use iloc to take the first row by position, avoiding the ambiguity/risk of d["col"][0] when the index is not 0, 1, 2, ...
    table_id = str(d["table_id"].iloc[0])
    page_no = d["page"].iloc[0]
    table_no = d["table_no"].iloc[0]

    # Separator row: lets a reader quickly spot each table_id block
    sep = pd.DataFrame([{
        "row_type": "separator",
        "table_id": f"----- {table_id} -----",
        "page": page_no,
        "table_no": table_no,
        "display_label": f"Page {int(page_no)} / Table {int(table_no)} / {table_id}",
    }])

    view_parts.append(sep)
    view_parts.append(d.assign(row_type="data", display_label=""))

all_tables_view = pd.concat(view_parts, ignore_index=True) if view_parts else pd.DataFrame()

# For programmatic use, filter down to the pure data rows
# This lets all_tables_view serve both human reading and program input (single source)
tables_data_only = all_tables_view.query("row_type == 'data'").copy()

pretty_output_path = r"D:\Temp\fitz_df_assign_demo_pretty.xlsx"
with pd.ExcelWriter(pretty_output_path, engine="openpyxl") as writer:
    all_tables_view.to_excel(writer, sheet_name="tables_view", index=False)

print("pretty_output_path:", pretty_output_path)
print("tables_view shape:", all_tables_view.shape)
print("tables_data_only shape:", tables_data_only.shape)