Crawl Links Playwright 🕸️

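An advanced web crawler built on Playwright. It supports:

✅ Real browser rendering: correctly captures links and content generated dynamically by JavaScript
🌐 Same-domain depth control: only follows pages on the same domain, but records every link it finds
📊 Progress bar: shows crawl progress with tqdm
📝 Content extraction and merging: outputs the plain text of all crawled pages for text analysis
🔗 Canonical deduplication: avoids duplicate page content
✂️ NotebookLM splitting: splits large files by word count to fit NotebookLM's per-file limit (200,000 words)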
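
The same-domain depth control described above can be sketched as a breadth-first traversal. This is a minimal illustration, not the tool's actual code: `crawl`, `same_domain`, and the injected `fetch_links` callable are hypothetical names, and "same domain" is assumed here to mean an exact hostname match. In the real tool, the link extraction would be done by a Playwright-rendered page.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urlparse


def same_domain(url: str, start_url: str) -> bool:
    """Assumption: two URLs share a domain when their hostnames match exactly."""
    return urlparse(url).hostname == urlparse(start_url).hostname


def crawl(start_url: str, max_depth: int,
          fetch_links: Callable[[str], Iterable[str]]) -> list[dict]:
    """Record every discovered link, but only expand same-domain pages."""
    rows = [{"url": start_url, "depth": 0, "source": "", "same_domain": True}]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth limit are recorded but not expanded
        for link in fetch_links(url):
            if link in seen:
                continue
            seen.add(link)
            internal = same_domain(link, start_url)
            rows.append({"url": link, "depth": depth + 1,
                         "source": url, "same_domain": internal})
            if internal:
                queue.append((link, depth + 1))  # external links are never queued
    return rows
```

Passing the link extractor in as a callable keeps the traversal logic testable without launching a browser.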

📦 Installation

git clone https://github.com/<your-account>/crawl-links-playwright.git
cd crawl-links-playwright

pip install -r requirements.txt
python -m playwright install chromium

Example requirements.txt

playwright
tqdm
requests
beautifulsoup4

🚀 Usage

Basic crawl

python crawl_links_playwright.py \
  --start_url https://example.com \
  --max_depth 2 \
  --output links.csv --headless

Extract and merge text content

python crawl_links_playwright.py \
  --start_url https://example.com \
  --max_depth 2 \
  --output links.csv \
  --dump_text combined_content.txt \
  --headless

NotebookLM splitting

python crawl_links_playwright.py \
  --start_url https://example.com \
  --max_depth 2 \
  --output links.csv \
  --dump_text combined_content.txt \
  --split_for_notebooklm \
  --split_word_limit 150000 \
  --headless

Output:

combined_content_part1.txt
combined_content_part2.txt
...
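
One way the splitting step could work is sketched below. It is an assumption-laden illustration rather than the tool's implementation: `split_by_words` and `write_parts` are hypothetical helpers, and "words" are counted as whitespace-delimited tokens, which may differ from how the tool counts (especially for Chinese text). Only the `_partN` file naming and the 150000 default come from this README.

```python
from pathlib import Path


def split_by_words(text: str, word_limit: int) -> list[str]:
    """Group whole lines into chunks of at most word_limit words each.

    A single line longer than word_limit is kept intact in its own chunk,
    so that chunk can exceed the limit.
    """
    chunks, current, count = [], [], 0
    for line in text.splitlines(keepends=True):
        n = len(line.split())
        if current and count + n > word_limit:
            chunks.append("".join(current))
            current, count = [], 0
        current.append(line)
        count += n
    if current:
        chunks.append("".join(current))
    return chunks


def write_parts(text: str, base_path: str, word_limit: int = 150_000) -> list[Path]:
    """Write chunks next to base_path as <stem>_part1<suffix>, <stem>_part2<suffix>, ..."""
    base = Path(base_path)
    paths = []
    for n, chunk in enumerate(split_by_words(text, word_limit), start=1):
        part = base.with_name(f"{base.stem}_part{n}{base.suffix}")
        part.write_text(chunk, encoding="utf-8")
        paths.append(part)
    return paths
```

Splitting on line boundaries keeps each page's text readable in the parts, at the cost of chunks that are slightly under the limit.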

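⚙️ Main parameters

| Parameter | Description |
| --- | --- |
| `--start_url` | Start URL (required) |
| `--max_depth` | Maximum crawl depth (required) |
| `--output` | Output CSV file name (default: links.csv) |
| `--delay` | Delay between page requests, in seconds (default: 0.5) |
| `--wait_until` | Page event to wait for (load/domcontentloaded/networkidle/commit; default: networkidle) |
| `--headless` | Run the browser in headless mode |
| `--user_agent` | Custom User-Agent |
| `--accept_language` | Accept-Language setting (default: en-US,en;q=0.9) |
| `--dump_text` | File path for the merged text output |
| `--text_delay` | Delay before each text extraction, in seconds (default: 1.0) |
| `--dump_same_domain_only` | Only output text content from same-domain pages |
| `--canonical_dedup` / `--no_canonical_dedup` | Turn canonical deduplication on or off (default: on) |
| `--split_for_notebooklm` | Enable NotebookLM splitting |
| `--split_word_limit` | Maximum number of words per file (default: 150000) |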
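
To make the `--canonical_dedup` switch more concrete, here is a rough sketch of how deduplication by canonical URL might work. The names `canonical_key` and `dedup_pages` are hypothetical, and the normalization rules (lowercasing scheme and host, dropping the fragment and a trailing slash) are assumptions chosen for illustration; the tool itself may compare canonical URLs differently.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_key(url: str) -> str:
    """Normalize a URL for comparison (assumed rules, see lead-in)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


def dedup_pages(pages, canonical_dedup: bool = True) -> list:
    """Keep the first page seen for each canonical URL.

    pages: iterable of (url, canonical_or_None, text) tuples.
    Pages without a canonical URL fall back to their own URL.
    """
    if not canonical_dedup:
        return list(pages)
    seen, kept = set(), []
    for url, canonical, text in pages:
        key = canonical_key(canonical or url)
        if key in seen:
            continue  # same canonical target already written
        seen.add(key)
        kept.append((url, canonical, text))
    return kept
```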

📊 Output formats

CSV (`links.csv`)

| url | depth | source | same_domain |
| --- | --- | --- | --- |
| https://example.com | 0 |  | True |
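
For downstream use, the link list can be loaded with Python's standard `csv` module. The sketch below assumes the file is comma-delimited with a header row matching the columns above, and that `same_domain` is written as the text `True` or `False`; `load_links` is a hypothetical helper name.

```python
import csv


def load_links(path: str) -> list[dict]:
    """Read links.csv into dicts with typed depth and same_domain fields."""
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row["depth"] = int(row["depth"])
            # Assumption: booleans are serialized as "True" / "False".
            row["same_domain"] = row["same_domain"] == "True"
            rows.append(row)
    return rows
```

For example, `[r["url"] for r in load_links("links.csv") if r["same_domain"]]` would list only the internal pages.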

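Merged text (`combined_content.txt`)

Each page is written as a header block followed by its text (the `來源` label means "Source"):

---
來源: <https://example.com/page1>
Canonical: <https://example.com/page1>
---

[plain text of the page]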

==================================================
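
If the merged file needs to be broken back into per-page records (for example before loading into a vector database), a parser along these lines could work. It is a sketch based only on the sample layout shown here: `parse_combined` is a hypothetical name, and it assumes a line of ten or more `=` characters never appears inside page text and that header values never contain `---`.

```python
import re


def parse_combined(text: str) -> list[dict]:
    """Split merged output into {'source', 'canonical', 'text'} records."""
    records = []
    # Records are separated by a line made only of '=' characters.
    for block in re.split(r"^={10,}\s*$", text, flags=re.MULTILINE):
        block = block.strip()
        if not block.startswith("---"):
            continue  # skip empty or malformed blocks
        # The header sits between the first two '---' markers.
        _, header, body = block.split("---", 2)
        fields = {}
        for line in header.strip().splitlines():
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip("<>")
        records.append({
            "source": fields.get("來源", ""),
            "canonical": fields.get("Canonical", ""),
            "text": body.strip(),
        })
    return records
```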

🔮 Use cases

Building a site-wide index of a website
Aggregating a site's text content for NotebookLM or a vector database
Data backup or text mining

📁 Project structure

Crawl-Links-Playwright/
├── crawl_links_playwright.py   # Main program: the Playwright crawler
├── README.md                   # Project documentation (GitHub landing page)
├── requirements.txt            # Python package requirements
├── .gitignore                  # Git ignore rules
├── LICENSE                     # MIT license text
└── (output files)
    ├── links.csv               # Crawl results (link list)
    ├── combined_content.txt    # Merged text content (when --dump_text is used)
    ├── combined_content_part1.txt  # NotebookLM split file (when enabled)
    └── combined_content_part2.txt  # NotebookLM split file (when enabled)

📜 License

This project is licensed under the MIT License.

Repository: kpxx/crawl-links-playwright
