Resolve merge conflicts (CHATGPT_CONTEXT + book_scraper)

feat/download-progress-abort
peter.fong 2 weeks ago
commit 76f68cc1fc

@@ -1,110 +1,86 @@
# ChatGPT Project Context Bookscraper / Celery Branch

## 1. Scraper Status (STABLE - do NOT change without permission)

The Python-based bookscraper works fully end-to-end. The following parts have been tested, are stable, and must not be rewritten or cleaned up without explicit permission:

### prepare_scripts()

Generates 3 shell scripts in the output folder:

- say.txt - TTS only (Sinji voice, timestamps)
- makebook.txt - m4b merge + move only
- allinone.txt - TTS + merge + move

### Volume structure

output/<book>/<site>/v1, v2, v3, …

### Chapter output

- Chapter 1 contains a header
- All other chapters contain only text

### Working functionality

- Rate limiter
- Chapter parsing
- Description parsing
- Cover download
- Skip logic
- prepare_scripts()

## 2. Development rules for ChatGPT

- No silent rewrites
- No cleanup of working code
- No restructuring without permission
- Changes minimal and targeted
- Diff/patch style
- State exactly which files are touched
- Preserve the directory structure

## 3. Focus area (celery_branch)

- Improve the Celery worker architecture
- Queueing & retry policies
- Stability & observability
- Integration with the scraping tasks
- Download functionality first; audio later

## 4. Environment

- VS Code Dev Containers
- Docker Compose present
- Celery + Redis in use
- Redis via the host machine: redis://host.docker.internal:6379

## 5. Download Worker (NEW, working)

- download_worker.py fully operational
- Tasks are registered correctly:
  - tasks.scraping.download_chapter_task
  - tasks.scraping.download_chapter
  - tasks.scraping.scrape_book
- Redis broker/backend loaded correctly from .env
- The worker picks up tasks and executes them
- The download pipeline works end-to-end

## 6. Logbus status (NEW)

- The logbus now uses REDIS_BACKEND or REDIS_BROKER
- No more crashes
- Logging is non-blocking

## 7. Still to build (download-only phase)

- DownloadController for bulk downloads
- Flask API endpoint for downloads
- scrape_worker / audio_worker later

## 8. Project tree (summarised)

scraper/ tasks/ worker/ logbus/ app.py docker-compose.yml

The OUTPUT structure is kept as-is.

find . \
  -path "./output" -prune -o \
  -path "*/__pycache__*" -prune -o \
  -print | sed -e 's;[^/]*/; |;g;s;|;|--;'
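Sections 4-6 above pin down the broker wiring: Redis on the host machine, with REDIS_BROKER / REDIS_BACKEND coming from .env. A minimal sketch of what that Celery app setup could look like; the module path and the fallback URL's /0 database suffix are assumptions, not taken from the repo.

```
# scraper/celery_app.py (hypothetical path) - broker/backend wiring sketch
import os

from celery import Celery

# REDIS_BROKER / REDIS_BACKEND come from .env; fall back to the host-machine
# Redis from section 4 if they are missing (the fallback itself is an assumption).
broker = os.getenv("REDIS_BROKER", "redis://host.docker.internal:6379/0")
backend = os.getenv("REDIS_BACKEND", broker)

celery_app = Celery("bookscraper", broker=broker, backend=backend)
```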
# CHATGPT_CONTEXT.md

## 📘 Project Context - BookScraper (Celery Pipeline)

**Status: 1 December 2025**

This document brings ChatGPT up to speed instantly in a new session.
It describes the current architecture, flow, directory layout and pending features.

---

## ✅ System architecture (current implementation)

The project is a **distributed Chinese webnovel scraper** with:

- **Flask Web GUI** (`web`)
- **Celery workers** per task type:
  - `scraping` → book metadata & chapter list
  - `controller` → dispatching pipelines
  - `download` → fetching HTML
  - `parse` → extracting + cleaning content + building the chapter header
  - `save` → writing text to disk + volume layout
  - `audio` → (later) m4b generation
- **Redis** as broker and backend

All workers have the same codebase mounted via:

```
volumes:
  - .:/app
```

---
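The worker-per-task-type split above maps onto Celery queues. A sketch of how that routing could be declared on the app from the earlier snippet; only the tasks.scraping.* names are confirmed by the repo, while the queue names and the parse/save entries are assumptions.

```
# Hypothetical queue routing for the worker types listed above.
celery_app.conf.task_routes = {
    "tasks.scraping.scrape_book": {"queue": "scraping"},
    "tasks.scraping.download_chapter": {"queue": "download"},
    "tasks.scraping.download_chapter_task": {"queue": "download"},
    # parse/save/audio tasks would get their own queues in the same way
}
```

A download worker would then be started with something like `celery -A scraper.celery_app worker -Q download` (module path as assumed above).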
## 📂 Directory structure (relevant part)

```
bookscraper/
├── scraper/
│   ├── tasks/
│   │   ├── scraping.py
│   │   ├── controller_tasks.py
│   │   ├── download_tasks.py
│   │   ├── parse_tasks.py
│   │   ├── save_tasks.py
│   │   └── pipeline.py
│   ├── download_controller.py
│   ├── utils.py
│   ├── sites.py
│   ├── book_scraper.py
├── docker/
│   ├── Dockerfile.scraper
│   ├── Dockerfile.web
│   └── Dockerfile.audio
├── docker-compose.yml
└── CHATGPT_CONTEXT.md
```

---
## 🔄 Book pipeline (as it works now)

1. **start_scrape_book(url)**
2. **launch_downloads(scrape_result)**
3. **download_chapter(num, url)**
4. **parse_chapter(download_result)**
5. **save_chapter(parsed, volume_path)**

Volumes are written correctly to:

```
BOOKSCRAPER_OUTPUT_DIR=/Users/peter/Desktop/books
```

---
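The five steps above compose naturally as one Celery chain per chapter. A sketch, assuming the step names are the actual task functions and that the controller supplies volume_path; the import path is an assumption.

```
# Per-chapter fan-out sketch for the pipeline steps listed above.
from celery import chain

from scraper.tasks.pipeline import (  # module path is an assumption
    download_chapter, parse_chapter, save_chapter,
)

def launch_downloads(scrape_result, volume_path):
    for ch in scrape_result["chapters"]:
        chain(
            download_chapter.s(ch["num"], ch["url"]),  # -> download_result
            parse_chapter.s(),                         # (download_result) -> parsed
            save_chapter.s(volume_path),               # (parsed, volume_path)
        ).apply_async()
```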
## 🚀 Planned features

1. Download retry + delay + 429 pause (sketched below)
1. Cover download
1. Make scripts
1. Audio pipeline integration

---

## 🧵 Agreement

**ChatGPT must always ask for the existing code first before proposing changes.**

---
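For the first planned feature (download retry + delay + pausing on HTTP 429), a Celery-native sketch could look like the following; the retry counts and the 429 back-off value are placeholders, not decisions made in the repo.

```
# Sketch of retry + delay + 429 pause for the download task; numbers are placeholders.
import requests
from celery import shared_task

@shared_task(bind=True, max_retries=5, default_retry_delay=10)
def download_chapter(self, num, url):
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    if resp.status_code == 429:
        # Rate limited: pause much longer before retrying this chapter.
        raise self.retry(countdown=120)
    resp.raise_for_status()
    return {"num": num, "url": url, "html": resp.text}
```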
## 🎯 Summary

You now have:

✔ A scalable, modular Celery pipeline
✔ Volume splitting
✔ Headers for chapter 1
✔ Correct I/O paths with the host mount
✔ A stable end-to-end scraping flow
ChatGPT Project Context Bookscraper / Celery Branch
(Place in /docs/CHATGPT_CONTEXT.md or in the repo root)

1. Scraper Status (DO NOT CHANGE WITHOUT PERMISSION)

The Python-based bookscraper is fully functional.
The following parts are stable and must not be overwritten, rewritten or cleaned up without explicit permission:

prepare_scripts() generates three scripts:

say.txt: the TTS script only (bash, timestamps, Sinji voice, safe)
makebook.txt: m4b merge + move only
allinone.txt: TTS + merge + move

Volume structure: v1, v2, v3, …

Chapter output:

Chapter 1 contains a header:

URL: <chapter-url>
Description:
<description>

---

All other chapters contain only the text.

Rate limiter works
Chapter parsing works
Description parsing works
Cover download works
Skip logic works correctly
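A sketch of how the save step could produce the chapter-1 header and the v1/v2/v3 volume layout described above; the chapters-per-volume count, the helper name and the file-naming scheme are assumptions.

```
# Hypothetical save helper for the output layout described above:
# output/<book>/<site>/v1, v2, ... with a header only in chapter 1.
from pathlib import Path

CHAPTERS_PER_VOLUME = 100  # assumption, not the repo's actual setting

def save_chapter_text(out_dir, book, site, num, title, text, url, description):
    volume = f"v{(num - 1) // CHAPTERS_PER_VOLUME + 1}"
    folder = Path(out_dir) / book / site / volume
    folder.mkdir(parents=True, exist_ok=True)

    body = text
    if num == 1:
        # Chapter 1 carries the header block; later chapters are text only.
        body = f"URL: {url}\nDescription:\n{description}\n---\n\n{text}"

    (folder / f"{num:04d} {title}.txt").write_text(body, encoding="utf-8")
```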
2. Development rules for ChatGPT

Never delete existing working code
No silent rewrites
No restructuring without permission
Changes are applied minimally and purposefully
Changes preferably in diff/patch style
Always state which files are touched
Keep the directory structure:
output/<book>/<site>/v1 etc.

3. Current focus: celery_branch

ChatGPT must focus on:

Improving the Celery worker architecture
Queueing & retry policies
Stability & observability
Integration with the scraping tasks

Without breaking the scraper functionality

4. Environment

The project runs in VS Code Dev Containers
Docker Compose structures are present
Celery + queue + worker containers are in use

Use this context in all answers.

find . \
  -not -path "*/__pycache__*" \
  -not -name "*.pyc" \
  -print | sed -e 's;[^/]*/; |;g;s;|;|--;'

@@ -2,7 +2,7 @@
 import requests
 from bs4 import BeautifulSoup
-from urllib.parse import urljoin, urlparse
+from urllib.parse import urljoin
 from scraper.logger import log_debug
 from scraper.utils import clean_text, load_replacements
@@ -11,8 +11,11 @@ from scraper.models.book_state import Chapter

 class BookScraper:
     """
-    Lightweight scraper: only metadata + chapter list.
-    All downloading/parsing/saving is handled by Celery tasks.
+    Minimal scraper: only metadata + chapter list.
+    The DownloadController handles Celery pipelines for:
+    - download
+    - parse
+    - save
     """

     def __init__(self, site, url):
@@ -23,17 +26,17 @@ class BookScraper:
         self.book_author = ""
         self.book_description = ""
         self.cover_url = ""
-        self.chapter_base = None
         self.chapters = []
+        self.chapter_base = None

         # Load custom replacements
         extra = load_replacements("replacements.txt")
         self.site.replacements.update(extra)

     # ------------------------------------------------------------
-    def parse_book_info(self):
-        """Parse title, author, description, cover from the main page."""
+    def execute(self):
+        """Main entry point. Returns metadata + chapter URLs."""
         soup = self._fetch(self.url)
         self._parse_title(soup)
@@ -41,13 +44,25 @@ class BookScraper:
         self._parse_description(soup)
         self._parse_cover(soup)

+        # Parse chapter list page + chapter links
         chapter_page = self.get_chapter_page(soup)
         self.parse_chapter_links(chapter_page)

+        log_debug(f"[BookScraper] Completed metadata parse")
+
+        return {
+            "title": self.book_title,
+            "author": self.book_author,
+            "description": self.book_description,
+            "cover_url": self.cover_url,
+            "book_url": self.url,
+            "chapters": [
+                {"num": ch.number, "title": ch.title, "url": ch.url}
+                for ch in self.chapters
+            ],
+        }
+
     # ------------------------------------------------------------
     def _fetch(self, url):
+        """Simple fetch (no retry), DownloadController handles errors."""
         log_debug(f"[BookScraper] Fetch: {url}")
         resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
         resp.encoding = self.site.encoding
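The new execute() returns plain, JSON-serialisable data, so the scraping task can hand it straight to the DownloadController. A rough sketch of that call site; the task name tasks.scraping.scrape_book appears in the context document above, but the get_site helper and the exact module layout are assumptions.

```
# Sketch of how the scraping task could consume the new execute() contract.
from celery import shared_task

from scraper.book_scraper import BookScraper
from scraper.sites import get_site  # helper name is an assumption

@shared_task(name="tasks.scraping.scrape_book")
def scrape_book(site_key, url):
    scraper = BookScraper(get_site(site_key), url)
    # dict with title, author, description, cover_url, book_url and chapters
    return scraper.execute()
```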
@@ -74,7 +89,6 @@ class BookScraper:
         parts = []
         for sib in span.next_siblings:
-            # Stop when next book section begins
             if getattr(sib, "name", None) == "span":
                 break
@@ -83,22 +97,21 @@ class BookScraper:
                 if hasattr(sib, "get_text")
                 else str(sib).strip()
             )
             if text:
                 parts.append(text)

-        self.book_description = "\n".join(parts)
-        log_debug(
-            f"[BookScraper] Description length = {len(self.book_description)} characters"
-        )
+        self.book_description = clean_text("\n".join(parts), self.site.replacements)
+        log_debug(f"[BookScraper] Description length = {len(self.book_description)}")

     # ------------------------------------------------------------
     def _parse_cover(self, soup):
-        cover = soup.find("img", src=lambda v: v and "files/article/image" in v)
-        if not cover:
+        img = soup.find("img", src=lambda v: v and "files/article/image" in v)
+        if not img:
             log_debug("[BookScraper] No cover found")
             return

-        self.cover_url = urljoin(self.site.root, cover.get("src"))
+        self.cover_url = urljoin(self.site.root, img.get("src"))
         log_debug(f"[BookScraper] Cover URL = {self.cover_url}")

     # ------------------------------------------------------------
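The description now goes through clean_text() with the site's replacement map before it is stored. The real scraper.utils implementations are not part of this diff; purely as an assumption, the pair could behave roughly like this:

```
# Assumed behaviour of the scraper.utils helpers used above - not the repo's code.
def load_replacements(path):
    """Read 'old<TAB>new' pairs from a text file into a dict."""
    repl = {}
    try:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                if "\t" in line:
                    old, new = line.rstrip("\n").split("\t", 1)
                    repl[old] = new
    except FileNotFoundError:
        pass
    return repl

def clean_text(text, replacements):
    """Apply the replacement map and drop empty or whitespace-only lines."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())
```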
@@ -108,13 +121,13 @@ class BookScraper:
             "html > body > div:nth-of-type(6) > div:nth-of-type(2) > div > table"
         )
         href = node.select_one("a").get("href")
-        url = urljoin(self.site.root, href)
-        parsed = urlparse(url)
-        bp = parsed.path.rsplit("/", 1)[0] + "/"
-        self.chapter_base = f"{parsed.scheme}://{parsed.netloc}{bp}"
-        return self._fetch(url)
+        chapter_url = urljoin(self.site.root, href)
+        # base for chapter links
+        parts = chapter_url.rsplit("/", 1)
+        self.chapter_base = parts[0] + "/"
+        return self._fetch(chapter_url)

     # ------------------------------------------------------------
     def parse_chapter_links(self, soup):
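Both the removed urlparse-based computation and the new rsplit one strip the trailing filename from the absolute chapter-index URL, so chapter links keep resolving the same way; a quick illustration with a made-up URL:

```
# Illustration only; the URL is made up.
from urllib.parse import urljoin, urlparse

chapter_url = "https://www.example.com/files/article/5/5772/index.html"

parsed = urlparse(chapter_url)                                    # old approach
old_base = f"{parsed.scheme}://{parsed.netloc}{parsed.path.rsplit('/', 1)[0]}/"

new_base = chapter_url.rsplit("/", 1)[0] + "/"                    # new approach

assert old_base == new_base == "https://www.example.com/files/article/5/5772/"
print(urljoin(new_base, "123456.html"))  # chapter hrefs resolve against the base
```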
@@ -136,8 +149,3 @@ class BookScraper:
             idx += 1

         log_debug(f"[BookScraper] Found {len(self.chapters)} chapters")
-
-    # ------------------------------------------------------------
-    def get_chapter_list(self):
-        """Return the chapter list (DownloadController reads this)."""
-        return self.chapters
