find . \
  -path "./output" -prune -o \
  -path "*/__pycache__" -prune -o \
  -print | sed -e 's;[^/]*/; |;g;s;|;|--;'

# CHATGPT_CONTEXT.md

## 📘 Project Context — BookScraper (Celery Pipeline)

**Status: 1 December 2025**

This document brings ChatGPT up to speed at the start of a new session.
It describes the current architecture, flow, directory layout, and pending features.

---

## ✅ System Architecture (current implementation)

The project is a **distributed scraper for Chinese web novels** consisting of:

- **Flask web GUI** (`web`)
- **Celery workers**, one per task type:
  - `scraping` → book metadata and chapter list
  - `controller` → dispatches pipelines
  - `download` → fetches the HTML
  - `parse` → extracts and cleans the content, builds the chapter header
  - `save` → writes text to disk and assigns chapters to volumes
  - `audio` → (later) m4b generation
- **Redis** as broker and result backend

All workers mount the same codebase via:

```
volumes:
  - .:/app
```
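
The one-worker-per-task-type split above implies that each task is routed to its own queue. The following is a minimal sketch of such a routing table, written as a plain dictionary in the shape Celery's `task_routes` setting accepts (glob pattern mapped to a queue). The task module paths are inferred from the directory layout in this document and the queue names mirror the worker names; both are assumptions, as is the helper `queue_for`, which only illustrates how a task name resolves to a queue. A worker would then consume a single queue through Celery's `-Q` option.

```python
from fnmatch import fnmatch

# Hypothetical routing table (assumption): glob patterns over task names,
# each mapped to the queue served by the matching worker type.
TASK_ROUTES = {
    "scraper.tasks.scraping.*": {"queue": "scraping"},
    "scraper.tasks.controller_tasks.*": {"queue": "controller"},
    "scraper.tasks.download_tasks.*": {"queue": "download"},
    "scraper.tasks.parse_tasks.*": {"queue": "parse"},
    "scraper.tasks.save_tasks.*": {"queue": "save"},
}

# Celery's default queue name, used when no pattern matches.
DEFAULT_QUEUE = "celery"


def queue_for(task_name: str) -> str:
    """Return the queue of the first pattern that matches the task name."""
    for pattern, route in TASK_ROUTES.items():
        if fnmatch(task_name, pattern):
            return route["queue"]
    return DEFAULT_QUEUE


if __name__ == "__main__":
    # Illustrative task names only; the real names depend on the codebase.
    print(queue_for("scraper.tasks.download_tasks.download_chapter"))
    print(queue_for("scraper.tasks.audio_tasks.make_m4b"))
```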

---

## 📂 Directory structure (relevant part)

```
bookscraper/
│
├── scraper/
│   ├── tasks/
│   │   ├── scraping.py
│   │   ├── controller_tasks.py
│   │   ├── download_tasks.py
│   │   ├── parse_tasks.py
│   │   ├── save_tasks.py
│   │   └── pipeline.py
│   ├── download_controller.py
│   ├── utils.py
│   ├── sites.py
│   └── book_scraper.py
│
├── docker/
│   ├── Dockerfile.scraper
│   ├── Dockerfile.web
│   └── Dockerfile.audio
│
├── docker-compose.yml
└── CHATGPT_CONTEXT.md
```

---

## 🔄 Book Pipeline (as it currently works)

1. **start_scrape_book(url)**
2. **launch_downloads(scrape_result)**
3. **download_chapter(num, url)**
4. **parse_chapter(download_result)**
5. **save_chapter(parsed, volume_path)**

Volumes are written correctly to the directory configured in:

```
BOOKSCRAPER_OUTPUT_DIR=/Users/peter/Desktop/books
```
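
Steps 3 to 5 and the output directory above can be sketched as plain, synchronous functions. This is not the project's implementation: the function bodies are stand-ins, and the chapters-per-volume value, the `Volume_NNN` directory naming, the per-chapter file naming, the `output` fallback, and the helpers `volume_path` and `run_chapter` are all assumptions made for illustration. In the real pipeline each step would be a Celery task running on its own worker.

```python
import os
from pathlib import Path

# Assumption: the volume size is not stated in this document.
CHAPTERS_PER_VOLUME = 200


def volume_path(output_dir: str, book_title: str, chapter_num: int) -> Path:
    """Map a 1-based chapter number to its (hypothetical) volume directory."""
    volume_index = (chapter_num - 1) // CHAPTERS_PER_VOLUME + 1
    return Path(output_dir) / book_title / f"Volume_{volume_index:03d}"


def download_chapter(num: int, url: str) -> dict:
    # Stand-in for the HTTP fetch done by the `download` worker.
    return {"num": num, "url": url, "html": f"<p>chapter {num}</p>"}


def parse_chapter(download_result: dict) -> dict:
    # Stand-in for content extraction and cleaning in the `parse` worker.
    text = download_result["html"].replace("<p>", "").replace("</p>", "")
    return {"num": download_result["num"], "text": text}


def save_chapter(parsed: dict, volume_dir: Path) -> Path:
    # Write one text file per chapter inside its volume directory.
    volume_dir.mkdir(parents=True, exist_ok=True)
    target = volume_dir / f"{parsed['num']:04d}.txt"
    target.write_text(parsed["text"], encoding="utf-8")
    return target


def run_chapter(num: int, url: str, book_title: str) -> Path:
    """Run download, parse, and save for a single chapter."""
    # Fallback directory "output" is an assumption for local runs.
    output_dir = os.environ.get("BOOKSCRAPER_OUTPUT_DIR", "output")
    parsed = parse_chapter(download_chapter(num, url))
    return save_chapter(parsed, volume_path(output_dir, book_title, num))
```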

---

## 🚀 Planned features

1. Download retry, delay, and pause on HTTP 429
2. Cover download
3. Make scripts
4. Audio pipeline integration
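
As a starting point for the first planned feature, here is one possible shape for a retrying download. It is a sketch under stated assumptions: the `fetch` callable returning `(status_code, body)`, the parameter names, the exponential delay, and the fixed pause on HTTP 429 are all hypothetical choices, not existing project code. The `sleep` function is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable, Tuple

# Hypothetical fetch signature (assumption): returns (status_code, body).
Fetch = Callable[[str], Tuple[int, str]]


def download_with_retry(
    fetch: Fetch,
    url: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    pause_on_429: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Fetch `url`, retrying on failure and pausing longer on HTTP 429."""
    last_status = None
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        last_status = status
        if attempt == max_attempts:
            break  # no point in sleeping after the final attempt
        if status == 429:
            # Rate limited: wait for a fixed, longer pause.
            sleep(pause_on_429)
        else:
            # Other failures: exponential delay (base, 2x base, 4x base, ...).
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"giving up on {url} after {max_attempts} attempts "
        f"(last status {last_status})"
    )
```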

---

## 🧵 Agreement

**ChatGPT must always ask for the existing code before proposing changes.**

---

## 🎯 Summary

You now have:

- ✔ A scalable, modular Celery pipeline
- ✔ Volume splitting
- ✔ Headers for chapter 1
- ✔ Correct I/O paths via the host mount
- ✔ A stable end-to-end scraping flow