Pipeline Architecture
How 4,800+ historical books flow from scanned images to searchable, translated text — the full technical picture.
Processing Flow
Each book passes through 10 stages. Two crons orchestrate the pipeline every 10 minutes.
AI workers never write directly to MongoDB, which prevents connection storms during large batch jobs.
Safety Mechanisms
Backpressure
Each stage has hard caps on concurrent jobs (50 Lambda OCR, 100 Lambda translation, 200 Batch API, 10 image extraction). The cron skips submission when limits are hit.
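The submission guard can be sketched as follows. The stage names, the `submittable` helper, and the way in-flight counts are obtained are illustrative; only the caps themselves come from the list above.

```python
# Hypothetical sketch of the per-stage submission guard.
STAGE_LIMITS = {
    "ocr_lambda": 50,
    "translation_lambda": 100,
    "batch_api": 200,
    "image_extraction": 10,
}

def submittable(stage: str, in_flight: int) -> int:
    """How many new jobs the cron may submit for a stage on this tick.

    Returns 0 when the hard cap is already reached, so the cron
    simply skips submission for that stage.
    """
    cap = STAGE_LIMITS[stage]
    return max(0, cap - in_flight)
```

On each tick the cron would count in-flight jobs per stage and submit at most `submittable(stage, in_flight)` new ones, which is what makes the caps backpressure rather than a queue limit.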
Staleness Detection
Books stuck in submitted/in-progress states for 48+ hours get rolled back to the previous stage. Zombie jobs (processing >24h) are force-completed.
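The two thresholds above can be expressed as a single triage function. State names and the function itself are illustrative; the 48-hour and 24-hour cutoffs are from the description.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=48)   # submitted/in-progress rollback threshold
ZOMBIE_AFTER = timedelta(hours=24)  # processing jobs force-completed after this

def triage(state: str, since: datetime, now: datetime) -> str:
    """Return the action the staleness sweep takes for one book or job."""
    age = now - since
    if state in ("submitted", "in_progress") and age >= STALE_AFTER:
        return "rollback_previous_stage"
    if state == "processing" and age >= ZOMBIE_AFTER:
        return "force_complete"
    return "leave"
```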
Emergency Stop
Selective phase pausing via system_config. Both submission AND completion phases are guarded — in-flight work can't cascade through paused stages.
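A minimal sketch of the guard, assuming `system_config` stores paused phases per stage (the exact document shape is an assumption):

```python
# Assumed system_config shape:
# {"translation": {"submission": True, "completion": False}}
def phase_allowed(config: dict, stage: str, phase: str) -> bool:
    """Submission and completion are guarded independently, so pausing
    a stage also stops in-flight work from completing through it."""
    return not config.get(stage, {}).get(phase, False)
```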
Circuit Breakers
3+ consecutive batch failures trigger automatic Lambda fallback. Quota exhaustion (HTTP 429) immediately switches backend. OCR loops capped at 3 retries.
Non-blocking Enrichment
Metadata, first-translation (FT) check, summary, index, and chapters are non-critical: persistent failures skip the book ahead rather than stalling the entire pipeline.
Write Queue Isolation
AI workers (600+ concurrent) never write to MongoDB directly. Results flow through an SQS write queue to a Writer Lambda capped at 50 concurrent instances.
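A worker's side of this looks roughly like serializing a result and enqueueing it; the message shape below is an assumption, not the real schema. Only the Writer Lambda ever holds a MongoDB connection.

```python
import json

def build_write_message(book_id: str, page: int, field: str, payload: dict) -> str:
    """Serialize a worker result for the SQS write queue (illustrative
    shape). The Writer Lambda, capped at 50 concurrent instances,
    consumes these and is the only component that touches MongoDB."""
    return json.dumps({
        "book_id": book_id,
        "page": page,
        "field": field,          # e.g. "ocr.data" or "translation.data"
        "payload": payload,
    })
```

The effect is that 600+ workers produce at most 50 database writers, so Mongo connection count is bounded regardless of batch size.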
Special Behaviors
English Modernization
English books from before 1700 are modernized (Early Modern English to Modern English) instead of translated. The output is stored in the same translation.data field, so all downstream processing works identically.
Split Detection
Digitized books often have two-page spreads scanned as a single image. On import, the system samples pages and uses aspect ratio analysis (< 0.9 = single, > 1.3 = spread) to flag books that need splitting. Crop coordinates (0-1000 scale) are computed for each half. Heuristic, ML, and Gemini vision methods exist for per-page refinement.
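The sampler's core heuristic is small enough to show directly. The thresholds come from the description above; the naive midpoint crop is a simplification of the per-page refinement:

```python
def classify_spread(width: int, height: int) -> str:
    """Aspect-ratio heuristic from the import-time sampler:
    < 0.9 means single page, > 1.3 means two-page spread."""
    ratio = width / height
    if ratio < 0.9:
        return "single"
    if ratio > 1.3:
        return "spread"
    return "ambiguous"

def halves_crop() -> tuple:
    """Naive midpoint crop on the 0-1000 x-scale; the real per-page
    refinement (heuristic / ML / Gemini vision) adjusts the gutter."""
    return (0, 500), (500, 1000)
```

Ratios between 0.9 and 1.3 fall into neither bucket, which is exactly where the per-page refinement methods earn their keep.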
FIFO Context Chain
Translation uses an SQS FIFO queue — pages process in order per book. Each Lambda invocation fetches the previous page's translation for terminology consistency and sentence continuity. Batch API is never used for translation.
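The per-book ordering falls out of SQS FIFO semantics: a sketch of the `send_message` parameters (the body format is an assumption):

```python
def fifo_message(book_id: str, page: int, body: str) -> dict:
    """Parameters for sqs.send_message on a FIFO queue.
    MessageGroupId = book_id keeps pages strictly ordered within a book
    while different books still process in parallel; the deduplication
    id makes resubmission of the same page idempotent."""
    return {
        "MessageBody": body,
        "MessageGroupId": book_id,                      # per-book ordering
        "MessageDeduplicationId": f"{book_id}-{page}",  # idempotent submit
    }
```

Because SQS delivers one message group sequentially, each Lambda invocation can safely fetch the previous page's finished translation as context.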
Page Revisions
Every version of OCR and translation text is preserved in the page_revisions collection — AI, batch, manual, contributor. If a page was manually edited, re-processing creates a backup snapshot first.
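The backup rule can be sketched as a pure function over the revision history; the field names are illustrative:

```python
def revisions_for_reprocess(current_text: str, manually_edited: bool,
                            history: list) -> list:
    """Sketch of the page_revisions rule: if a page was manually edited,
    snapshot the current text before a new AI pass overwrites it."""
    if manually_edited:
        history = history + [{"source": "manual_backup", "text": current_text}]
    return history
```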
Multi-Column Rendering
OCR prompts detect multi-column layouts (<columns>N</columns> metadata + <column-break/> inline markers). The reader renders these as CSS grid layouts, with a fallback midpoint split when only the metadata tag exists.
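The reader-side fallback reduces to: use the inline markers when present, otherwise split at the midpoint. A sketch (character-based midpoint is an assumption; the real renderer works on the page's text blocks):

```python
def split_columns(text: str, n_columns: int) -> list:
    """Use <column-break/> markers when the OCR emitted them; otherwise
    fall back to a midpoint split when only <columns>N</columns> exists."""
    if "<column-break/>" in text:
        return text.split("<column-break/>")
    if n_columns <= 1:
        return [text]
    mid = len(text) // 2
    return [text[:mid], text[mid:]]
```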
Cost per Book
Based on gemini-3-flash-preview pricing. A typical 300-page book costs roughly $1.50 to fully process through all stages.
| Step | Cost/page | 300-page book |
|---|---|---|
| OCR (Lambda) | $0.0023 | $0.68 |
| OCR (Batch API) | $0.0011 | $0.34 |
| Translation | $0.0022 | $0.66 |
| Summary + Index | — | $0.04 |
| Chapter Extraction | — | $0.01 |
| Image Extraction (eligible pages only) | $0.0016 | ~$0.05 |
| Metadata + FT Check | — | $0.008 |
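A back-of-envelope check of the table, taking the Lambda OCR path. The ~$0.05 image figure already reflects that only a small fraction of pages are classified as image-bearing (see the page-type filter in the history below):

```python
# Per-page costs (Lambda OCR path) and flat per-book costs from the table.
per_page = {"ocr": 0.0023, "translation": 0.0022}
flat = {
    "summary_index": 0.04,
    "chapters": 0.01,
    "images": 0.05,        # only pages classified as image-bearing are scanned
    "metadata_ft": 0.008,
}
pages = 300
total = sum(v * pages for v in per_page.values()) + sum(flat.values())
# total comes to about $1.46, consistent with "roughly $1.50" per book
```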
Processing History
The pipeline above describes the current architecture. Books processed earlier went through different models, prompts, and workflows. Most books are being gradually reprocessed to current standards as capacity allows.
| Period | What changed | Impact |
|---|---|---|
| Dec 2025 | Initial pipeline: manual imports, basic OCR with gemini-2.0-flash, no automated orchestration | First ~500 books processed manually |
| Jan 2026 | Split detection for two-page spreads: cascade of heuristic pixel analysis → ML model → Gemini vision. Computes crop coordinates (0-1000 scale) for each half | Digitized books with facing pages render correctly |
| Jan 2026 | Image archiving moved to dedicated Hetzner server. Pages archived from external sources (IA, Gallica, MDZ) to Vercel Blob with thumbnail generation | Long-term image availability, faster page loads |
| Jan 2026 | Auto pipeline cron introduced: books flow through stages automatically every 10 minutes instead of manual triggering | Fully automated processing for new imports |
| Jan – Feb 2026 | Upgraded OCR from gemini-2.0/2.5-flash to gemini-3-flash-preview. Prompt evolved through v1-v6, adding page-type classification, multi-column detection, image bounding boxes | ~250k+ pages on current quality, ~75k older pages being reprocessed |
| Feb 2026 | English modernization: pre-1700 English books get Early Modern → Modern English instead of translation. Output stored in same field, so all downstream processing works identically | ~200 English books modernized automatically |
| Feb 2026 | Multi-column rendering: OCR detects column layouts (<columns>N</columns> + <column-break/> markers), reader renders as CSS grid | Two-column Renaissance books display correctly |
| Feb 2026 | Write Queue architecture: AI workers (600+ concurrent) no longer write directly to MongoDB. Results flow through SQS → Writer Lambda (50 max concurrency) | Eliminated connection storms during large batches |
| Feb 2026 | Translation switched to Lambda FIFO queue only (Batch API retired for translation). Each page receives the previous page's translation as context for terminology consistency | Better translation quality, ~30k stale translations being redone |
| Feb 2026 | Added Metadata Enrichment (language, categories, display title, source work dates) and First Translation Check stages to the pipeline | ~280 confirmed first English translations identified |
| Feb 2026 | Chapter extraction and enrichment (summary + index) split into dedicated cron so they don't starve translation of time budget | More reliable enrichment, no pipeline stalls |
| Mar 2026 | Image extraction filters by page type — only pages classified as illustration, diagram, map, frontispiece, or mixed are scanned | ~80-90% cost reduction for image extraction |
Beyond the Pipeline
Once books reach “complete,” several downstream systems build on the processed data.
Scholarly Editions & DOI
Completed translations can be published as citable scholarly editions with DOIs via Zenodo. Each edition is an immutable snapshot with content hash, AI-generated introduction and methodology, contributor tracking (AI + human), and citations in APA and BibTeX formats. Versioning is supported — republishing creates a new version.
Gallery
Extracted illustrations, emblems, diagrams, and engravings from all books form a browsable gallery with AI-generated museum-style descriptions, subject tags, bounding boxes, and quality scores. Gallery images feed the social media system for automated tweet generation.
Encyclopedia
AI-generated book indexes (people, places, concepts) are aggregated into an encyclopedia with cross-references across the entire library. Entity pages show every book that mentions a person or concept, with page-level links.
First Translation Identification
A two-stage verification system identifies books that are the first known English translation of a historical text. Stage 1 (AI metadata enrichment) classifies during processing. Stage 2 (LLM deep knowledge check) verifies against known translations, academic publishers, and dissertations.
GitHub Sync
On completion, full book text (OCR + translations) is synced to a public GitHub repository as plain text files — a permanent, version-controlled archive independent of the web application and database.
MCP Server & API
The full library is accessible via a Model Context Protocol server (for AI assistants) and a REST API. Seven tools let AI models search books, read translations, and find illustrations across 4,800+ historical texts.
Automation & Human Review
The pipeline is a stigmergic system — each safety mechanism, backpressure limit, and error handler is a trace left by a past failure that shapes future processing. The environment itself encodes intelligence: books flow through paths carved by previous experience, with human judgment required only at the boundaries.