Pipeline Architecture

How 4,800+ historical books flow from scanned images to searchable, translated text — the full technical picture.

14k Books · 19k In pipeline · 2.6M Pages OCR'd · 1.7M Pages translated · 6.3k Complete

Processing Flow

Each book passes through 10 stages. Two crons orchestrate the pipeline every 10 minutes.

post-import-pipeline (every 10 min):
Import (0) → Archive (3,611) → OCR (12) → Metadata (0) → FT Check (0) → Translate (811)

enrich-books (every 10 min):
Enrich (0) → Chapters (0)

post-import-pipeline (priority pass):
Images (5,250)

Complete: 6,335
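The advance step each cron performs amounts to a small state machine. The stage order below comes from the flow above; the `advance` function and the book record shape are illustrative sketches, not the production code.

```python
# Stage order from the flow above; the record shape is a hypothetical sketch.
STAGES = ["import", "archive", "ocr", "metadata", "ft_check",
          "translate", "enrich", "chapters", "images", "complete"]

def advance(book: dict) -> dict:
    """Move a finished book to its next stage; 'complete' is terminal."""
    if book["status"] != "done":
        return book  # still processing: the cron leaves it alone
    i = STAGES.index(book["stage"])
    if i + 1 < len(STAGES):
        return {**book, "stage": STAGES[i + 1], "status": "pending"}
    return book
```

Each cron tick simply applies this transition to every eligible book, which is why no human trigger is needed after import.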
Write Queue Pattern

AI Workers (600+) → SQS Write Queue → Writer Lambda (50) → MongoDB Atlas

AI workers never write directly to MongoDB; this prevents connection storms during large batch jobs.

250 failed (3+ retries exhausted) · 1,562 need manual attention

Safety Mechanisms

Backpressure

Each stage has hard caps on concurrent jobs (50 Lambda OCR, 100 Lambda translation, 200 Batch API, 10 image extraction). The cron skips submission when limits are hit.
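The check itself is trivial; the caps are the ones quoted above, while the stage keys are assumptions for illustration.

```python
# Concurrency caps quoted in the text; the stage keys are illustrative.
CAPS = {"lambda_ocr": 50, "lambda_translation": 100,
        "batch_api": 200, "image_extraction": 10}

def can_submit(stage: str, in_flight: int) -> bool:
    """The cron submits new jobs only while a stage is under its hard cap."""
    return in_flight < CAPS[stage]
```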

Staleness Detection

Books stuck in submitted/in-progress states for 48+ hours get rolled back to the previous stage. Zombie jobs (processing >24h) are force-completed.
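A sketch of the triage rule, using the thresholds from the text (status names and the function shape are assumptions):

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=48)   # stuck submitted/in-progress books
ZOMBIE_AFTER = timedelta(hours=24)  # jobs processing for too long

def triage(status: str, updated_at: datetime, now: datetime) -> str:
    """Decide what the cron does with a possibly-stuck record."""
    age = now - updated_at
    if status in ("submitted", "in-progress") and age >= STALE_AFTER:
        return "rollback"        # send the book back to the previous stage
    if status == "processing" and age >= ZOMBIE_AFTER:
        return "force-complete"  # zombie job: mark it done and move on
    return "leave"
```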

Emergency Stop

Selective phase pausing via system_config. Both submission AND completion phases are guarded — in-flight work can't cascade through paused stages.
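A minimal sketch of the guard, assuming a per-stage set of paused phases in system_config (the exact schema is an assumption):

```python
# Hypothetical shape for the system_config pause flags.
paused = {"translate": {"submission", "completion"}, "ocr": {"submission"}}

def allowed(stage: str, phase: str) -> bool:
    """A stage/phase pair proceeds only if it is not selectively paused."""
    return phase not in paused.get(stage, set())
```

Guarding completion as well as submission is what stops in-flight work from cascading through a paused stage.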

Circuit Breakers

3+ consecutive batch failures trigger automatic Lambda fallback. Quota exhaustion (HTTP 429) immediately switches backend. OCR loops capped at 3 retries.

Non-blocking Enrichment

Metadata, FT check, summary, index, and chapters are non-critical: persistent failures skip ahead rather than stalling the entire pipeline.

Write Queue Isolation

AI workers (600+ concurrent) never write to MongoDB directly. Results flow through an SQS write queue to a Writer Lambda capped at 50 concurrent instances.
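The shape of the pattern, with an in-memory deque standing in for SQS and a dict standing in for MongoDB (everything here is a sketch, not the real worker code):

```python
from collections import deque

write_queue: deque[dict] = deque()  # stands in for the SQS write queue

def worker_finish(page_id: str, result: str) -> None:
    """An AI worker enqueues its result instead of opening a DB connection."""
    write_queue.append({"page_id": page_id, "result": result})

def writer_drain(db: dict) -> None:
    """The capped writer pool is the only component that touches the store."""
    while write_queue:
        msg = write_queue.popleft()
        db[msg["page_id"]] = msg["result"]
```

Because only the 50 writer instances hold database connections, 600+ workers finishing at once produce queue depth rather than a connection storm.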

Special Behaviors

English Modernization

English books from before 1700 are modernized (Early Modern English to Modern English) instead of translated. The output is stored in the same translation.data field, so all downstream processing works identically.

Split Detection

Digitized books often have two-page spreads scanned as a single image. On import, the system samples pages and uses aspect ratio analysis (< 0.9 = single, > 1.3 = spread) to flag books that need splitting. Crop coordinates (0-1000 scale) are computed for each half. Heuristic, ML, and Gemini vision methods exist for per-page refinement.
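The aspect-ratio rule and the naive midpoint crops can be sketched directly from the numbers above (function names are illustrative):

```python
def classify_page(width: int, height: int) -> str:
    """Aspect-ratio rule from the text: < 0.9 is a single page, > 1.3 is a
    two-page spread; anything in between is left for per-page refinement."""
    ratio = width / height
    if ratio < 0.9:
        return "single"
    if ratio > 1.3:
        return "spread"
    return "ambiguous"

def midpoint_crops() -> tuple[tuple[int, int, int, int], ...]:
    """Naive left/right crops on the 0-1000 scale (x0, y0, x1, y1); the
    heuristic/ML/vision passes refine these per page."""
    return ((0, 0, 500, 1000), (500, 0, 1000, 1000))
```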

FIFO Context Chain

Translation uses an SQS FIFO queue — pages process in order per book. Each Lambda invocation fetches the previous page's translation for terminology consistency and sentence continuity. Batch API is never used for translation.
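The context chain reduces to a strictly ordered fold over the book's pages; this sketch abstracts the Lambda invocation behind a `translate_fn` callable (an assumption for illustration).

```python
def translate_in_order(pages, translate_fn):
    """Pages translate strictly in order; each call receives the previous
    page's output as context (terminology and sentence continuity)."""
    out, prev = [], None
    for text in pages:
        prev = translate_fn(text, context=prev)
        out.append(prev)
    return out
```

The FIFO queue guarantees this ordering per book, which is why Batch API (unordered) cannot be used for translation.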

Page Revisions

Every version of OCR and translation text is preserved in the page_revisions collection — AI, batch, manual, contributor. If a page was manually edited, re-processing creates a backup snapshot first.

Multi-Column Rendering

OCR prompts detect multi-column layouts (<columns>N</columns> metadata + <column-break/> inline markers). The reader renders these as CSS grid layouts, with a fallback midpoint split when only the metadata tag exists.
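A sketch of the reader's column-splitting logic, using the two markers named above (the function and its fallback behavior for more than two columns are simplifications):

```python
import re

def split_columns(ocr_text: str) -> list[str]:
    """Use <column-break/> markers when present; with only a
    <columns>N</columns> tag, fall back to a midpoint split."""
    m = re.search(r"<columns>(\d+)</columns>", ocr_text)
    n = int(m.group(1)) if m else 1
    body = re.sub(r"<columns>\d+</columns>\s*", "", ocr_text)
    if "<column-break/>" in body:
        return [col.strip() for col in body.split("<column-break/>")]
    if n > 1:
        mid = len(body) // 2
        return [body[:mid].strip(), body[mid:].strip()]
    return [body.strip()]
```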

Cost per Book

Based on gemini-3-flash-preview pricing. A typical 300-page book costs roughly $1.50 to fully process through all stages.

Step                  Cost/page   300-page book
OCR (Lambda)          $0.0023     $0.68
OCR (Batch API)       $0.0011     $0.34
Translation           $0.0022     $0.66
Summary + Index       –           $0.04
Chapter Extraction    –           $0.01
Image Extraction      $0.0016     ~$0.05
Metadata + FT Check   –           $0.008
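A quick check that the per-book column sums to the quoted figure, taking the Lambda OCR row (the Batch route would be about $0.34 cheaper):

```python
PAGES = 300

# Per-book costs from the table above, using the Lambda OCR row.
costs = {
    "ocr_lambda": 0.0023 * PAGES,   # ≈ $0.69
    "translation": 0.0022 * PAGES,  # $0.66
    "summary_index": 0.04,
    "chapters": 0.01,
    "images": 0.05,                 # only illustration-type pages are scanned
    "metadata_ft": 0.008,
}
total = sum(costs.values())  # ≈ $1.46, i.e. "roughly $1.50"
```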

Processing History

The pipeline above describes the current architecture. Books processed earlier went through different models, prompts, and workflows. Most books are being gradually reprocessed to current standards as capacity allows.

Period | What changed | Impact
Dec 2025 | Initial pipeline: manual imports, basic OCR with gemini-2.0-flash, no automated orchestration | First ~500 books processed manually
Jan 2026 | Split detection for two-page spreads: cascade of heuristic pixel analysis → ML model → Gemini vision. Computes crop coordinates (0-1000 scale) for each half | Digitized books with facing pages render correctly
Jan 2026 | Image archiving moved to a dedicated Hetzner server. Pages archived from external sources (IA, Gallica, MDZ) to Vercel Blob with thumbnail generation | Long-term image availability, faster page loads
Jan 2026 | Auto pipeline cron introduced: books flow through stages automatically every 10 minutes instead of manual triggering | Fully automated processing for new imports
Jan–Feb 2026 | Upgraded OCR from gemini-2.0/2.5-flash to gemini-3-flash-preview. Prompt evolved through v1-v6, adding page-type classification, multi-column detection, image bounding boxes | ~250k+ pages at current quality, ~75k older pages being reprocessed
Feb 2026 | English modernization: pre-1700 English books get Early Modern → Modern English instead of translation. Output stored in the same field, so all downstream processing works identically | ~200 English books modernized automatically
Feb 2026 | Multi-column rendering: OCR detects column layouts (<columns>N</columns> + <column-break/> markers), reader renders as CSS grid | Two-column Renaissance books display correctly
Feb 2026 | Write Queue architecture: AI workers (600+ concurrent) no longer write directly to MongoDB. Results flow through SQS → Writer Lambda (50 max concurrency) | Eliminated connection storms during large batches
Feb 2026 | Translation switched to Lambda FIFO queue only (Batch API retired for translation). Each page receives the previous page's translation as context for terminology consistency | Better translation quality, ~30k stale translations being redone
Feb 2026 | Added Metadata Enrichment (language, categories, display title, source work dates) and First Translation Check stages to the pipeline | ~280 confirmed first English translations identified
Feb 2026 | Chapter extraction and enrichment (summary + index) split into a dedicated cron so they don't starve translation of time budget | More reliable enrichment, no pipeline stalls
Mar 2026 | Image extraction filters by page type: only pages classified as illustration, diagram, map, frontispiece, or mixed are scanned | ~80-90% cost reduction for image extraction

Beyond the Pipeline

Once books reach “complete,” several downstream systems build on the processed data.

Scholarly Editions & DOI

Completed translations can be published as citable scholarly editions with DOIs via Zenodo. Each edition is an immutable snapshot with content hash, AI-generated introduction and methodology, contributor tracking (AI + human), and citations in APA and BibTeX formats. Versioning is supported — republishing creates a new version.

Gallery

Extracted illustrations, emblems, diagrams, and engravings from all books form a browsable gallery with AI-generated museum-style descriptions, subject tags, bounding boxes, and quality scores. Gallery images feed the social media system for automated tweet generation.

Encyclopedia

AI-generated book indexes (people, places, concepts) are aggregated into an encyclopedia with cross-references across the entire library. Entity pages show every book that mentions a person or concept, with page-level links.

First Translation Identification

A two-stage verification system identifies books that are the first known English translation of a historical text. Stage 1 (AI metadata enrichment) classifies during processing. Stage 2 (LLM deep knowledge check) verifies against known translations, academic publishers, and dissertations.

GitHub Sync

On completion, full book text (OCR + translations) is synced to a public GitHub repository as plain text files — a permanent, version-controlled archive independent of the web application and database.

MCP Server & API

The full library is accessible via a Model Context Protocol server (for AI assistants) and a REST API. Seven tools let AI models search books, read translations, and find illustrations across 4,800+ historical texts.

Automation & Human Review

The pipeline is a stigmergic system — each safety mechanism, backpressure limit, and error handler is a trace left by a past failure that shapes future processing. The environment itself encodes intelligence: books flow through paths carved by previous experience, with human judgment required only at the boundaries.

Fully Automated

Pipeline orchestration
Two crons advance books through all 10 stages every 10 minutes — no human trigger needed after import.
OCR, translation, image extraction
Lambda workers process pages via SQS queues. Backpressure, retries, and failure recovery are all automatic.
Metadata enrichment
AI classifies language, categories, description, display title, source work dates, and first-translation status.
Staleness & zombie detection
Books stuck for 48h get rolled back. Jobs stuck for 24h are force-completed. No alert fatigue — the system self-heals.
Gallery, search index, encyclopedia
Image extraction results flow into the gallery. Book indexes aggregate into encyclopedia entries. All automatic.
Page count sync & data integrity
Crons refresh cached counts, sync gallery metadata, and archive images on a fixed schedule.

Human-Initiated, Then Automated

Book imports
A human decides which book to import and from which source. Everything after — page creation, archiving, OCR, translation — is automatic.
Re-enrollment
Failed or completed books can be re-enrolled in the pipeline. One API call, then the cron takes over.
Emergency stop / resume
Selective phase pausing is a human decision. The system respects it at both submission and completion boundaries.
Edition publishing
A human initiates publication and chooses the license. Front matter generation, content hashing, and DOI minting are automated.

Requires Human Judgment

Curation & acquisition
Which books belong in the library? What sources to prioritize? These are scholarly decisions no AI makes.
QA audit
Comparing OCR against page images, verifying metadata against title pages, checking translation quality — still requires expert eyes.
Failed book triage
Books in "needs attention" state require a human to diagnose the problem: bad source images, corrupt metadata, import failures.
First translation review
When the AI is uncertain whether a translation exists, the "needs review" disposition flags it for a human scholar to verify.
Page corrections
Manual OCR and translation edits — fixing names, dates, or passages the AI got wrong. Revisions are preserved and protected from re-processing.
Pipeline tuning
Backpressure limits, model selection, prompt updates, cost/quality tradeoffs — the meta-decisions that shape the environment the pipeline runs in.

This library is built in the open.

If you spot an error, have a suggestion, or just want to say hello — we’d love to hear from you.