How We Process Books
From scanned page images to searchable, translated text. Every step is automated, auditable, and open.
Every book in Source Library passes through an automated pipeline that reads, translates, and enriches historical texts. The original source material is always preserved. AI processing adds layers of accessibility on top, never replacing what came before.
The Pipeline
Import
Books arrive from digital libraries worldwide
We import from over 20 digital library sources worldwide, including Internet Archive, Gallica (BnF), the Bavarian State Library, Wellcome Collection, e-rara, and university collections across Europe and Asia. Each import fetches metadata and high-resolution page images via IIIF manifests.
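As an illustration of how page images can be read out of a manifest, here is a minimal sketch that assumes the IIIF Presentation API 3.0 layout (canvases containing annotation pages containing "painting" annotations). The function name and the sample manifest are hypothetical and are not Source Library's actual import code. Manifests in the older 2.x layout would need different traversal.

```python
from typing import Any


def page_image_urls(manifest: dict[str, Any]) -> list[str]:
    """Collect painted image URLs, in canvas order, from a IIIF Presentation 3.0 manifest."""
    urls: list[str] = []
    for canvas in manifest.get("items", []):
        for anno_page in canvas.get("items", []):
            for anno in anno_page.get("items", []):
                body = anno.get("body")
                # Only "painting" annotations with a single Image body are handled here;
                # other body shapes (e.g. Choice) are skipped in this sketch.
                if (
                    anno.get("motivation") == "painting"
                    and isinstance(body, dict)
                    and body.get("type") == "Image"
                ):
                    urls.append(body["id"])
    return urls


def _canvas(image_url: str) -> dict[str, Any]:
    """Build a trimmed, hypothetical canvas holding one page image."""
    return {
        "type": "Canvas",
        "items": [{
            "type": "AnnotationPage",
            "items": [{
                "type": "Annotation",
                "motivation": "painting",
                "body": {"id": image_url, "type": "Image"},
            }],
        }],
    }


# Hypothetical two-page manifest, reduced to the fields used above.
SAMPLE_MANIFEST = {
    "type": "Manifest",
    "label": {"en": ["Example book"]},
    "items": [
        _canvas("https://example.org/iiif/p1/full/max/0/default.jpg"),
        _canvas("https://example.org/iiif/p2/full/max/0/default.jpg"),
    ],
}

print(page_image_urls(SAMPLE_MANIFEST))
```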
Archive
Images preserved on our infrastructure
Original page images are archived from external sources to our own storage, ensuring long-term availability. Original URLs are always preserved for provenance. Thumbnails are generated for fast browsing.
OCR
AI reads historical typefaces and handwriting
Gemini vision models extract text from page images, handling blackletter (Fraktur), early modern Latin abbreviations, ligatures, and multi-column layouts. A universal prompt calibrated across scripts and languages handles everything from Latin to Arabic to Sanskrit. Pages are classified by type (text, illustration, title page, table of contents).
Translation
Latin, German, and other languages rendered into English
Pages are translated sequentially so the AI can maintain consistent terminology and handle sentences that cross page boundaries. English books from before 1700 are not translated; their Early Modern English is modernized instead. The original language text is always preserved alongside the translation.
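The sequential approach can be sketched as a loop that hands each page the tail of the previous page's translation. This is only an illustration under stated assumptions: the `translate` callable, its two-argument signature, and the `context_chars` limit are hypothetical, and the real pipeline may carry context differently.

```python
from typing import Callable

# Hypothetical signature: (page_text, context_from_previous_page) -> translated_text
Translator = Callable[[str, str], str]


def translate_sequentially(
    pages: list[str],
    translate: Translator,
    context_chars: int = 500,
) -> list[str]:
    """Translate pages in order, passing the end of the previous translation as context."""
    translations: list[str] = []
    context = ""
    for text in pages:
        translated = translate(text, context)
        translations.append(translated)
        # Keep only the tail of this translation as context for the next page.
        context = translated[-context_chars:]
    return translations


def fake_translate(text: str, context: str) -> str:
    """Stand-in for a model call; it only records how much context it received."""
    return f"[ctx={len(context)}] {text.upper()}"


for line in translate_sequentially(["ab", "cd"], fake_translate):
    print(line)
```

Because each call depends on the previous result, pages within one book cannot be translated in parallel in this scheme; that is the trade-off for consistent terminology.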
Enrichment
Summary, index, and metadata generated
AI generates a reading summary, extracts an index of people, places, concepts, and key terms, identifies the book's language and subject categories, and writes a scholarly description. This makes every book searchable and browsable.
Chapters
Structural divisions identified
Chapter and section headings are extracted from the OCR text and linked to specific pages, creating a navigable table of contents for the reader.
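A rough sketch of linking headings to pages follows. It assumes, purely for illustration, that the OCR text marks headings with Markdown-style `#` prefixes; the actual heading markup, the `TocEntry` type, and the `build_toc` function are hypothetical.

```python
import re
from typing import NamedTuple


class TocEntry(NamedTuple):
    level: int   # 1 for "#", 2 for "##", 3 for "###"
    title: str
    page: int


# Assumed convention: a heading is a line starting with one to three "#" characters.
HEADING = re.compile(r"^(#{1,3})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def build_toc(pages: dict[int, str]) -> list[TocEntry]:
    """Scan per-page OCR text in page order and record each heading with its page number."""
    toc: list[TocEntry] = []
    for page_number in sorted(pages):
        for match in HEADING.finditer(pages[page_number]):
            toc.append(TocEntry(len(match.group(1)), match.group(2), page_number))
    return toc


sample_pages = {
    2: "Plain body text with no heading.",
    1: "# Liber Primus\nIn principio",
    3: "intro\n## De Igne  \nmore text",
}
for entry in build_toc(sample_pages):
    print(entry)
```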
Images
Illustrations and emblems detected and cataloged
AI vision scans every page for illustrations, emblems, diagrams, and decorative elements. Each detection includes bounding box coordinates, a description, subject tags, and a museum-style label. High-quality detections appear in the gallery.
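One way to picture a detection record and the gallery filter is sketched below. The field names, the fractional bounding-box convention, the `quality` score, and the 0.8 threshold are all assumptions made for the example; the source only states that each detection carries coordinates, a description, tags, and a label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    page: int
    # Bounding box as fractions of page width and height (an assumed convention).
    x: float
    y: float
    width: float
    height: float
    description: str
    tags: tuple[str, ...] = ()
    label: str = ""        # museum-style label
    quality: float = 0.0   # hypothetical score between 0 and 1


def gallery_items(detections: list[Detection], min_quality: float = 0.8) -> list[Detection]:
    """Keep detections at or above the threshold, best first."""
    kept = [d for d in detections if d.quality >= min_quality]
    return sorted(kept, key=lambda d: d.quality, reverse=True)


emblem = Detection(
    page=12, x=0.10, y=0.20, width=0.50, height=0.40,
    description="Emblem of a phoenix rising from flames",
    tags=("emblem", "phoenix"),
    label="Phoenix emblem, engraving",
    quality=0.93,
)
ornament = Detection(
    page=13, x=0.30, y=0.05, width=0.40, height=0.08,
    description="Decorative headpiece", quality=0.41,
)

print([d.description for d in gallery_items([ornament, emblem])])
```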
Publication
Scholarly editions with DOIs
Completed books can be published as citable scholarly editions with DOIs minted through Zenodo. Each edition is an immutable snapshot with generated front matter, contributor attribution, and exports in multiple formats.
Quality & Provenance
Every step in the pipeline is logged, versioned, and auditable. We treat these texts as cultural heritage. Processing should be transparent, not a black box.
Prompt versioning
OCR and translation prompts are stored as immutable versions in our database. Every page records which prompt version produced its text, so results can be compared across versions and reprocessed with improvements.
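The idea of immutable prompt versions can be sketched with an append-only registry. This is a simplified in-memory illustration, not the actual database schema; `PromptVersion`, `PromptRegistry`, and the use of a SHA-256 digest to detect unchanged text are assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a published version cannot be modified afterwards
class PromptVersion:
    name: str
    version: int
    text: str
    sha256: str


class PromptRegistry:
    """Append-only store: publishing changed text creates a new version."""

    def __init__(self) -> None:
        self._versions: dict[str, list[PromptVersion]] = {}

    def publish(self, name: str, text: str) -> PromptVersion:
        history = self._versions.setdefault(name, [])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        # Re-publishing identical text returns the latest version instead of adding one.
        if history and history[-1].sha256 == digest:
            return history[-1]
        created = PromptVersion(name, len(history) + 1, text, digest)
        history.append(created)
        return created

    def get(self, name: str, version: int) -> PromptVersion:
        return self._versions[name][version - 1]


registry = PromptRegistry()
v1 = registry.publish("ocr", "Transcribe the page.")
v2 = registry.publish("ocr", "Transcribe the page, expanding abbreviations.")

# A page would store (prompt name, version) next to its text for later comparison.
page_record = {"page": 7, "ocr_prompt": ("ocr", v2.version)}
print(page_record, registry.get(*page_record["ocr_prompt"]).text)
```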
Original preservation
The original page image URL is never overwritten. Archived copies, cropped versions, and thumbnails are layered on top. If our processing ever introduces errors, the source material is always available.
Audit trail
Every AI call is logged with model, token count, cost, and result status. Admin actions, metadata changes, and pipeline state transitions all feed into a per-book history timeline visible on each book's page.
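A log entry of this kind might look like the following sketch. The field names, the split into input and output tokens, the USD cost unit, and the placeholder model name are assumptions for illustration; only model, token count, cost, and result status are named in the description above.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class AICallLog:
    book_id: str
    step: str            # e.g. "ocr" or "translation"
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    status: str          # e.g. "ok" or "error"
    timestamp: str


def log_ai_call(history: dict[str, list[dict[str, Any]]], **fields: Any) -> AICallLog:
    """Stamp the call with the current UTC time and append it to the book's timeline."""
    entry = AICallLog(timestamp=datetime.now(timezone.utc).isoformat(), **fields)
    history.setdefault(entry.book_id, []).append(asdict(entry))
    return entry


# Per-book timelines keyed by a hypothetical book identifier.
history: dict[str, list[dict[str, Any]]] = {}
log_ai_call(
    history,
    book_id="book-001",
    step="ocr",
    model="example-model",  # placeholder, not a real model name
    input_tokens=1200,
    output_tokens=300,
    cost_usd=0.0021,
    status="ok",
)
print(history["book-001"][0])
```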
Snapshot protection
When a page has been manually edited by a human, reprocessing automatically creates a backup snapshot first. Manual corrections are never silently overwritten by automation.
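The rule can be sketched in a few lines. The `Page` type, its `manually_edited` flag, and the decision to clear that flag after reprocessing are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    text: str
    manually_edited: bool = False
    snapshots: list[str] = field(default_factory=list)


def reprocess(page: Page, new_text: str) -> Page:
    """Replace the page text, backing up human-edited text first."""
    if page.manually_edited:
        # Preserve the human correction before automation writes over it.
        page.snapshots.append(page.text)
    page.text = new_text
    # Assumption: fresh machine output is no longer considered manually edited.
    page.manually_edited = False
    return page


corrected = Page("Text corrected by an editor", manually_edited=True)
reprocess(corrected, "Fresh OCR output")
print(corrected.text, corrected.snapshots)
```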
Universal OCR prompt
A single calibrated prompt handles all scripts and languages, from Latin and Fraktur to Arabic, Sanskrit, and Armenian. It has been calibrated on thousands of historical pages to handle abbreviations, ligatures, and period-appropriate conventions.
Open standards
Every book publishes a IIIF manifest. Translations follow W3C Web Annotation conventions. Scholarly editions receive DOIs via Zenodo. Data is accessible via API and MCP server.
See It in Action
Every book page shows the original scan alongside its transcription and translation. Book detail pages include a history timeline showing every processing step with timestamps, models used, and costs.