How Processing Works | Source Library

Every book in Source Library passes through an automated pipeline that reads, translates, and enriches historical texts. The original source material is always preserved. AI processing adds layers of accessibility on top, never replacing what came before.

Books: 35k

Pages transcribed: 3.7M

Pages translated: 3.4M

Fully processed: 6.3k

The Pipeline

Import

Books arrive from digital libraries worldwide

We import from over 20 digital library sources worldwide, including Internet Archive, Gallica (BnF), the Bavarian State Library, Wellcome Collection, e-rara, and university collections across Europe and Asia. Each import fetches metadata and high-resolution page images via IIIF manifests.
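As a rough illustration of the import step, the sketch below pulls one image URL per page out of a IIIF manifest that has already been fetched and parsed into a dictionary. It follows the public IIIF Presentation 2.x and 3.x layouts; the function name `page_image_urls` is hypothetical, and Source Library's actual importer may work differently.

```python
from typing import Any


def page_image_urls(manifest: dict[str, Any]) -> list[str]:
    """Return the first image URL of every canvas, in manifest order."""
    urls: list[str] = []
    if "sequences" in manifest:
        # IIIF Presentation 2.x: sequences -> canvases -> images -> resource
        for sequence in manifest["sequences"]:
            for canvas in sequence.get("canvases", []):
                for image in canvas.get("images", [])[:1]:
                    urls.append(image["resource"]["@id"])
    else:
        # IIIF Presentation 3.x: canvases -> annotation pages -> annotations -> body
        for canvas in manifest.get("items", []):
            for annotation_page in canvas.get("items", [])[:1]:
                for annotation in annotation_page.get("items", [])[:1]:
                    urls.append(annotation["body"]["id"])
    return urls
```

Only the first image of each canvas is kept, on the assumption that one canvas corresponds to one scanned page.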

OCR

AI reads historical typefaces and handwriting

Gemini vision models extract text from page images, handling blackletter (Fraktur), early modern Latin abbreviations, ligatures, and multi-column layouts. A universal prompt calibrated across scripts and languages handles everything from Latin to Arabic to Sanskrit. Pages are classified by type (text, illustration, title page, table of contents).

Translation

Latin, German, and other languages rendered into English

Pages are translated sequentially so the AI can maintain consistent terminology and handle sentences that cross page boundaries. English books from before 1700 are modernized from Early Modern English instead. The original language text is always preserved alongside the translation.
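The sequential approach can be sketched as a loop that hands each page to the translator together with the previous page's source text and translation. This is a minimal sketch under assumptions: `translate_page` is a hypothetical callable standing in for the model call, and the amount of context actually carried between pages is not documented here.

```python
from typing import Callable

# Hypothetical signature: (page_text, previous_source, previous_translation) -> translation
Translator = Callable[[str, str, str], str]


def translate_sequentially(pages: list[str], translate_page: Translator) -> list[str]:
    """Translate pages in order, passing the previous page along as context."""
    translations: list[str] = []
    previous_source = ""
    previous_translation = ""
    for text in pages:
        translated = translate_page(text, previous_source, previous_translation)
        translations.append(translated)
        # The next page sees this one, which helps with terminology and
        # with sentences that continue across the page break.
        previous_source, previous_translation = text, translated
    return translations
```

Because each call depends on the result of the one before it, pages within a single book cannot be translated in parallel under this scheme.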

Enrichment

Summary, index, and metadata generated

AI generates a reading summary, extracts an index of people, places, concepts, and key terms, identifies the book's language and subject categories, and writes a scholarly description. This makes every book searchable and browsable.

Chapters

Structural divisions identified

Chapter and section headings are extracted from the OCR text and linked to specific pages, creating a navigable table of contents for the reader.
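A simplified picture of how headings might be linked to pages: the sketch below assumes, purely for illustration, that the OCR text marks headings with a leading `# ` and that pages are numbered from 1. Neither convention is confirmed by this page, and `build_toc` is a hypothetical name.

```python
def build_toc(pages: list[str]) -> list[dict]:
    """Collect heading lines and the page number each one appears on."""
    toc: list[dict] = []
    for page_number, text in enumerate(pages, start=1):
        for line in text.splitlines():
            stripped = line.strip()
            # Assumed convention: a heading line starts with "# ".
            if stripped.startswith("# "):
                toc.append({"title": stripped[2:].strip(), "page": page_number})
    return toc
```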

Images

Illustrations and emblems detected and cataloged

AI vision scans every page for illustrations, emblems, diagrams, and decorative elements. Each detection includes bounding box coordinates, a description, subject tags, and a museum-style label. High-quality detections appear in the gallery.
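One way to picture a detection record is shown below. The field names, the normalized 0 to 1 coordinate system, the numeric `quality` score, and the 0.8 threshold are all assumptions made for this sketch; the page only states that detections carry a bounding box, description, tags, and label, and that high-quality ones reach the gallery.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    page: int
    bbox: tuple[float, float, float, float]  # (x, y, width, height), assumed normalized to 0..1
    description: str
    tags: tuple[str, ...]
    label: str      # museum-style label
    quality: float  # hypothetical score between 0 and 1


def to_pixels(bbox: tuple[float, float, float, float],
              image_width: int, image_height: int) -> tuple[int, int, int, int]:
    """Convert a normalized bounding box to pixel coordinates for cropping."""
    x, y, w, h = bbox
    return (round(x * image_width), round(y * image_height),
            round(w * image_width), round(h * image_height))


def gallery_candidates(detections: list[Detection],
                       min_quality: float = 0.8) -> list[Detection]:
    """Keep only detections at or above the (assumed) quality threshold."""
    return [d for d in detections if d.quality >= min_quality]
```

Storing normalized coordinates means the same box can be applied to a thumbnail or to the full-resolution scan without recomputing it.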

Publication

Scholarly editions with DOIs

Completed books can be published as citable scholarly editions with DOIs minted through Zenodo. Each edition is an immutable snapshot with generated front matter, contributor attribution, and exports in multiple formats.

Quality & Provenance

Every step in the pipeline is logged, versioned, and auditable. We treat these texts as cultural heritage. Processing should be transparent, not a black box.

Prompt versioning

OCR and translation prompts are stored as immutable versions in our database. Every page records which prompt version produced its text, so results can be compared across versions and reprocessed with improvements.
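The idea of immutable prompt versions can be sketched with a frozen record and an append-only store. This is an in-memory illustration only: the class names, the integer version numbers, and the content hash are assumptions, not a description of the actual database schema.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a stored version cannot be modified afterwards
class PromptVersion:
    name: str
    version: int
    text: str
    sha256: str


class PromptStore:
    """Append-only store: editing a prompt always creates a new version."""

    def __init__(self) -> None:
        self._versions: dict[str, list[PromptVersion]] = {}

    def add(self, name: str, text: str) -> PromptVersion:
        history = self._versions.setdefault(name, [])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        record = PromptVersion(name, len(history) + 1, text, digest)
        history.append(record)
        return record

    def latest(self, name: str) -> PromptVersion:
        return self._versions[name][-1]


def pages_to_reprocess(page_versions: dict[int, int], latest_version: int) -> list[int]:
    """Given page number -> prompt version used, list pages produced by older prompts."""
    return sorted(page for page, used in page_versions.items() if used < latest_version)
```

Recording the version on each page is what makes a later comparison or selective reprocessing possible.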

Original preservation

The original page image URL is never overwritten. Archived copies, cropped versions, and thumbnails are layered on top. If our processing ever introduces errors, the source material is always available.

Audit trail

Every AI call is logged with model, token count, cost, and result status. Admin actions, metadata changes, and pipeline state transitions all feed into a per-book history timeline visible on each book's page.
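To make the audit trail concrete, here is a sketch of a log entry and a per-book rollup. The fields mirror the ones listed above (model, token count, cost, result status); the remaining names, the USD unit, and the `"ok"` status value are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AICallLog:
    book_id: str
    step: str        # e.g. "ocr" or "translation"
    model: str
    tokens: int
    cost_usd: float  # assumed unit
    status: str      # assumed values: "ok" or "error"


def summarize_by_book(logs: list[AICallLog]) -> dict[str, dict]:
    """Total calls, tokens, cost, and failures for each book."""
    totals: dict[str, dict] = defaultdict(
        lambda: {"calls": 0, "tokens": 0, "cost_usd": 0.0, "failures": 0}
    )
    for entry in logs:
        book = totals[entry.book_id]
        book["calls"] += 1
        book["tokens"] += entry.tokens
        book["cost_usd"] += entry.cost_usd
        if entry.status != "ok":
            book["failures"] += 1
    return dict(totals)
```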

Snapshot protection

When a page has been manually edited by a human, reprocessing automatically creates a backup snapshot first. Manual corrections are never silently overwritten by automation.
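The protection rule can be expressed in a few lines. The sketch below is illustrative only: the `Page` fields and the decision to clear the manual-edit flag after reprocessing are assumptions, and the real system presumably stores snapshots in a database rather than on the page object.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    number: int
    text: str
    manually_edited: bool = False
    snapshots: list[str] = field(default_factory=list)


def reprocess(page: Page, new_text: str) -> Page:
    """Replace a page's text, backing up human edits before overwriting them."""
    if page.manually_edited:
        # Preserve the human-corrected text so it can be restored later.
        page.snapshots.append(page.text)
    page.text = new_text
    # Assumption: the new text is machine output, so the flag is cleared.
    page.manually_edited = False
    return page
```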

Universal OCR prompt

A single calibrated prompt handles all scripts and languages, from Latin and Fraktur to Arabic, Sanskrit, and Armenian. It was calibrated on thousands of historical pages to handle abbreviations, ligatures, and period-appropriate conventions.

Open standards

Every book publishes a IIIF manifest. Translations follow W3C Web Annotation conventions. Scholarly editions receive DOIs via Zenodo. Data is accessible via API and MCP server.
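For readers unfamiliar with the W3C Web Annotation model, a translation attached to a page could look roughly like the structure built below. The `@context`, `type`, `TextualBody`, and `target` keys come from the W3C specification; the function name, the example identifiers, and the choice to leave out a `motivation` are assumptions, since this page does not say exactly how Source Library shapes its annotations.

```python
import json


def translation_annotation(annotation_id: str, canvas_id: str,
                           text: str, language: str = "en") -> dict:
    """Build a W3C Web Annotation whose body is a page's translated text."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": annotation_id,
        "type": "Annotation",
        "body": {
            "type": "TextualBody",
            "value": text,
            "language": language,
            "format": "text/plain",
        },
        # The target is the IIIF canvas (page) the translation belongs to.
        "target": canvas_id,
    }


if __name__ == "__main__":
    example = translation_annotation(
        "https://example.org/annotations/1",       # hypothetical identifier
        "https://example.org/iiif/book/canvas/1",  # hypothetical canvas URI
        "In the beginning...",
    )
    print(json.dumps(example, indent=2))
```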

Collection

Languages

Visual: 13,272 books

Latin: 3,349 books

German: 1,772 books

English: 1,260 books

French: 856 books

Greek: 725 books

Chinese: 563 books

Dutch: 455 books

Sumerian: 377 books

Sanskrit: 363 books

Sources

Wikimedia Commons: 19,981 books

Internet Archive: 5,210 books

Bibliotheca Philosophica Hermetica: 2,227 books

Rijksmuseum: 1,887 books

CMC Prins Frederik, Bibliotheca Klossiana: 794 books

National Gallery of Art: 590 books

Münchener DigitalisierungsZentrum (Bavarian State Library): 583 books

The Metropolitan Museum of Art: 488 books

See It in Action

Every book page shows the original scan alongside its transcription and translation. Book detail pages include a history timeline showing every processing step with timestamps, models used, and costs.

Browse the Library | Image Gallery | API & MCP Server
