How We Process Books

From scanned page images to searchable, translated text. Every step is automated, auditable, and open.

Every book in Source Library passes through an automated pipeline that reads, translates, and enriches historical texts. The original source material is always preserved. AI processing adds layers of accessibility on top, never replacing what came before.

35k books
3.7M pages transcribed
3.4M pages translated
6.3k fully processed

The Pipeline

1. Import
2. Archive
3. OCR
4. Translation
5. Enrichment
6. Chapters
7. Images
8. Publication

Step 1: Import

Books arrive from digital libraries worldwide

We import from over 20 digital library sources worldwide, including Internet Archive, Gallica (BnF), the Bavarian State Library, Wellcome Collection, e-rara, and university collections across Europe and Asia. Each import fetches metadata and high-resolution page images via IIIF manifests.
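As a rough illustration of what an import step reads, the sketch below walks a IIIF Presentation 3 manifest and collects one full-size image URL per canvas. The sample manifest, the example.org URLs, and the function names are invented for illustration. The actual importer is not described here and may work differently; many libraries still serve Presentation 2 manifests, which are structured differently.

```python
from typing import Any, Optional


def _first_painting_body(canvas: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the body of the first 'painting' annotation on a canvas, if any."""
    for anno_page in canvas.get("items", []):
        for anno in anno_page.get("items", []):
            if anno.get("motivation") == "painting":
                return anno.get("body")
    return None


def page_image_urls(manifest: dict[str, Any]) -> list[str]:
    """Collect one image URL per canvas from a IIIF Presentation 3 manifest."""
    urls: list[str] = []
    for canvas in manifest.get("items", []):
        body = _first_painting_body(canvas)
        if not isinstance(body, dict):
            continue  # no image, or a structure this sketch does not handle
        services = body.get("service", [])
        if services:
            # An Image API service is advertised: ask for the full region at maximum size.
            base = services[0].get("id") or services[0].get("@id")
            urls.append(f"{base}/full/max/0/default.jpg")
        else:
            # No image service: fall back to the static image URL.
            urls.append(body["id"])
    return urls


# Hypothetical two-page manifest used only to exercise the function.
SAMPLE_MANIFEST = {
    "@context": "http://iiif.io/api/presentation/3/context.json",
    "id": "https://example.org/iiif/book1/manifest",
    "type": "Manifest",
    "label": {"la": ["Liber exempli"]},
    "items": [
        {
            "id": "https://example.org/iiif/book1/canvas/p1",
            "type": "Canvas",
            "items": [{
                "type": "AnnotationPage",
                "items": [{
                    "type": "Annotation",
                    "motivation": "painting",
                    "body": {
                        "id": "https://example.org/iiif/img/p1/full/max/0/default.jpg",
                        "type": "Image",
                        "service": [{"id": "https://example.org/iiif/img/p1",
                                     "type": "ImageService3"}],
                    },
                    "target": "https://example.org/iiif/book1/canvas/p1",
                }],
            }],
        },
        {
            "id": "https://example.org/iiif/book1/canvas/p2",
            "type": "Canvas",
            "items": [{
                "type": "AnnotationPage",
                "items": [{
                    "type": "Annotation",
                    "motivation": "painting",
                    "body": {"id": "https://example.org/static/p2.jpg", "type": "Image"},
                    "target": "https://example.org/iiif/book1/canvas/p2",
                }],
            }],
        },
    ],
}

urls = page_image_urls(SAMPLE_MANIFEST)
```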

Step 2: Archive

Images preserved on our infrastructure

Original page images are archived from external sources to our own storage, ensuring long-term availability. Original URLs are always preserved for provenance. Thumbnails are generated for fast browsing.

Step 3: OCR

AI reads historical typefaces and handwriting

Gemini vision models extract text from page images, handling blackletter (Fraktur), early modern Latin abbreviations, ligatures, and multi-column layouts. A universal prompt calibrated across scripts and languages handles everything from Latin to Arabic to Sanskrit. Pages are classified by type (text, illustration, title page, table of contents).

Step 4: Translation

Latin, German, and other languages rendered into English

Pages are translated sequentially so the AI can maintain consistent terminology and handle sentences that cross page boundaries. English books from before 1700 are modernized from Early Modern English instead. The original language text is always preserved alongside the translation.
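One way to picture sequential translation is a loop that hands each page to the model together with the tail of the previous page's translation. The sketch below assumes a hypothetical `translate_page` callable standing in for the model call, and a made-up `context_chars` window. It is not the project's actual code. How English books from 1700 onward are handled is not stated above, so the sketch leaves that case to the caller.

```python
from typing import Callable, Optional

# Hypothetical model call: (page_text, previous_translation_tail, mode) -> output text
Translator = Callable[[str, str, str], str]


def choose_mode(language: str, year: Optional[int]) -> str:
    """Pre-1700 English is modernized; other books are translated.

    Assumption: the caller only passes books that need one of the two modes.
    """
    if language == "en" and year is not None and year < 1700:
        return "modernize"
    return "translate"


def translate_book(
    pages: list[str],
    language: str,
    year: Optional[int],
    translate_page: Translator,
    context_chars: int = 500,
) -> list[str]:
    """Translate pages in order, passing along the end of the previous result."""
    mode = choose_mode(language, year)
    results: list[str] = []
    previous = ""
    for text in pages:
        # The tail of the previous translation gives the model recent terminology
        # and the start of any sentence that continues onto this page.
        result = translate_page(text, previous[-context_chars:], mode)
        results.append(result)
        previous = result
    return results
```

Because each page depends on the one before it, a loop like this cannot be parallelized within a single book. That is the cost of keeping terminology consistent.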

Step 5: Enrichment

Summary, index, and metadata generated

AI generates a reading summary, extracts an index of people, places, concepts, and key terms, identifies the book's language and subject categories, and writes a scholarly description. This makes every book searchable and browsable.

Step 6: Chapters

Structural divisions identified

Chapter and section headings are extracted from the OCR text and linked to specific pages, creating a navigable table of contents for the reader.
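To make the idea concrete, here is a minimal sketch of building a table of contents from per-page OCR text. It assumes, purely for illustration, that the OCR output marks headings with leading `#` characters (Markdown style). The real heading detection is not described here and may rely on the model rather than a pattern.

```python
import re
from dataclasses import dataclass

# Assumed convention: one to three '#' characters mark a heading line.
HEADING = re.compile(r"^(#{1,3})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


@dataclass
class TocEntry:
    title: str
    level: int  # 1 = chapter, 2 = section, 3 = subsection
    page: int   # 1-based page number where the heading appears


def build_toc(pages: list[str]) -> list[TocEntry]:
    """Scan each page's OCR text for heading lines and link them to the page."""
    toc: list[TocEntry] = []
    for page_number, text in enumerate(pages, start=1):
        for match in HEADING.finditer(text):
            toc.append(TocEntry(match.group(2), len(match.group(1)), page_number))
    return toc


sample_pages = [
    "# Liber Primus\nIncipit liber primus.",
    "Text without any heading.",
    "## Caput II\nDe elementis.",
]
toc = build_toc(sample_pages)
```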

Step 7: Images

Illustrations and emblems detected and cataloged

AI vision scans every page for illustrations, emblems, diagrams, and decorative elements. Each detection includes bounding box coordinates, a description, subject tags, and a museum-style label. High-quality detections appear in the gallery.
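A detection record of this kind might look like the sketch below. The field names, the use of page-relative fractions for the bounding box, the `quality` score, and the 0.8 gallery threshold are all assumptions made for illustration. The text above only says that detections carry coordinates, a description, tags, and a label, and that high-quality ones reach the gallery.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    page: int
    # Assumed format: (x, y, width, height) as fractions of the page size.
    bbox: tuple[float, float, float, float]
    description: str
    tags: list[str]
    label: str      # museum-style label
    quality: float  # hypothetical score between 0 and 1


def to_pixels(
    bbox: tuple[float, float, float, float], page_width: int, page_height: int
) -> tuple[int, int, int, int]:
    """Convert a fractional bounding box to pixel coordinates for cropping."""
    x, y, w, h = bbox
    return (
        round(x * page_width),
        round(y * page_height),
        round(w * page_width),
        round(h * page_height),
    )


def gallery_items(detections: list[Detection], min_quality: float = 0.8) -> list[Detection]:
    """Keep only detections at or above the (assumed) quality threshold."""
    return [d for d in detections if d.quality >= min_quality]


detections = [
    Detection(12, (0.1, 0.2, 0.5, 0.25), "Emblem of a phoenix", ["emblem", "bird"],
              "Phoenix rising from flames, woodcut", 0.93),
    Detection(13, (0.0, 0.0, 0.1, 0.1), "Decorative initial", ["initial"],
              "Ornamental initial letter", 0.41),
]
shown = gallery_items(detections)
crop = to_pixels(shown[0].bbox, page_width=2000, page_height=3000)
```

Storing the box as fractions rather than pixels keeps a detection valid for the archived image, the thumbnail, and any other resolution of the same page.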

Step 8: Publication

Scholarly editions with DOIs

Completed books can be published as citable scholarly editions with DOIs minted through Zenodo. Each edition is an immutable snapshot with generated front matter, contributor attribution, and exports in multiple formats.

Quality & Provenance

Every step in the pipeline is logged, versioned, and auditable. We treat these texts as cultural heritage. Processing should be transparent, not a black box.

Prompt versioning

OCR and translation prompts are stored as immutable versions in our database. Every page records which prompt version produced its text, so results can be compared across versions and reprocessed with improvements.
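The versioning idea can be sketched with an append-only store and a per-page version number. The class and function names below are hypothetical, and an in-memory dictionary stands in for the database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a published version cannot be modified
class PromptVersion:
    kind: str     # "ocr" or "translation"
    version: int  # 1, 2, 3, ... per kind
    text: str


class PromptStore:
    """Append-only store: new versions are added, old ones are never changed."""

    def __init__(self) -> None:
        self._versions: dict[str, list[PromptVersion]] = {}

    def publish(self, kind: str, text: str) -> PromptVersion:
        history = self._versions.setdefault(kind, [])
        prompt = PromptVersion(kind, len(history) + 1, text)
        history.append(prompt)
        return prompt

    def latest(self, kind: str) -> PromptVersion:
        return self._versions[kind][-1]


def stale_pages(page_versions: dict[int, int], latest_version: int) -> list[int]:
    """Pages whose text came from an older prompt and could be reprocessed."""
    return sorted(p for p, v in page_versions.items() if v < latest_version)


store = PromptStore()
v1 = store.publish("ocr", "Transcribe this page.")
v2 = store.publish("ocr", "Transcribe this page, expanding abbreviations.")

# Page number -> prompt version that produced its current text
page_versions = {1: 1, 2: 2, 3: 1}
to_reprocess = stale_pages(page_versions, store.latest("ocr").version)
```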

Original preservation

The original page image URL is never overwritten. Archived copies, cropped versions, and thumbnails are layered on top. If our processing ever introduces errors, the source material is always available.
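One way to model this layering is an immutable record in which derived URLs can be added but the original cannot be replaced. The field names and the fallback order in `display_url` are assumptions for illustration only.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PageImage:
    original_url: str                    # provenance: set once at import
    archived_url: Optional[str] = None   # copy on the library's own storage
    cropped_url: Optional[str] = None    # derived version
    thumbnail_url: Optional[str] = None  # small preview for browsing

    def display_url(self) -> str:
        """Prefer derived layers, falling back to the original source."""
        return self.cropped_url or self.archived_url or self.original_url


def add_layer(image: PageImage, **layers: str) -> PageImage:
    """Return a copy with extra layers; refuse to touch the original URL."""
    if "original_url" in layers:
        raise ValueError("original_url is provenance and cannot be replaced")
    return replace(image, **layers)


page = PageImage(original_url="https://example.org/source/p1.jpg")
page = add_layer(page, archived_url="https://example.org/archive/p1.jpg")
page = add_layer(page, thumbnail_url="https://example.org/thumbs/p1.jpg")
```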

Audit trail

Every AI call is logged with model, token count, cost, and result status. Admin actions, metadata changes, and pipeline state transitions all feed into a per-book history timeline visible on each book's page.
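A log entry with these fields could be sketched as follows. The record layout, the function names, and the `example-model` placeholder are hypothetical, and an in-memory list stands in for whatever storage the library actually uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class AiCallRecord:
    book_id: str
    step: str        # e.g. "ocr", "translation"
    model: str
    tokens: int
    cost_usd: float
    status: str      # e.g. "ok", "error"
    timestamp: datetime


def log_ai_call(log: list[AiCallRecord], book_id: str, step: str, model: str,
                tokens: int, cost_usd: float, status: str) -> AiCallRecord:
    """Append one record per AI call, stamped with the current UTC time."""
    record = AiCallRecord(book_id, step, model, tokens, cost_usd, status,
                          datetime.now(timezone.utc))
    log.append(record)
    return record


def book_timeline(log: list[AiCallRecord], book_id: str) -> list[AiCallRecord]:
    """All records for one book, oldest first."""
    return sorted((r for r in log if r.book_id == book_id), key=lambda r: r.timestamp)


def total_cost(log: list[AiCallRecord], book_id: str) -> float:
    """Sum of the logged cost for one book."""
    return sum(r.cost_usd for r in log if r.book_id == book_id)


log: list[AiCallRecord] = []
log_ai_call(log, "book-1", "ocr", "example-model", 1200, 0.002, "ok")
log_ai_call(log, "book-2", "ocr", "example-model", 900, 0.0015, "ok")
log_ai_call(log, "book-1", "translation", "example-model", 1800, 0.003, "ok")
```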

Snapshot protection

When a page has been manually edited by a human, reprocessing automatically creates a backup snapshot first. Manual corrections are never silently overwritten by automation.
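The rule can be expressed in a few lines: check for a human edit, back up, then overwrite. The `Page` fields and function names are hypothetical, and the sketch keeps snapshots in a list rather than a database.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    text: str
    edited_by_human: bool = False
    snapshots: list[str] = field(default_factory=list)


def reprocess(page: Page, new_text: str) -> None:
    """Replace the page text with fresh output, backing up human edits first."""
    if page.edited_by_human:
        # Keep the manual correction so it can be restored later.
        page.snapshots.append(page.text)
    page.text = new_text
    page.edited_by_human = False


def restore_latest(page: Page) -> None:
    """Bring back the most recent snapshot, if there is one."""
    if page.snapshots:
        page.text = page.snapshots.pop()
        page.edited_by_human = True


page = Page(text="Corrected by an editor", edited_by_human=True)
reprocess(page, "Fresh OCR output")
```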

Universal OCR prompt

A single prompt handles all scripts and languages, from Latin and Fraktur to Arabic, Sanskrit, and Armenian. It was calibrated on thousands of historical pages to handle abbreviations, ligatures, and period-appropriate conventions.

Open standards

Every book is published with a IIIF manifest. Translations follow W3C Web Annotation conventions. Scholarly editions receive DOIs via Zenodo. Data is accessible via API and MCP server.
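For readers unfamiliar with the Web Annotation model, a page translation expressed as an annotation might look roughly like the sketch below. The identifiers are invented, and the `supplementing` motivation is borrowed from IIIF Presentation 3 as an assumption. The exact shape of the library's annotations is not specified here.

```python
import json


def translation_annotation(annotation_id: str, canvas_id: str, text: str,
                           language: str = "en") -> dict:
    """Build a W3C Web Annotation that attaches translated text to a page canvas."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": annotation_id,
        "type": "Annotation",
        # Assumed motivation; defined by IIIF for content that supplements a canvas.
        "motivation": "supplementing",
        "body": {
            "type": "TextualBody",
            "value": text,
            "language": language,
            "format": "text/plain",
        },
        "target": canvas_id,
    }


annotation = translation_annotation(
    "https://example.org/iiif/book1/annotation/p1-en",
    "https://example.org/iiif/book1/canvas/p1",
    "Here begins the first book.",
)
serialized = json.dumps(annotation, indent=2)
```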

Collection

Languages

Visual: 13,272 books
Latin: 3,349 books
German: 1,772 books
English: 1,260 books
French: 856 books
Greek: 725 books
Chinese: 563 books
Dutch: 455 books
Sumerian: 377 books
Sanskrit: 363 books

Sources

Wikimedia Commons: 19,981 books
Internet Archive: 5,210 books
Bibliotheca Philosophica Hermetica: 2,227 books
Rijksmuseum: 1,887 books
CMC Prins Frederik, Bibliotheca Klossiana: 794 books
National Gallery of Art: 590 books
Münchener DigitalisierungsZentrum (Bavarian State Library): 583 books
The Metropolitan Museum of Art: 488 books

See It in Action

Every book page shows the original scan alongside its transcription and translation. Book detail pages include a history timeline showing every processing step with timestamps, models used, and costs.

This library is built in the open.

If you spot an error, have a suggestion, or just want to say hello, we’d love to hear from you.