On March 5, 1950, as the young state of Israel was struggling to absorb waves of immigrants under a severe austerity policy, Prime Minister David Ben-Gurion wrote to the then Minister of Finance, Eliezer Kaplan:
[…] Above all we are obliged to redeem Hebrew Literature. There are thousands of manuscripts scattered in various libraries, left there, neglected. Only a few were printed […] but the majority […] have not been published. Many manuscripts were lost due to the troubles of time and the atrocities of enemies. Who knows how many were extinct during WW II, intentionally or coincidentally. I do not see a possibility to get hold of the original Manuscripts bring them all to Israel, but photos […] are as valuable as the original manuscripts, and that we should do, immediately […]
Ben-Gurion realized his vision and established the Institution for Photocopies of Hebrew Manuscripts in the Israel National Library, where scholars from all over the world have access to every known Hebrew manuscript. Tikkoun Sofrim belongs to a pipeline of related projects that aspire to bring this vision up to the standards of the 21st century, making all Hebrew manuscripts available both as images and as digitized texts, for scholars and for the public alike. This pipeline includes projects for the digitization of images and their availability on the web (Ktiv in the Israel National Library, https://web.nli.org.il/sites/nlis/en/manuscript, and the Friedberg Project for fragments of the Cairo Genizah, https://fgp.genizah.org/), as well as projects aimed at the systematic production of digitized texts with the help of the crowd. Following a previous trial in crowdsourcing the manual transcription of Cairo Genizah fragments (https://www.scribesofthecairogeniza.org/), in this project we turned to combining automatic transcription based on machine learning with crowdsourcing. The transcribed texts may be used in combined text and image viewers, such as those developed in the Sofer Mahir project (https://sofermahir.hypotheses.org/) for the canonical rabbinic text of the Mishnah, as well as for big-data and distant-reading analysis.
The workflows, algorithms and models developed in this project can be applied to any corpus of handwritten texts, as recently demonstrated in eScriptorium, which provides a generalized GUI for the digital recognition of handwritten documents using machine learning techniques (https://www.escriptorium.fr/).
To realize our goal of making handwritten documents accessible to a wider audience, we need to ensure that the manuscripts are transcribed correctly. The Tikkoun Sofrim project implements a framework that includes the following stages:
- Automatic transcription of manuscripts based on handwritten text recognition (HTR).
- A crowdsourcing platform for correcting the automatic transcription.
- Aggregation of the crowdsourced data to produce a recommended agreed text.
- Enveloping the texts with metadata through XML-TEI and an adequate Canonical Text Service (CTS).
- Sharing the text in single mode (i.e. library viewers presenting a single manuscript in a correlated image and text mode) and in multiple mode (a critical digital edition presenting the various manuscripts of a given work, with additional data layers such as references, parallel sources etc.).
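The aggregation stage can be illustrated with a minimal sketch. It assumes, purely for illustration, that each manuscript line receives several independent corrections from volunteers and that the agreed text is chosen by a simple majority vote at the level of the whole line. The function name `aggregate_line` and the `min_agreement` threshold are hypothetical and are not taken from the project's code, which may well use a finer-grained method.

```python
from collections import Counter


def aggregate_line(corrections, min_agreement=0.5):
    """Pick an agreed text for one manuscript line (hypothetical helper).

    corrections: list of strings, one corrected transcription per volunteer.
    Returns (text, agreement); text is None when no version is supported
    by more than `min_agreement` of the volunteers.
    """
    if not corrections:
        return None, 0.0
    # Collapse runs of whitespace so that spacing differences do not split votes.
    normalized = [" ".join(c.split()) for c in corrections]
    text, votes = Counter(normalized).most_common(1)[0]
    agreement = votes / len(normalized)
    if agreement <= min_agreement:
        # No clear majority: leave the line unresolved, e.g. for expert review.
        return None, agreement
    return text, agreement


# Three volunteers corrected the same line; two agree up to spacing.
agreed, score = aggregate_line(["abc def", "abc  def", "abc dxf"])
```

Requiring a strict majority means that ties are never resolved silently; under this assumption such lines would remain open for further corrections or for review by an expert.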
The website focuses on the first three stages, which the project has reached so far: automatic transcription by HTR, followed by crowdsourced manual correction of the automatic transcription, and finally aggregation of the results. Recently we presented a model for the implementation of the fourth stage. The manuscripts of the Hebrew Midrash Tanchuma were selected as a case study.
To the best of our knowledge, Tikkoun Sofrim is the first project to combine HTR with crowdsourced correction of the HTR output, in the manner of CATTI, for the ongoing transcription of complete literary documents. It is also the first, and so far the only one, to be fully open source with regard to both the machine learning and the user interface components.