4  HTR/OCR software and data

Within the context of text recognition and analysis there are a number of commercial and open-source options available. Below I list the most common frameworks and some of their advantages and disadvantages. An in-depth discussion on how best to choose a framework within the context of your project is given in the next chapter (REFERENCE).

4.1 Commercial

4.1.1 Transkribus

A dominant player in the transcription of historical texts is the Transkribus platform. This platform provides a way to apply layout detection, text transcription and custom model training (on ground truth data generated within the platform) without coding. It offers commercial support options and a growing community of users, including a shared model zoo. The platform is currently built around the PyLaia (python) library (see also below).

Pro:
- user friendly
- support / documentation
- allows custom model training
- model sharing

Con:
- expensive
- vendor lock-in

4.1.2 Google / Amazon / Microsoft APIs

All three big tech platforms offer OCR-based application programming interfaces (APIs) which you can access from (python) scripts.

In particular, HTR/OCR is covered by:

- the Google Cloud Vision API (and Document AI)
- Amazon Textract
- Microsoft Azure AI Document Intelligence

Increasingly there is a consolidation of these toolboxes into (multi-modal) Generative Pre-trained Transformer (GPT, as in ChatGPT) based models. These models provide impressive results on common tasks, but perform poorly on less common or more complex data; their advantage often lies in the size of their training corpus.
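As a minimal sketch of such scripted access, the snippet below sends a scanned page to the Google Cloud Vision API using the google-cloud-vision python client (this assumes a billing-enabled project and configured credentials; the file name page.jpg is a placeholder):

from google.cloud import vision

# authenticates via the GOOGLE_APPLICATION_CREDENTIALS environment variable
client = vision.ImageAnnotatorClient()

# read a scanned page from disk (placeholder file name)
with open("page.jpg", "rb") as f:
    image = vision.Image(content=f.read())

# document_text_detection targets dense text, including handwriting
response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)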

Pro:
- support / documentation
- scalability
- relatively cheap

Con:
- vendor lock-in
- requires programming
- custom model training is often complex or not possible

4.2 Open source

4.2.1 eScriptorium

eScriptorium is a software platform created to make text layout analysis and recognition easy. The underlying text recognition is based on the Kraken framework, for which it serves as an interface. Similar to Transkribus, the interface allows the user to annotate data and train custom models, with no coding required. Despite providing much the same features as Transkribus, eScriptorium is not a program as such, but a service to be run on a server or in a docker image. This does require knowledge of how to set up and manage docker instances, or how to do a full server install. Good introductions to the use of eScriptorium are provided through the standard documentation and a course by the University of Mannheim.

Pro:
- user friendly
- OK documentation
- full workflow control
- interoperability
- shared models

Con:
- complex installation for novices

4.2.1.1 Installation & Use

A basic docker install is provided on the project code pages.

4.2.2 ArkIndex

ArkIndex is a document processing platform similar to Transkribus. Moreover, this open-source platform is made by Teklia, the company behind the PyLaia library underpinning most of Transkribus. It therefore offers much the same functionality through a different interface.

4.2.2.1 Installation & Use

A basic docker install is provided on the project documentation pages.

4.2.3 OCR4all

OCR4all is an OCR platform built around the Calamari text recognition engine and the LAREX layout analysis tool. Similar to eScriptorium and Transkribus, it aims to make the transcription of documents easy, without the need for coding. As with eScriptorium, the setup is not a program as such, but a service to be run on a server or in a docker image.

Pro:
- user friendly
- OK documentation
- full workflow control
- interoperability
- shared models

Con:
- complex installation for novices

4.2.3.1 Installation & Use

The software runs as a docker service and can be installed using the following command:

# expose the container's web interface (port 8080) on host port 1476
# and mount local data/ and models/ directories into the container
sudo docker run -p 1476:8080 \
    -u `id -u root`:`id -g $USER` \
    --name ocr4all \
    -v $PWD/data:/var/ocr4all/data \
    -v $PWD/models:/var/ocr4all/models/custom \
    -it uniwuezpd/ocr4all

4.2.4 Tesseract

Tesseract is a popular open-source OCR program, originally developed at Hewlett-Packard, later maintained by Google and now by the open-source community. Out of the box, Tesseract does not support handwritten text recognition, as the included models are not trained on handwritten data.

However, the software does allow for the retraining of models. Since Tesseract has long been a mainstay of open-source OCR work, a zoo of third-party software providing interfaces and additional functionality exists, including a python interface (pytesseract) which makes data processing easier.
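As a minimal sketch, the snippet below transcribes a scanned page through pytesseract (assuming the tesseract binary and its English model are installed; the file name page.png is a placeholder):

import pytesseract
from PIL import Image

# run the default (printed text) English model on a scanned page
text = pytesseract.image_to_string(Image.open("page.png"), lang="eng")
print(text)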

4.2.5 Custom pipelines and libraries

Most of the above-mentioned software options are mature and require limited coding knowledge to operate. However, I would be remiss not to mention the underlying HTR/OCR programming libraries. Depending on the use case, one could benefit from using these low-level libraries rather than the more user friendly platforms built around them. The most prominent python libraries for HTR/OCR work are Kraken (as used by eScriptorium), PyLaia (used by Transkribus), EasyOCR and PaddleOCR. Other software libraries to mention are YOLO and doc-UFCN, which both cover layout and text detection needs.

All these libraries provide machine learning setups to train handwritten text recognition models of the CNN + LSTM/RNN + CTC kind. In addition, Kraken and PaddleOCR provide document layout analysis (segmentation) options.
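To give an impression of working at this level, the sketch below uses EasyOCR to detect and transcribe the text regions in a page image (the detection and recognition models are downloaded on first use; the file name page.jpg is a placeholder):

import easyocr

# load the text detection + recognition models for English
# (weights are downloaded on first use)
reader = easyocr.Reader(["en"])

# readtext returns a list of (bounding box, text, confidence) tuples
for bbox, text, confidence in reader.readtext("page.jpg"):
    print(f"{confidence:.2f}\t{text}")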

Pro:
- flexible
- full workflow control

Con:
- complex installation
- coding required

4.3 Data

Methodologically (see above) the problem of text transcription seems to be solved, with various software solutions available. So, what is holding back (universal) open-source HTR/OCR? In practice, it is generally data that holds HTR back.

Given the many variations in handwritten text, ML algorithms need to be trained on (“see”) a wide variety of handwritten characters in order to, firstly, transcribe similarly styled handwritten text and, secondly, potentially generalize to other, adjacent styles. How close two documents are in writing style determines how well a trained model will perform on this task.

Consequently, the more variations in handwritten text styles you train an ML algorithm on, the easier it will be to transcribe a wide variety of text styles. In short, the bottleneck in automated transcription is gathering sufficient training data (for your use case). Although the ML code might be open-source, many large training datasets are not shared as generously. It can be argued that, within the context of FAIR research practices, ML code disseminated without the training data or model parameters for a particular study is decidedly not open-source. A similar argument has been made within the context of the recent flurry of supposedly open-source Large Language Models (LLMs) in the vein of ChatGPT.

The lack of access to either the training data or a pre-trained model limits the re-use of a model in a new context. One cannot take such a model and fine-tune it, i.e. let it “see” new text styles. In short, if you only have the underlying model code you will always have to train a model from scratch using your own, often limited, dataset. This context is important to understand, as it is how transcription platforms keep you tied to their paid service.

Increasingly, the need to share data and models openly has come into focus. For example, HTR United is an initiative to collect various HTR/OCR transcription datasets using a common meta-data scheme, in an effort to break this pattern.