6  HTR/OCR software

Within the context of text recognition and analysis there are a number of commercial and open-source options available. Below I list the most common frameworks and some of their advantages and disadvantages. An in-depth discussion on how best to choose a framework within the context of your project is given in the next chapter (REFERENCE).

6.1 Commercial

6.1.1 Transkribus

A dominant player in the transcription of historical texts is the Transkribus platform. It provides layout detection, text transcription and custom model training (using ground truth data generated on the platform) without any coding. It offers commercial support options and a growing community of users, including a shared model zoo. The platform is currently built around the PyLaia (Python) library (see also below).

Pro:                            Con:
user friendly                   expensive
support / documentation         vendor lock-in
allows custom model training
model sharing

6.1.2 Google / Amazon / Microsoft APIs

All three big tech platforms offer OCR-based application programming interfaces (APIs) which you can access from (Python) scripts.

In particular, HTR/OCR is covered by:

Increasingly, these toolboxes are being consolidated into (multi-modal) Generative Pre-trained Transformer (GPT, as in ChatGPT) based models. These models can provide impressive results on common tasks, but tend to perform less well on less common or more complex data. Their advantage often lies in their large training corpus.

Pro:                            Con:
support / documentation         vendor lock-in
scalability                     requires programming
relatively cheap                custom model training is often complex or not possible
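
To give an idea of what scripted access to such an API looks like, the sketch below sends one image to Amazon Textract through the boto3 library and keeps the recognised text lines. It is an illustration only: the function names, the region and the file path are assumptions made for this example, and it presumes that boto3 is installed and that AWS credentials are configured. The Google and Microsoft services follow the same pattern (send an image, parse a structured response), but use their own client libraries.

```python
def extract_lines(response):
    """Return the text of every LINE block in a Textract-style response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def transcribe_image(path, region_name="eu-west-1"):
    """Send one image to Amazon Textract and return its text lines.

    The region is an arbitrary choice for this example; pick the one
    that matches your own AWS account.
    """
    import boto3  # imported here so extract_lines() works without boto3

    client = boto3.client("textract", region_name=region_name)
    with open(path, "rb") as handle:
        response = client.detect_document_text(
            Document={"Bytes": handle.read()}
        )
    return extract_lines(response)


if __name__ == "__main__":
    # "letter.jpg" is a hypothetical scan of a single page
    for line in transcribe_image("letter.jpg"):
        print(line)
```

Keeping the parsing step in its own function makes it possible to test or reuse it on stored responses without calling (and paying for) the service again.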

6.2 Open source

6.2.1 eScriptorium

eScriptorium is a software platform created to make text layout analysis and recognition easy. The underlying text recognition is based on the Kraken framework, for which it serves as an interface. The interface allows the user to annotate data and train custom models with no coding required, similar to Transkribus. Despite providing much the same features as Transkribus, eScriptorium is not a program as such, but a service to be run on a server or in a docker image. This does require knowledge of how to set up and manage docker instances, or how to do a full server install. Good introductions to the use of eScriptorium are provided through the standard documentation and a course by the University of Mannheim.

Pro:                            Con:
user friendly                   complex installation for novices
OK documentation
full workflow control
interoperability
shared models

6.2.1.1 Installation & Use

A basic docker install is provided on the project code pages.

6.2.2 ArkIndex

ArkIndex is a document processing platform similar to Transkribus. Moreover, this open-source platform is made by Teklia, the company behind the PyLaia library underpinning most of Transkribus. It therefore offers much the same functionality with a different interface.

6.2.2.1 Installation & Use

A basic docker install is provided on the project documentation pages.

6.2.3 OCR4all

OCR4all is an OCR platform built around the Calamari text recognition engine and the LAREX layout analysis tool. Similar to eScriptorium and Transkribus, it aims at making the transcription of documents easy, without the need for coding. Similar to eScriptorium, it is not a program as such, but a service to be run on a server or in a docker image.

Pro:                            Con:
user friendly                   complex installation for novices
OK documentation
full workflow control
interoperability
shared models

6.2.3.1 Installation & Use

The software runs as a docker service and can be installed using the following command:

sudo docker run -p 1476:8080 \
    -u `id -u root`:`id -g $USER` \
    --name ocr4all \
    -v $PWD/data:/var/ocr4all/data \
    -v $PWD/models:/var/ocr4all/models/custom \
    -it uniwuezpd/ocr4all

6.2.4 Tesseract

Tesseract is a popular open-source OCR program, originally developed at Hewlett-Packard, later sponsored by Google, and now maintained by the open-source community. Out of the box, Tesseract does not support handwritten text recognition, as the included models are not trained on handwritten data.

However, the software does allow for the retraining of models. As Tesseract has long been a mainstay of OCR work in the open-source community, a zoo of third-party software providing interfaces and additional functionality exists, as well as a Python interface (pytesseract) to make data processing easier.
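
As a small illustration of the pytesseract interface, the sketch below runs Tesseract on one image and keeps only the words recognised with a reasonable confidence. It assumes that both the Tesseract program and the pytesseract and Pillow packages are installed; the function names, the confidence threshold of 60 and the file name are choices made for this example only.

```python
def confident_words(data, min_conf=60.0):
    """Keep the words whose confidence is at least min_conf.

    `data` is the dictionary returned by pytesseract.image_to_data()
    with output_type=Output.DICT; rows that do not describe a word
    carry an empty text and a confidence of -1.
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


def read_page(path, lang="eng", min_conf=60.0):
    """Run Tesseract on one image and return its confident words."""
    # imported here so confident_words() can be used on stored results
    from PIL import Image
    from pytesseract import Output, image_to_data

    data = image_to_data(Image.open(path), lang=lang, output_type=Output.DICT)
    return confident_words(data, min_conf)


if __name__ == "__main__":
    # "page.png" is a hypothetical scan of a printed page
    print(" ".join(read_page("page.png")))
```

Filtering on confidence is a simple way to flag the parts of a page that need manual checking; a suitable threshold depends on the material and should be chosen by inspecting a sample of the output.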

6.2.5 Custom pipelines and libraries

Most of the above-mentioned software options are mature and require limited coding knowledge to operate. However, I would be remiss not to mention the underlying HTR/OCR programming libraries. Depending on the use case, one could benefit from using these low-level libraries rather than the more user-friendly platforms built around them. The most prominent Python libraries for HTR/OCR work are Kraken (as used by eScriptorium), PyLaia (as used by Transkribus), EasyOCR and PaddleOCR. Other software libraries worth mentioning are YOLO and doc-UFCN, which both cover layout and text detection needs.

The text recognition libraries listed above provide machine learning setups to train handwritten text recognition models of the CNN + LSTM/RNN + CTC kind. In addition, Kraken and PaddleOCR provide document layout analysis (segmentation) options.
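
To make the CTC part of such a model more concrete, the sketch below shows the simplest way to turn the per-time-step output of a CNN + LSTM/RNN network into text: greedy (best path) decoding. It is a plain Python illustration and not the code used by any of the libraries above; the function name, the toy alphabet and the scores are made up for this example, and class 0 is assumed to be the CTC "blank".

```python
def ctc_greedy_decode(scores, alphabet, blank=0):
    """Decode a (time steps x classes) score table with the greedy CTC rule.

    `alphabet[i]` is the character for class i; the entry at the blank
    index is only a placeholder and never ends up in the output.
    """
    # 1. pick the best-scoring class at every time step
    best = [max(range(len(step)), key=step.__getitem__) for step in scores]

    # 2. collapse repeated classes and 3. drop the blanks
    chars = []
    previous = None
    for index in best:
        if index != previous and index != blank:
            chars.append(alphabet[index])
        previous = index
    return "".join(chars)


if __name__ == "__main__":
    alphabet = "-ab"  # class 0 is the blank, written here as "-"
    scores = [
        [0.10, 0.80, 0.10],  # a
        [0.20, 0.70, 0.10],  # a (repeat, collapsed)
        [0.90, 0.05, 0.05],  # blank, separates the two a's
        [0.10, 0.60, 0.30],  # a
        [0.10, 0.20, 0.70],  # b
        [0.30, 0.10, 0.60],  # b (repeat, collapsed)
    ]
    print(ctc_greedy_decode(scores, alphabet))  # prints "aab"
```

The blank class is what allows a model to output double letters: without the blank between the second and fourth time step the two a's would be merged into one.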

Pro:                            Con:
flexible                        complex installation
full workflow control           coding required