Text digitization, recognition and analysis

Author

Koen Hufkens

Published

January 7, 2025

Preface

These are the materials for the course “Text recognition and analysis, 6-7 Feb. 2025” at the Leibniz-Institut für Europäische Geschichte (IEG), Mainz, with lessons learned from the extension of the Congo basin eco-climatological data recovery and valorisation (COBECORE) project research efforts with the Free University Brussels, Belgium. This document serves as a (personal) reference and as a general introduction to all things Handwritten Text Recognition/Optical Character Recognition (HTR/OCR).

This reference gives an overview of the most common tools and (data) pitfalls for historical (handwritten) text recognition, though much of it applies to other kinds of text as well. I will also briefly discuss the initial digitization and potential citizen science components of such projects, drawing on my experience leading the COBECORE project. Throughout, I cover the practical issues of text recognition projects and how to resolve them efficiently and cost-effectively. It is a practical tool, not an introduction to machine learning: it offers guidance on what it takes to start, and complete, a text recognition and analysis effort.

Note

This is not a machine learning reference! Drastic simplifications are made, using analogies, for the sake of clarity due to the interdisciplinary nature of transcription projects. Computer and data science majors might find these references “wrong”. However, the goal is not to be mathematically correct, but to communicate the processes of a text recognition project broadly.