ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records

The objective of ICDAR 2019 HDRC Chinese is to analyze the page layout and then detect and recognize the textlines and characters of a large historical document collection of more than 20,000 pages, kindly provided by FamilySearch.

FamilySearch-DB is a collection of Chinese manuscripts chosen for the complexity of their layout in terms of semantic structure and font. All manuscripts are annotated using Aletheia, an advanced system for accurate yet cost-effective ground truthing of large amounts of documents. The annotations of the manuscripts are available in PAGE XML format, a sophisticated XML schema that is a component of the PAGE (Page Analysis and Ground truth Elements) Format Framework.
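As a rough illustration of how such annotations can be read, the sketch below pulls textline polygons and transcriptions out of a PAGE XML document using only the Python standard library. The sample document is hand-written for this example and is not taken from the dataset; the function name read_text_lines, the namespace version in the sample, and the assumption that each TextLine carries a Coords element with a points attribute and a TextEquiv/Unicode transcription are illustrative and may not match every file in the collection.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample; real dataset files may use a
# different schema version or additional elements.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="200">
    <TextRegion id="r0">
      <Coords points="10,10 40,10 40,190 10,190"/>
      <TextLine id="l0">
        <Coords points="12,12 38,12 38,188 12,188"/>
        <TextEquiv><Unicode>張氏家譜</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def read_text_lines(xml_text):
    """Return a list of (line id, polygon points, transcription) tuples."""
    root = ET.fromstring(xml_text)
    lines = []
    # "{*}" matches any namespace (Python 3.8+), so the code does not
    # depend on one particular PAGE schema version.
    for line in root.iterfind(".//{*}TextLine"):
        coords = line.find("{*}Coords")
        points = []
        if coords is not None and coords.get("points"):
            # "x1,y1 x2,y2 ..." -> [(x1, y1), (x2, y2), ...]
            points = [
                tuple(int(v) for v in pair.split(","))
                for pair in coords.get("points").split()
            ]
        # Only the line-level transcription is read, not word- or glyph-level ones.
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        lines.append((line.get("id"), points, text))
    return lines
```

Calling read_text_lines(SAMPLE) yields one tuple per textline, which can then be used to crop line images or to compare transcriptions.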

We propose three tasks for this competition:
Task 1: Handwritten Character Recognition on extracted textlines
Task 2: Layout Analysis on structured historical document images
Task 3: Complete, integrated textline detection and recognition on a large dataset

ICDAR 2019 HDRC Chinese is organised by the EISLAB-Machine Learning group at LTU.


Click here to send your registration by email and receive the training data.

Evaluation tools

Task 1: Available for registered participants
Task 2: Available on GitHub
Task 3: Available for registered participants

Referencing the Data

In any research publication or communication about performance results (e.g., in blogs or news articles), the source of the data, i.e., the ICDAR 2019 HDRC-Chinese dataset, must be referenced as follows:

  archivePrefix = {arXiv},
  arxivId = {1903.03341},
  eprint = {1903.03341},
  author = {Simistira Liwicki, Foteini and Saini, Rajkumar and Dobson, Derek and Morrey, Jon and Liwicki, Marcus},
  title = {{ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records}},
  booktitle={{to appear in 15th International Conference on Document Analysis and Recognition (ICDAR)}},  
  year = {2019},
  month = {mar}


Frequently Asked Questions (FAQ)

Q1. For some images, the bounding boxes strictly include only text areas, while for some other images, they also contain a large amount of empty space with no text. Which kind of labeling is more advisable?
A1. Large areas of whitespace will not harm the recognition scores, so your algorithm may produce tighter or looser boxes, as you prefer. The annotations were made by human domain experts, and unfortunately they sometimes differ. In the evaluation for Task 2, for example, we will not take background regions within the boxes into account; for Task 3, we will focus only on the correct transcription.

Q2. There are different annotations for the space character. Which one should we use?
A2. During evaluation, we will map all space characters to a single space character (ASCII code 32), so it does not matter which one you report.
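The mapping described in A2 can be sketched in a few lines of Python. This is not the organizers' evaluation code: the function name normalize_spaces is hypothetical, and the choice to treat every character in the Unicode "space separator" category (Zs), such as the no-break space U+00A0 and the ideographic space U+3000, as a space is an assumption about which characters are affected. Each such character is replaced one-to-one by U+0020; runs of spaces are not collapsed.

```python
import unicodedata


def normalize_spaces(text):
    """Replace every Unicode space separator (category Zs) with U+0020."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in text
    )
```

Applying the same function to both the predicted and the reference transcription before comparing them makes the choice of space character irrelevant to the result.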



Foteini Simistira Liwicki 
Rajkumar Saini
Marcus Liwicki


email: hdrc2019@ltu.se

Follow us on Twitter: @hdrc2019