ICDAR2019-HDRC Winners
Winning team of ICDAR2019-HDRC

ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records

The objective of ICDAR 2019 HDRC Chinese is to analyze the layout, and then to detect and recognize the textlines and characters, of a large historical document collection of more than 20,000 pages kindly provided by FamilySearch.

FamilySearch-DB is a collection of Chinese manuscripts that have been chosen for the complexity of their layout, semantic structure, and font. All manuscripts are annotated using Aletheia, an advanced system for accurate yet cost-effective ground truthing of large amounts of documents. The annotations of the manuscripts are available in PAGE XML format, a sophisticated XML schema that is a component of the PAGE (Page Analysis and Ground truth Elements) Format Framework.
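
As a rough illustration of how PAGE XML annotations might be read, the following Python sketch extracts textline polygons and transcriptions using only the standard library. The embedded sample document (file name, coordinates, and text) is made up for illustration, and the sketch assumes a PAGE version in which each Coords element carries a points attribute; the competition files may use a different schema version or nesting.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample; not taken from the competition data.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 200,100 200,900 100,900"/>
      <TextLine id="l1">
        <Coords points="110,110 190,110 190,890 110,890"/>
        <TextEquiv><Unicode>王氏家譜</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def read_textlines(xml_text):
    """Return a list of (line id, polygon points, transcription) tuples."""
    root = ET.fromstring(xml_text)
    # Take the namespace from the root tag instead of hard-coding a schema version.
    ns = {"pc": root.tag[1:root.tag.index("}")]}
    lines = []
    for line in root.iterfind(".//pc:TextLine", ns):
        coords = line.find("pc:Coords", ns)
        # The points attribute is a space-separated list of "x,y" pairs.
        points = [
            tuple(int(v) for v in pair.split(","))
            for pair in coords.get("points").split()
        ]
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", ns)
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        lines.append((line.get("id"), points, text))
    return lines


if __name__ == "__main__":
    for line_id, points, text in read_textlines(SAMPLE):
        print(line_id, points, text)
```

Reading the namespace from the root element keeps the sketch usable across PAGE schema versions that differ only in their namespace URI.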

We propose 3 different tasks for this competition:
Task 1: Handwritten Character Recognition on extracted textlines
Task 2: Layout Analysis on structured historical document images
Task 3: Complete, integrated textline detection and recognition on a large dataset

ICDAR 2019 HDRC Chinese is organised by the EISLAB-Machine Learning group at LTU.

Registration

Click here to send your registration by email and receive the training data.

Evaluation tools

Task 1: Available for registered participants
Task 2: Available on GitHub
Task 3: Available for registered participants

Referencing the Data

Any research publication or communication about performance results (e.g., blog posts or news articles) must cite the source of the data, i.e., the ICDAR 2019 HDRC-Chinese dataset, as follows:

@inproceedings{simistira2019icdar2019hdrc,
  archivePrefix = {arXiv},
  arxivId = {1903.03341},
  eprint = {1903.03341},
  author = {Saini, Rajkumar and Dobson, Derek and Morrey, Jon and Liwicki, Marcus and Simistira Liwicki, Foteini},
  title = {{ICDAR 2019 Historical Document Reading Challenge on Large Structured Chinese Family Records}},
  booktitle = {{to appear in 15th International Conference on Document Analysis and Recognition (ICDAR)}},
  year = {2019},
  month = {mar},
}


 

Frequently Asked Questions (FAQ)

Q1. For some images, the bounding boxes strictly include only text areas, while for some other images, they also contain a large amount of empty space with no text. Which kind of labeling is more advisable?
A1. Large whitespace areas will not harm the recognition scores, so your algorithm may produce tight boxes or looser ones, as you prefer. The annotations were made by human domain experts, and unfortunately they sometimes differ. In the evaluation for Task 2, for example, we will not take background regions within the boxes into account; for Task 3, we will focus only on the correct transcription.

Q2. There are different annotations for the space character. Which one should we use?
A2. During evaluation, we will map all space characters to a single space character (32), so it does not matter what you report.
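
For participants who want to apply a similar normalization to their own output, a minimal Python sketch follows. It assumes that "space characters" means every character in Unicode category Zs (for example the ideographic space U+3000 and the no-break space U+00A0); the exact set used by the official evaluation tools is not stated here, and the function name normalize_spaces is hypothetical.

```python
import unicodedata


def normalize_spaces(text: str) -> str:
    """Map every Unicode space separator (category Zs) to U+0020 (code 32)."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch
        for ch in text
    )


if __name__ == "__main__":
    # Ideographic space and no-break space both become a plain space.
    print(repr(normalize_spaces("a\u3000b\u00a0c d")))
```

Tabs and line breaks are not in category Zs, so this sketch leaves them unchanged.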

 

Contact

email: hdrc2019@ltu.se

Follow us on Twitter: @hdrc2019

Organizers

Foteini Liwicki, Senior Lecturer

Phone: +46 (0)920 491004
Organisation: Machine Learning, Embedded Internet Systems Lab, Department of Computer Science, Electrical and Space Engineering

Rajkumar Saini, Postdoc

Phone: +46 (0)920 493503
Organisation: Machine Learning, Embedded Internet Systems Lab, Department of Computer Science, Electrical and Space Engineering

Marcus Liwicki, Professor and Head of Subject, Chaired Professor

Phone: +46 (0)920 491006
Organisation: Machine Learning, Embedded Internet Systems Lab, Department of Computer Science, Electrical and Space Engineering