Zachodniopomorski Uniwersytet Technologiczny w Szczecinie

WEZUT OCR Dataset

WEZUT OCR Dataset ver. 1.00 consists of 176 non-uniformly illuminated document images captured with a Nikon N70 Digital Single Lens Reflex (DSLR) camera, together with the reference text file 00_GT.txt, which contains the commonly used placeholder text "Lorem ipsum". The images are photographs of documents printed in five popular typefaces (Arial, Times New Roman, Calibri, Courier and Verdana), each in four styles: normal, bold, italic and bold italic.

This dataset has been developed at the Faculty of Electrical Engineering (WE) of the West Pomeranian University of Technology in Szczecin, Poland (ZUT). The authors (Hubert Michalak and Krzysztof Okarma) are with the Department of Signal Processing and Multimedia Engineering (KPSiIM).

It is intended mainly for the evaluation of image binarization algorithms developed for the pre-processing of non-uniformly illuminated document images that are subsequently passed to text recognition by various OCR engines.
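
As an illustration of how the reference text can be used in such an evaluation, the following minimal Python sketch scores the text returned by an OCR engine against the contents of 00_GT.txt. The metric (character accuracy derived from the Levenshtein edit distance), the whitespace normalization, the UTF-8 encoding of the reference file, and all function names are assumptions made for this example; they are not necessarily the measures used in the papers listed below.

```python
from pathlib import Path


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Collapse all whitespace runs so line breaks do not count as errors."""
    return " ".join(text.split())


def character_accuracy(ocr_text: str, reference: str) -> float:
    """Return 1 - (edit distance / reference length), clamped to [0, 1]."""
    ref = normalize(reference)
    hyp = normalize(ocr_text)
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - levenshtein(hyp, ref) / len(ref))


def load_reference(dataset_dir: str) -> str:
    """Read 00_GT.txt from the unpacked dataset (UTF-8 is an assumption)."""
    return Path(dataset_dir, "00_GT.txt").read_text(encoding="utf-8")
```

Here ocr_text would be the output of whichever OCR engine is run on a binarized image; a higher score for the same image and engine suggests that the binarization method preserved more of the text.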

Permission to use, copy, or modify this database and its documentation for educational and research purposes only and without fee is hereby granted, provided that this copyright notice and the original authors' names appear on all copies and supporting documentation. This database shall not be modified without first obtaining written permission of the authors. The authors make no representations about the suitability of this database for any purpose. It is provided "as is" without express or implied warranty.

When publishing results obtained with the WEZUT OCR Dataset, please cite one or more of the following papers (published in Open Access mode):
  • Michalak H., Okarma K.: Robust combined binarization method of non-uniformly illuminated document images for alphanumerical character recognition. Sensors, vol. 20 no. 10, article no. 2914, 2020, DOI: 10.3390/s20102914, (BIBTeX data)
  • Michalak H., Okarma K.: Improvement of image binarization methods using image preprocessing with local entropy filtering for alphanumerical character recognition purposes. Entropy, vol. 21 no. 6, article no. 562, 2019, DOI: 10.3390/e21060562, (BIBTeX data)
  • Michalak H., Okarma K.: Fast binarization of unevenly illuminated document images based on background estimation for optical character recognition purposes. Journal of Universal Computer Science, vol. 25 no. 6, pp. 627-646, 2019, DOI: 10.3217/jucs-025-06-062, (BIBTeX data)

Dataset file for download (ZIP)