WEZUT OCR Datasets

WEZUT OCR Dataset ver. 1.00 consists of 176 non-uniformly illuminated document images captured with a Nikon N70 digital single-lens reflex (DSLR) camera, together with the reference text file 00_GT.txt containing the commonly used placeholder text "Lorem ipsum". The images are photos of documents printed in 5 popular typefaces (Arial, Times New Roman, Calibri, Courier and Verdana), each with typical style variations (normal, bold, italic and bold italic).

This dataset has been developed at the Faculty of Electrical Engineering (WE) of the West Pomeranian University of Technology in Szczecin, Poland (ZUT). The authors (Hubert Michalak and Krzysztof Okarma) are with the Department of Signal Processing and Multimedia Engineering (KPSiIM). The dataset is intended mainly for the evaluation of image binarization algorithms developed for the pre-processing of non-uniformly illuminated document images that are subsequently passed to various OCR engines for text recognition.
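One common way to score an OCR result against a reference text is a character-level accuracy derived from the Levenshtein edit distance. The sketch below only illustrates that idea and is not the evaluation protocol used by the dataset authors. The function names `levenshtein` and `character_accuracy` are hypothetical, and collapsing whitespace before the comparison is an assumption made here, not a requirement of the dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(recognized: str, reference: str) -> float:
    """Return 1 - edit_distance / len(reference), clipped at 0.

    Assumption: runs of whitespace (including line breaks) are collapsed
    to single spaces in both texts before they are compared.
    """
    rec = " ".join(recognized.split())
    ref = " ".join(reference.split())
    if not ref:
        raise ValueError("reference text is empty")
    return max(0.0, 1.0 - levenshtein(rec, ref) / len(ref))
```

In practice, `reference` would hold the contents of 00_GT.txt and `recognized` the text returned by an OCR engine for one binarized image, so the score can be compared across binarization methods.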

If you publish results obtained with the WEZUT OCR Dataset, please cite one or more of the following papers (published in Open Access mode):
Michalak H., Okarma K.: Robust combined binarization method of non-uniformly illuminated document images for alphanumerical character recognition. Sensors, vol. 20 no. 10, article no. 2914, 2020, DOI: 10.3390/s20102914
Michalak H., Okarma K.: Improvement of image binarization methods using image preprocessing with local entropy filtering for alphanumerical character recognition purposes. Entropy, vol. 21 no. 6, article no. 562, 2019, DOI: 10.3390/e21060562
Michalak H., Okarma K.: Fast binarization of unevenly illuminated document images based on background estimation for optical character recognition purposes. Journal of Universal Computer Science, vol. 25 no. 6, pp. 627-646, 2019, DOI: 10.3217/jucs-025-06-062

WEZUT OCR Dataset file for download (ZIP - 85.5 MB)
Mirror download linked at DIB website hosted by Centro de Informática (CIn), Universidade Federal de Pernambuco (UFPE), Brazil

WEZUT Video OCR Dataset ver. 1.00 contains 20 non-uniformly illuminated video sequences captured with an Olympus Tough TG-5 12 MPix camera with Multi-motion Movie IS stabilization. Individual frames of these video files show the same placeholder text "Lorem ipsum". The dataset is split into two parts: 12 files recorded in typical non-uniform lighting conditions and 8 video sequences affected by the presence of shadows.

This dataset has been developed at the Faculty of Electrical Engineering (WE) of the West Pomeranian University of Technology in Szczecin, Poland (ZUT). Both authors (Piotr Lech and Krzysztof Okarma) are with the Department of Signal Processing and Multimedia Engineering (KPSiIM). The dataset is intended mainly for the evaluation of image binarization and quality assessment algorithms developed for the pre-processing of non-uniformly illuminated document images and videos that are subsequently passed to various OCR engines for text recognition.

If you publish results obtained with the WEZUT Video OCR Dataset, please cite the following paper (published in Open Access mode):

Okarma K., Lech P.: A method supporting fault-tolerant optical text recognition from video sequences recorded with handheld cameras. Engineering Applications of Artificial Intelligence, vol. 123 Part B, article no. 106330, 2023, DOI: 10.1016/j.engappai.2023.106330

WEZUT Video OCR Dataset file for download (ZIP - 1.27 GB)

Permission to use, copy, or modify these databases and their documentation for educational and research purposes only and without fee is hereby granted, provided that this copyright notice and the original authors' names appear on all copies and supporting documentation. These databases shall not be modified without first obtaining written permission of the authors. The authors make no representations about the suitability of these databases for any purpose. Each of them is provided "as is" without express or implied warranty.