Rescuing Historical Climate Observations to Support Hydrological Research: A Case Study of Solar Radiation Data

Odunayo, Ogundepo; Sookoo, Naveela; Bathla, Gautam; Cavallin, Anthony; Persaud, Bhaleka; Szigeti, Kathy; Van Cappellen, Philippe; Lin, Jimmy

dc.contributor.author	Odunayo, Ogundepo
dc.contributor.author	Sookoo, Naveela
dc.contributor.author	Bathla, Gautam
dc.contributor.author	Cavallin, Anthony
dc.contributor.author	Persaud, Bhaleka
dc.contributor.author	Szigeti, Kathy
dc.contributor.author	Van Cappellen, Philippe
dc.contributor.author	Lin, Jimmy
dc.date.accessioned	2021-09-07 13:50:12 (GMT)
dc.date.available	2021-09-07 13:50:12 (GMT)
dc.date.issued	2021-08-16
dc.identifier.uri	https://doi.org/10.1145/3469096.3474929
dc.identifier.uri	http://hdl.handle.net/10012/17344
dc.description.abstract	The acceleration of climate change and its impact highlight the need for long-term reliable climate data at high spatiotemporal resolution to answer key science questions in cold regions hydrology. Prior to the digital age, climate records were archived on paper. For example, from the 1950s to the 1990s, solar radiation data from recording stations worldwide were published in booklets by the former Union of Soviet Socialist Republics (USSR) Hydrometeorological Service. As a result, the data are not easily accessible by most researchers. The overarching aim of this research is to develop techniques to convert paper-based climate records into a machine-readable format to support environmental research in cold regions. This study compares the performance of a proprietary optical character recognition (OCR) service with an open-source OCR tool for digitizing hydrometeorological data. We built a digitization pipeline combining different image preprocessing techniques, semantic segmentation, and an open-source OCR engine for extracting data and metadata recorded in the scanned documents. Each page contains blocks of text with station names and tables containing the climate data. The process begins with image preprocessing to reduce noise and improve quality; the page content is then segmented to detect tables, and finally run through an OCR engine for text extraction. We outline the digitization process and report on initial results, including different segmentation approaches, image preprocessing algorithms, and OCR techniques to ensure accurate extraction and organization of relevant metadata from thousands of scanned climate records. We evaluated the performance of Tesseract OCR and ABBYY FineReader on text extraction. We find that although ABBYY FineReader has better accuracy on the sample data, our custom extraction pipeline using Tesseract is efficient and scalable because it is flexible and allows for more customization.	en
dc.description.sponsorship	This work was partially funded by the Canada First Research Excellence Fund’s Global Water Futures Programme.	en
dc.language.iso	en	en
dc.publisher	ACM	en
dc.relation.ispartofseries	DocEng '21: Proceedings of the 21st ACM Symposium on Document Engineering;Article No.: 19
dc.subject	data rescue	en
dc.subject	data digitization	en
dc.subject	optical character recognition	en
dc.subject	OCR	en
dc.subject	page segmentation	en
dc.subject	table detection	en
dc.subject	climate	en
dc.subject	solar radiation	en
dc.title	Rescuing Historical Climate Observations to Support Hydrological Research: A Case Study of Solar Radiation Data	en
dc.type	Preprint	en
dcterms.bibliographicCitation	Ogundepo Odunayo, Naveela N. Sookoo, Gautam Bathla, Anthony Cavallin, Bhaleka D. Persaud, Kathy Szigeti, Philippe Van Cappellen, and Jimmy Lin. 2021. Rescuing historical climate observations to support hydrological research: a case study of solar radiation data. In Proceedings of the 21st ACM Symposium on Document Engineering (DocEng '21). Association for Computing Machinery, New York, NY, USA, Article 19, 1–4. DOI:https://doi.org/10.1145/3469096.3474929	en
uws.contributor.affiliation1	Faculty of Mathematics	en
uws.contributor.affiliation1	Faculty of Science	en
uws.contributor.affiliation2	David R. Cheriton School of Computer Science	en
uws.contributor.affiliation2	Earth and Environmental Sciences	en
uws.typeOfResource	Text	en
uws.peerReviewStatus	Reviewed	en
uws.scholarLevel	Faculty	en

Files in this item

Name: 56_Ogundepo.pdf
Size: 2.354Mb
Format: PDF
Description: Preprint article