Title: Preprocessing unstructured geographical data for public health analytics applications

Authors: Ajbal Khaoula; Housbane Samy; Khoubila Adil; Agoub Mohamed; Battas Omar; Bennani Othmani Mohammed

Addresses: Medical Informatics Department, Casablanca, 20000, Morocco; Clinical Neuroscience and Mental Health Laboratory, Casablanca, Morocco ' Medical Informatics Department, Casablanca, 20000, Morocco; Clinical Neuroscience and Mental Health Laboratory, Casablanca, Morocco ' Clinical Neuroscience and Mental Health Laboratory, Casablanca, Morocco; Faculty of Medicine and Pharmacy, Hassan II University, Casablanca, Morocco ' Clinical Neuroscience and Mental Health Laboratory, Casablanca, Morocco; Faculty of Medicine and Pharmacy, Hassan II University, Casablanca, Morocco ' Clinical Neuroscience and Mental Health Laboratory, Casablanca, Morocco; Faculty of Medicine and Pharmacy, Hassan II University, Casablanca, Morocco ' Medical Informatics Department, Casablanca, Morocco; Clinical Neuroscience and Mental Health Laboratory, Casablanca, 20000, Morocco

Abstract: Improperly pre-processed data pose a challenge throughout the subsequent phases of any data exploitation project. The objective of this work is therefore to structure and standardise the address field of the electronic health record (EHR) of the University Psychiatric Centre (CPU) Ibn Rochd, so that the data become computable. For this purpose, we created a transformation that combines Natural Language Processing (NLP) algorithms with a local geographical repository, implemented in the ETL (extract, transform, load) tool used to build the data warehouse. We were able to structure and standardise 1523 of 1849 addresses (82.37%) and consolidated the data to create the geographical dimension of the data warehouse, which will later be queried for population health monitoring and data mining purposes. Unstructured data and free-text fields in EHRs are a drawback for any kind of data exploitation; hence, once clinically validated, NLP techniques can be used to extract relevant information from them.
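
The abstract does not name the specific NLP algorithms or the contents of the geographical repository, so the following is only a minimal sketch of the general idea: normalise a free-text address, then fuzzy-match its tokens against a list of canonical place names. The gazetteer entries, the function names `normalise` and `match_district`, and the 0.8 similarity cutoff are all illustrative assumptions, and Python's standard `difflib` stands in for whatever matching technique the authors actually used.

```python
import difflib
import unicodedata

# Hypothetical gazetteer of canonical district names (illustrative only;
# not the local geographical repository used in the paper).
GAZETTEER = ["Ain Chock", "Hay Hassani", "Sidi Bernoussi", "Anfa", "Ben M'Sick"]


def normalise(text):
    """Lowercase, strip accents, and turn punctuation into single spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(cleaned.split())


# Map each normalised name back to its canonical spelling.
_INDEX = {normalise(name): name for name in GAZETTEER}


def match_district(address, cutoff=0.8):
    """Return the canonical district found in a free-text address, or None.

    Slides a window of tokens over the address and keeps the gazetteer
    entry with the highest similarity ratio at or above `cutoff`
    (an assumed threshold, not a value from the paper).
    """
    tokens = normalise(address).split()
    best_name, best_score = None, cutoff
    for key, name in _INDEX.items():
        width = len(key.split())
        for i in range(len(tokens) - width + 1):
            window = " ".join(tokens[i:i + width])
            score = difflib.SequenceMatcher(None, window, key).ratio()
            if score >= best_score:
                best_name, best_score = name, score
    return best_name
```

Under these assumptions, `match_district("12 rue 5, Hay Hassany, Casablanca")` returns `"Hay Hassani"` despite the misspelling, while an address with no recognisable district returns `None` and would be counted among the records left unstructured.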

Keywords: NLP; natural language processing; data structuring; data normalisation; ETL; extract transform load; data pre-processing; data warehouse; public health; geographical data; EHR; electronic health records; psychiatry; address; knowledge extraction.

DOI: 10.1504/IJDS.2019.105263

International Journal of Data Science, 2019 Vol.4 No.4, pp.305 - 319

Received: 11 Nov 2018
Accepted: 29 Dec 2019

Published online: 22 Feb 2020
