Title: Preprocessing unstructured geographical data for public health analytics applications
Authors: Ajbal Khaoula; Housbane Samy; Khoubila Adil; Agoub Mohamed; Battas Omar; Bennani Othmani Mohammed
Addresses: Medical Informatics Department, Casablanca, 20000, Morocco; Clinical Neuroscience and Mental Health Laboratory, Casablanca, Morocco ' Medical Informatics Department, Casablanca, 20000, Morocco; Clinical Neuroscience and Mental Health Laboratory, Casablanca, Morocco ' Clinical Neuroscience and Mental Health Laboratory, Casablanca, Morocco; Faculty of Medicine and Pharmacy, Hassan II University, Casablanca, Morocco ' Clinical Neuroscience and Mental Health Laboratory, Casablanca, Morocco; Faculty of Medicine and Pharmacy, Hassan II University, Casablanca, Morocco ' Clinical Neuroscience and Mental Health Laboratory, Casablanca, Morocco; Faculty of Medicine and Pharmacy, Hassan II University, Casablanca, Morocco ' Medical Informatics Department, Casablanca, Morocco; Clinical Neuroscience and Mental Health Laboratory, Casablanca, 20000, Morocco
Abstract: Improperly pre-processed data pose a challenge throughout the subsequent phases of any data exploitation project. The objective of this work is therefore to structure and standardise the address field of the electronic health record (EHR) of the University Psychiatric Centre (CPU) Ibn Rochd, in order to obtain computable data. For this purpose, we created a transformation that combines natural language processing (NLP) algorithms with a local geographical repository, within the ETL (extract, transform, load) tool used to build the data warehouse. We were able to structure and standardise 1523 of 1849 addresses (82.37%) and consolidated the data to create the geographical dimension of the data warehouse, which will later be queried for population health monitoring and data mining purposes. Unstructured data and free-text fields in EHRs are a drawback for any kind of data exploitation; hence, when clinically validated, NLP techniques can be used to extract relevant information.
Keywords: NLP; natural language processing; data structuring; data normalisation; ETL; extract transform load; data pre-processing; data warehouse; public health; geographical data; EHR; electronic health records; psychiatry; address; knowledge extraction.
International Journal of Data Science, 2019 Vol.4 No.4, pp.305 - 319
Received: 11 Nov 2018
Accepted: 29 Dec 2019
Published online: 22 Feb 2020