Title: Extraction of breast cancer biomarker status using natural language processing
Authors: Paul Dexter; Jinghua He; Jarod Baker; George Eckert; Abby Church; Ning Jackie Zhang
Addresses: Indiana University School of Medicine, Indianapolis, IN, USA; Regenstrief Institute, Indianapolis, IN, USA; Eskenazi Health, Indianapolis, IN, USA ' Merck & Co., Inc., Kenilworth, NJ, USA ' Regenstrief Institute, Indianapolis, IN, USA ' Indiana University School of Medicine, Indianapolis, IN, USA ' Indiana University Health, Indianapolis, IN, USA ' Seton Hall University, South Orange, NJ, USA
Abstract: We employed natural language processing (NLP) algorithms to extract estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) status for female patients with breast cancer from unstructured (free-text) electronic medical record (EMR) data, and to determine the prevalence of triple-negative breast cancer in the Indiana Network for Patient Care (INPC) population. We identified female patients in the INPC with a history of breast cancer over a ten-year period who had at least five oncology notes or one related pathology document. Measured against manual chart review, our NLP algorithms for extracting ER, PR, and HER2 status performed well, with sensitivity of 87.5% to 92.6%, specificity of 88.6% to 95.8%, positive predictive value (PPV) of 82.4% to 99.0%, and negative predictive value (NPV) of 85.2% to 97.7%. This study confirmed our primary hypothesis that NLP algorithms are effective in identifying important breast cancer biomarkers from unstructured data in patients with breast cancer.
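The four performance measures reported in the abstract are all ratios of counts obtained by comparing algorithm output with a reference standard such as manual chart review. The following is a minimal sketch of that arithmetic, not the authors' code: the names `ConfusionCounts` and `diagnostic_metrics` are hypothetical, and the counts are invented for illustration and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ConfusionCounts:
    """Counts from comparing NLP output with chart review (hypothetical type)."""
    tp: int  # NLP positive, chart review positive
    fp: int  # NLP positive, chart review negative
    tn: int  # NLP negative, chart review negative
    fn: int  # NLP negative, chart review positive


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # Return None rather than raising when a denominator is zero.
    return numerator / denominator if denominator else None


def diagnostic_metrics(c: ConfusionCounts) -> Dict[str, Optional[float]]:
    """Compute sensitivity, specificity, PPV, and NPV from the four counts."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # true positives among all true cases
        "specificity": _ratio(c.tn, c.tn + c.fp),  # true negatives among all non-cases
        "ppv": _ratio(c.tp, c.tp + c.fp),          # correct among NLP positives
        "npv": _ratio(c.tn, c.tn + c.fn),          # correct among NLP negatives
    }


# Illustrative counts only; NOT data from the study.
counts = ConfusionCounts(tp=90, fp=10, tn=80, fn=20)
metrics = diagnostic_metrics(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity depend only on how the algorithm behaves within the positive and negative groups, whereas PPV and NPV also depend on how common each receptor status is in the reviewed sample. This is one reason the four measures can span quite different ranges across biomarkers.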
Keywords: NLP algorithms; effective; breast cancer biomarkers; breast cancer.
DOI: 10.1504/IJCMH.2019.104365
International Journal of Computational Medicine and Healthcare, 2019 Vol.1 No.1, pp.112 - 120
Published online: 06 Jan 2020