Title: Multi-page document analysis based on format consistency and clustering
Authors: Liangcai Gao, Zhi Tang, Jing Fang, Xiaofan Lin
Addresses: Institute of Computer Science & Technology, Peking University, Beijing, 100871, China; Institute of Computer Science & Technology, Peking University, Beijing, 100871, China; Institute of Computer Science & Technology, Peking University, Beijing, 100871, China; Vobile Incorporation, Santa Clara, California, 95054, USA
Abstract: In multi-page documents, document elements belonging to the same component usually share format regularity. We call this regularity "document component intrinsic format consistency" (DCIFC). We present a new document analysis method based on DCIFC, which complements traditional document analysis methods based on the visual characteristics of document elements. One key advantage of our method is that DCIFC is stable from document to document, and thus is not affected by layout variability, which is a major challenge in document analysis. Our method uses clustering techniques to build statistical models and then applies the models to labelling document components. In this way, the method can adapt to specific documents by using the format characteristics specific to their components. We apply our method to several document recognition tasks and show its superior performance.
Keywords: document analysis; document recognition; clustering; information retrieval; multiple pages; multi-page documents; format consistency; component labelling.
DOI: 10.1504/IJCAT.2010.034531
International Journal of Computer Applications in Technology, 2010 Vol.38 No.4, pp.306 - 315
Published online: 07 Aug 2010