Title: Accelerating the process of web page segmentation via template clustering
Authors: Jan Zeleny; Radek Burget
Addresses: Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic; Faculty of Information Technology, Brno University of Technology, IT4Innovations Centre of Excellence, Brno, Czech Republic
Abstract: Page segmentation is often one of the initial steps when performing data mining on a web page. In recent years, several page segmentation methods based on visual perception of the web page have been developed. In this paper, we propose a generic method for improving the efficiency of virtually all vision-based segmentation algorithms. Our method, called cluster-based page segmentation, builds on the widespread concept of web templates: it clusters web pages and performs the segmentation on the clusters instead of on each page in a cluster. To demonstrate the efficiency of our algorithm, we present experimental results gathered using three different vision-based segmentation algorithms.
Keywords: VIPS; page segmentation; vision-based page segmentation; web page segmentation; web page preprocessing; segmentation performance; template detection; template clustering; data mining; visual perception; web templates.
DOI: 10.1504/IJIIDS.2016.075424
International Journal of Intelligent Information and Database Systems, 2016, Vol. 9, No. 2, pp. 134-154
Received: 01 Jul 2014
Accepted: 05 Aug 2015
Published online: 22 Mar 2016