Automated data extraction from the web with conditional models
Online publication date: Thu, 08-Dec-2005
by Xuan-Hieu Phan, Susumu Horiguchi, Tu-Bao Ho
International Journal of Business Intelligence and Data Mining (IJBIDM), Vol. 1, No. 2, 2005
Abstract: Extracting data from the Web is an important information extraction task. Most existing approaches rely on wrappers, which require human knowledge and user interaction during extraction. This paper proposes conditional models as an alternative solution to this task. Drawing on the strengths of conditional models such as maximum entropy and maximum entropy Markov models, our method offers three major advantages: full automation, the ability to incorporate various non-independent, overlapping features of different hypertext representations, and the ability to deal with missing and disordered data fields. Experimental results on a wide range of e-commerce websites with different layouts show that our method can achieve a satisfactory trade-off between automation and accuracy, and can provide a practical application of automated data extraction from the Web.