Automatic wrapper generation for multilingual web resources

Yasuhiro Yamada, Daisuke Ikeda, Sachio Hirokawa

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

7 Citations (Scopus)


We present a wrapper generation system that extracts the contents of semi-structured documents containing instances of a record. Wrappers are generated automatically, using only general assumptions about the structure of instances. The system outputs a set of pairs of left and right delimiters that surround the instances of each field. In addition to the input documents, the system receives a set of symbols with which a delimiter must begin or end. It treats semi-structured documents simply as strings, so it does not depend on any particular markup or natural language, and it requires no training examples showing where instances are located. We report experimental results on both static and dynamic pages gathered from 13 Web sites, marked up in HTML or XML and written in four natural languages. In addition to the usual content, the generated wrappers extract useful information hidden in comments or tags, which other wrapper generation algorithms ignore. Some generated delimiters contain whitespace or multibyte characters.
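To make the notion of a wrapper as "a set of pairs of left and right delimiters" concrete, the following is a minimal Python sketch of how such a wrapper might be applied to a document treated as a plain string. It does not reproduce the paper's generation algorithm: the delimiter pairs are written by hand here, and the function name `apply_wrapper`, the sample document, and the scanning strategy are all assumptions made for illustration.

```python
from typing import List, Tuple


def apply_wrapper(document: str,
                  delimiters: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Extract records from `document` using one (left, right) pair per field.

    Hypothetical helper: the document is scanned purely as a string, so the
    delimiters may contain tags, comment markers, whitespace, or multibyte
    characters. Scanning stops at the first record that cannot be completed.
    """
    if not delimiters or any(not left or not right for left, right in delimiters):
        raise ValueError("each field needs a non-empty left and right delimiter")

    records: List[Tuple[str, ...]] = []
    pos = 0
    while True:
        fields = []
        for left, right in delimiters:
            start = document.find(left, pos)
            if start == -1:
                return records          # no further instance of this field
            start += len(left)
            end = document.find(right, start)
            if end == -1:
                return records          # unterminated field: drop partial record
            fields.append(document[start:end])
            pos = end                   # resume scanning at the right delimiter
        records.append(tuple(fields))


# Hand-written sample input (an assumption, not data from the paper):
# the second field lives inside an HTML comment.
doc = ("<li><b>東京</b> <!-- id=13 --></li>"
       "<li><b>Fukuoka</b> <!-- id=40 --></li>")
wrapper = [("<b>", "</b>"), ("<!-- id=", " -->")]
result = apply_wrapper(doc, wrapper)
```

In this sketch, `result` holds one tuple per record, and the second field is recovered from inside a comment because the scan never interprets the markup. A delimiter ending in whitespace, such as `"<!-- id="` paired with `" -->"`, behaves like any other string.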

Original language: English
Title of host publication: Discovery Science - 5th International Conference, DS 2002, Proceedings
Editors: Steffen Lange, Ken Satoh, Carl H. Smith
Publisher: Springer Verlag
Number of pages: 8
ISBN (Print): 3540001883, 9783540001881
Publication status: Published - 2002
Event: 5th International Conference on Discovery Science, DS 2002 - Lubeck, Germany
Duration: Nov 24 2002 → Nov 26 2002

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Other: 5th International Conference on Discovery Science, DS 2002

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science

