Semi-automatic construction of metadata from a series of web documents

Sachio Hirokawa, Eisuke Itoh, Tetsuhiro Miyahara

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    4 Citations (Scopus)


    Metadata plays an important role in discovering, collecting, extracting and aggregating Web data. This paper proposes a method of constructing metadata for a specific topic. The method uses Web pages that are located in a single site and are linked from a listing page. Pages describing recipes, real estate, used cars, hotels and syllabi are typical examples. We call such pages a series of Web documents. The pages in a series have the same appearance when a user views them with a browser, because they are often written with the same tag pattern. The method uses the tag pattern as the common structure of the Web pages. The individual contents of a page appear as plain texts embedded between two consecutive tags. If we remove the tags, the page becomes a sequence of plain texts. The plain texts in the same relative position can be interpreted as attribute values if we presume that the pages represent records of the same kind. Most of the plain texts in a given position vary from page to page. However, the same text may show up at the same relative position in almost all pages. These constant texts can be considered attribute names. “Location”, “Rating” and “Travel from Airport” are examples of such constant texts for pages of hotel information. If the frequency of a text is higher than a threshold, we accept it as a component of the metadata. If we mark a constant text with “N” and a variable text with “V”, the sequence of plain texts forms a series of N’s and V’s. A page in a series contains two kinds of NV sequence pattern. The first pattern is (NV)^n, which we call vertical, where an attribute value immediately follows its attribute name. The second pattern is N^n V^n, which we call horizontal, where n names occur in the first row and the same number of values follow in the next row. Using these patterns, we can understand the meaning of the values and construct records from a series of Web pages.
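
The steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The sample pages, the function names (`text_sequence`, `label_positions`, `build_record`) and the threshold of 0.8 are all assumptions made for illustration, and the sketch further assumes that every page in the series yields the same number of plain texts.

```python
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-empty plain texts found between tags, in order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts.append(text)


def text_sequence(html):
    """Remove the tags from one page and return its sequence of plain texts."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts


def label_positions(sequences, threshold=0.8):
    """Mark each relative position as constant ('N') or variable ('V').

    A position is 'N' when its most frequent text appears in a larger
    fraction of the pages than `threshold` (an assumed value).
    """
    labels, names = [], []
    for column in zip(*sequences):  # texts sharing one relative position
        text, count = Counter(column).most_common(1)[0]
        is_name = count / len(sequences) > threshold
        labels.append("N" if is_name else "V")
        names.append(text if is_name else None)
    return "".join(labels), names


def build_record(labels, names, sequence):
    """Pair names with values wherever a run of n N's is followed by n V's.

    n == 1 is one unit of the vertical pattern (NV)^n; n > 1 is the
    horizontal pattern N^n V^n. Other runs are skipped in this sketch.
    """
    record, i = {}, 0
    while i < len(labels):
        if labels[i] != "N":
            i += 1
            continue
        j = i
        while j < len(labels) and labels[j] == "N":
            j += 1
        k = j
        while k < len(labels) and labels[k] == "V":
            k += 1
        if j - i == k - j:
            for offset in range(j - i):
                record[names[i + offset]] = sequence[j + offset]
        i = k
    return record


def hotel_page(name, city, rating, single, double):
    """Hypothetical page template; every page shares the same tag pattern."""
    return (
        f"<html><h1>{name}</h1>"
        f"<p><b>Location</b><span>{city}</span></p>"
        f"<p><b>Rating</b><span>{rating}</span></p>"
        "<table><tr><th>Single</th><th>Double</th></tr>"
        f"<tr><td>{single}</td><td>{double}</td></tr></table></html>"
    )


PAGES = [
    hotel_page("Hotel Alpha", "Perth", "4 stars", "80", "120"),
    hotel_page("Hotel Beta", "Sydney", "3 stars", "70", "110"),
    hotel_page("Hotel Gamma", "Cairns", "5 stars", "95", "150"),
]

sequences = [text_sequence(page) for page in PAGES]
labels, names = label_positions(sequences)
records = [build_record(labels, names, seq) for seq in sequences]
```

For these sample pages the label string is `VNVNVNNVV`: a variable title, two vertical name/value pairs, and one horizontal block of two names followed by two values. Each entry of `records` is a dictionary keyed by the constant texts. Real pages rarely align this cleanly, so a practical version would first have to align texts by tag pattern rather than by raw position.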

    Original language: English
    Title of host publication: AI 2003
    Subtitle of host publication: Advances in Artificial Intelligence - 16th Australian Conference on AI, Proceedings
    Editors: Tamas D. Gedeon, Lance Chun Che Fung
    Publisher: Springer Verlag
    Number of pages: 12
    ISBN (Print): 9783540206460
    Publication status: Published - 2003
    Event: 16th Australian Conference on Artificial Intelligence, AI 2003 - Perth, Australia
    Duration: Dec 3 2003 to Dec 5 2003

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349


    Other: 16th Australian Conference on Artificial Intelligence, AI 2003

    All Science Journal Classification (ASJC) codes

    • Theoretical Computer Science
    • General Computer Science

