Extraction of relevant components using shallow structure of HTML documents

Jun Zeng, Brendan Flanagan, Toshihiko Sakai, Sachio Hirokawa

    Research output: Contribution to book/report, conference contribution

    1 citation (Scopus)

    Abstract

    As the number of web pages increases, searching semi-structured documents is gaining greater attention. The traditional approach to extracting data from web page documents is to write specialized programs, called wrappers, that identify data of interest and map them to a suitable format. However, developing wrappers manually has many well-known shortcomings, mainly due to the difficulty of writing and maintaining them for continually changing web data. Moreover, no single wrapper program can handle all kinds of web pages. In this paper, we aim to extract relevant and meaningful snippets from as many web pages as possible, using the shallow features of HTML documents to discover and analyze the relevant components. We also introduce a new feature called GAP and verify its effectiveness by conducting an SVM learning experiment.
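
    To make the maintenance problem with hand-written wrappers concrete, the following is a minimal sketch using Python's standard `html.parser` module. It is not the authors' method and does not implement GAP or the SVM experiment; the class name `PriceWrapper`, the helper `extract_prices`, and both sample pages are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class PriceWrapper(HTMLParser):
    """Hypothetical hand-written wrapper: collects text inside <span class="price">."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # The extraction rule is hard-coded to one tag name and one class value.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    """Run the wrapper over one HTML string and return the extracted values."""
    parser = PriceWrapper()
    parser.feed(html)
    parser.close()
    return parser.prices


# Hypothetical page layout the wrapper was written for.
page_v1 = '<div><span class="price">$10</span><span class="price">$12</span></div>'
# The same data after a hypothetical redesign that changes the tag and class.
page_v2 = '<div><b class="cost">$10</b><b class="cost">$12</b></div>'

print(extract_prices(page_v1))  # ['$10', '$12']
print(extract_prices(page_v2))  # []
```

    The wrapper works on the layout it was written for and silently returns nothing once the markup changes, which is the kind of brittleness the abstract attributes to manually developed wrappers.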

    Original language: English
    Title of host publication: Proceedings - 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2012
    Pages: 1186-1190
    Number of pages: 5
    Publication status: Published - 2012
    Event: 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2012 - Chongqing, China
    Duration: May 29, 2012 to May 31, 2012

    Other

    Other: 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2012
    Country/Territory: China
    City: Chongqing
    Period: 5/29/12 to 5/31/12

    All Science Journal Classification (ASJC) codes

    • Control and Optimization
    • Logic
