TY - GEN
T1 - Extraction of relevant components using shallow structure of HTML documents
AU - Zeng, Jun
AU - Flanagan, Brendan
AU - Sakai, Toshihiko
AU - Hirokawa, Sachio
PY - 2012
Y1 - 2012
N2 - As the number of web pages increases, searching for semi-structured documents is gaining greater attention. The traditional approach to extracting data from web page documents is to write specialized programs, called wrappers, that identify data of interest and map them to some suitable format. However, developing wrappers manually has many well-known shortcomings, mainly due to the difficulty of writing and maintaining them for continually changing web data. Moreover, no single wrapper program can handle all kinds of web pages. In this paper, we aim to extract relevant and meaningful snippets from as many web pages as possible, using the shallow feature of HTML documents to discover and analyze the relevant components. We also introduce a new feature called GAP and verify its effectiveness by conducting an SVM learning experiment.
AB - As the number of web pages increases, searching for semi-structured documents is gaining greater attention. The traditional approach to extracting data from web page documents is to write specialized programs, called wrappers, that identify data of interest and map them to some suitable format. However, developing wrappers manually has many well-known shortcomings, mainly due to the difficulty of writing and maintaining them for continually changing web data. Moreover, no single wrapper program can handle all kinds of web pages. In this paper, we aim to extract relevant and meaningful snippets from as many web pages as possible, using the shallow feature of HTML documents to discover and analyze the relevant components. We also introduce a new feature called GAP and verify its effectiveness by conducting an SVM learning experiment.
UR - https://www.scopus.com/pages/publications/84872946925
U2 - 10.1109/FSKD.2012.6234295
DO - 10.1109/FSKD.2012.6234295
M3 - Conference contribution
AN - SCOPUS:84872946925
SN - 9781467300223
T3 - Proceedings - 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2012
SP - 1186
EP - 1190
BT - Proceedings - 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2012
T2 - 2012 9th International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2012
Y2 - 29 May 2012 through 31 May 2012
ER -