TY - GEN
T1 - Extraction of relevant snippets from web pages using hybrid features
AU - Zeng, Jun
AU - Wen, Junhao
AU - Xiong, Qingyu
AU - Hirokawa, Sachio
PY - 2012
Y1 - 2012
N2 - As the number of web pages increases, identifying and retrieving distinct content from the web has become increasingly difficult. The traditional approach to extracting data from web page documents is to analyze the DOM (Document Object Model) structure of an HTML page and find a common pattern. However, the number of possible DOM layout patterns is virtually infinite, which means that no common pattern can be used for all kinds of web pages. In this paper, we focus on the pages that are linked to a search engine and aim to analyze the features of relevant and meaningful content instead of a common pattern. Three features of relevant snippets are introduced: quantity of text, correlation between the snippet and the query submitted to the search engine, and HTML structure. Nine parameters are used to describe the three features. An SVM learning experiment is conducted to verify the effectiveness of the three features. The results show that the HTML structure feature is the most effective in determining whether a snippet is relevant.
AB - As the number of web pages increases, identifying and retrieving distinct content from the web has become increasingly difficult. The traditional approach to extracting data from web page documents is to analyze the DOM (Document Object Model) structure of an HTML page and find a common pattern. However, the number of possible DOM layout patterns is virtually infinite, which means that no common pattern can be used for all kinds of web pages. In this paper, we focus on the pages that are linked to a search engine and aim to analyze the features of relevant and meaningful content instead of a common pattern. Three features of relevant snippets are introduced: quantity of text, correlation between the snippet and the query submitted to the search engine, and HTML structure. Nine parameters are used to describe the three features. An SVM learning experiment is conducted to verify the effectiveness of the three features. The results show that the HTML structure feature is the most effective in determining whether a snippet is relevant.
UR - http://www.scopus.com/inward/record.url?scp=84870794135&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84870794135&partnerID=8YFLogxK
U2 - 10.1109/IIAI-AAI.2012.50
DO - 10.1109/IIAI-AAI.2012.50
M3 - Conference contribution
AN - SCOPUS:84870794135
SN - 9780769548265
T3 - Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012
SP - 209
EP - 213
BT - Proceedings of the 2012 IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012
T2 - 1st IIAI International Conference on Advanced Applied Informatics, IIAIAAI 2012
Y2 - 20 September 2012 through 22 September 2012
ER -